Three mental images from thinking about AGI debate & corrigibility
> The claims "If X drifts away from corrigibility along dimension {N}, it will get pulled back" are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument for all of them.

To be clear, I think there are two very different arguments here:

1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.

2) If we have an AGI that is partly corrigible, it will help us 'finish up' the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it's trying to do what we want.

The first is "corrigibility is a stable attractor", and I think there's structural similarity between arguments that different deviations will be corrected. The second is the "broad basin of corrigibility", where for any barely acceptable initial definition of "do what we want", it will figure out that "help us find the right definition of corrigibility and implement it" will score highly on its initial metric of "do what we want."

Like, it's not the argument that corrigibility is a stable attractor; it's an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it's 'broad' in.)

I find it less plausible that missing pieces in our definition of "do what we want" will be fixed in structurally similar ways, and I think there are probably a lot of traps where a plausible sketch definition doesn't automatically repair itself. One can lean here on "barely acceptable", but I don't find that very satisfying. [In particular, it would be nice if we had a definition of corrigibility where we could look at it and say "yep, that's the real deal or grows up to be the real deal," tho that likely requires knowing what the "real deal" is; the "broad basin" argument seems to me to be meaningful only in that it claims "something that grows into the real deal is easy to find instead of hard to find," and when I reword that claim as "there aren't any dead ends near the real deal" it seems less plausible.]

1. Why aren't the dimensions symmetric?

In physical space, generally things are symmetric between swapping the dimensions around; in algorithm-space, that isn't true. (Like, permute the weights in a layer and you get different functional behavior.) Thus while it's sort of wacky in a physical environment to say "oh yeah, df/dx, df/dy, and df/dz are all independently sampled from a distribution" it's less wacky to say that of neural network weights (or the appropriate medium-sized analog).
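A tiny sketch of that asymmetry (the network shape and all the weights below are made up for illustration): in a 2-2-1 network, swapping the two hidden units consistently in both layers leaves the function unchanged, while permuting only the first layer's weights changes the output.

```python
import math

# Illustrative weights for a 2-2-1 network: h_j = tanh(sum_i W1[j][i] * x[i]),
# output = sum_j W2[j] * h_j.
W1 = [[0.5, -1.0], [2.0, 0.3]]   # hidden-layer weights (one row per hidden unit)
W2 = [1.5, -0.7]                 # output weights

def forward(W1, W2, x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * hj for w, hj in zip(W2, h))

x = [1.0, -2.0]
base = forward(W1, W2, x)

# Swapping the two hidden units *consistently* (rows of W1 AND entries of W2)
# is a symmetry of weight-space: the function is unchanged.
same = forward(W1[::-1], W2[::-1], x)

# Swapping only the rows of W1 -- a coordinate permutation that ignores the
# pairing with W2 -- gives a functionally different network.
diff = forward(W1[::-1], W2, x)

print(abs(base - same) < 1e-12, abs(base - diff) > 1e-6)
```

So the dimensions of weight-space are not interchangeable the way spatial dimensions are; only very particular permutations are symmetries.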

> And by the way, how do these AGIs come up with the best argument for their side anyway? Don't they need to be doing good deliberation internally? If so, can't we just have one of them deliberate on the top-level question directly? Or if not, do the debaters spawn sub-debaters recursively, or something?

This is an arbitrary implementation detail, and one of the merits of debate is that it lets the computer figure this out instead of requiring that the AGI designer figure this out.

> e.g. I could argue against "1 + 1 = 2" by saying that it's an infinite conjunction of "1 + 1 != 3" AND "1 + 1 != 4" AND ... and so it can't possibly be true.

Uh, when I learned addition (in the foundation-of-mathematics sense) the fact that 2 was the only possible result of 1+1 was a big part of what made it addition / made addition useful.

There's a huge structural similarity between the proofs that '1 + 1 != 3' and that '1 + 1 != 4'; like, both are generic instances of the class '1 + 1 != n \forall n != 2'. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).
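The generic claim is cheap to check mechanically; a quick sketch over a few of those moduli:

```python
# In Z/m, 1 + 1 equals 2 and nothing else, no matter how many other
# numbers the modulus makes available -- the single generic proof covers
# every n != 2 at once.
for m in (4, 8, 16, 32):
    value = (1 + 1) % m
    assert value == 2
    assert all(value != n for n in range(m) if n != 2)
print("1 + 1 == 2 (and != everything else) in Z/4, Z/8, Z/16, Z/32")
```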

But if instead I make a claim of the form "I am the only person who uses the name 'Vaniver'", we don't have the same sort of structural similarity, and we do have to check the names of everyone else, and the more people there are, the less plausible the claim becomes.

Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the 'attractor,' if we think the dimensions aren't all symmetric. (If there's only gravity, for example, we seem in a better position to end up with attractors than if there's a random force field, even in 4d, 8d, 16d, etc.; similarly if there's a random potential function whose derivative is used to compute the forces.)

> I think the argument might be misleading in that local stability isn't that rare in practice

Surely this depends on the number of dimensions, with local stability being rarer the more dimensions you have. [Hence the argument that, in the infinite-dimensional limit, everything that would have been a "local minimum" is instead a saddle point.]

> To say "corrigibility is a broad basin of attraction", you need ALL of the following to be true:

At some point, either in an in-person conversation or a post, Paul clarified that obviously it will be 'narrow' in some dimensions and 'broad' in others. I think it's not obvious how the geometric intuition goes here, and this question mostly hinges on "if you have some parts of corrigibility, do you get the other parts?", to which I think "no" and Paul seems to think "yes." [He might think some limited version of that, like "it makes it easier to get the other parts," which I still don't buy yet.]

Unifying the Simulacra Definitions
> Can somebody enlighten me as to how the "we live in the Matrix, and it's inescapable" perspective might be reasonable, just on an everyday lived-experience level?

A lot of this depends on "what matters" to you; how much is lunch about the food, versus who you're eating it with, versus what you're talking about? It seems quite easy for someone to be in a realm where everything that matters to them relates to social reality instead of physical reality. (This is mostly because the aspects of physical reality that do matter to them have had their variance reduced significantly, or have been declared not to matter.)

Much of the pandemic response, for example, makes sense when you think of people acting in social reality, and little sense when you think of people as acting in physical reality.

And, particularly if you're ambitious, you might care a lot about what your era rewards. Military success? New discoveries? Building new systems? Careful and patient accumulation of capital? Timely decisions? Influence accumulation and deployment? Clever argumentation?

On that front, I feel confused about whether our era is good or bad. It still seems like we're in one of the better eras for ambition being achieved through new discoveries / building new systems, but also complaints seem valid that startups are too much about marketing and getting money from investors instead of building a great product, and more broadly that we only have progress in bits instead of atoms.

Rereading Atlas Shrugged
> Ayn Rand wrote a ton of material on concept-formation: some of it is in ITOE, and some of it is scattered amongst essays on other topics. For example, her essay "The Art of Smearing" opens by examining the use of the flawed concept "extremism" by certain political groups to attack their opponents, and then opens out into a discussion of the formation of "anti-concepts" in general, and their effects on cognition. She has several essays of a similar nature.

My prediction, having read a few of these, is that I will agree with them more than I disagree with them; when she points to someone making an error, at least 90% of the time they'll actually be making an error. I think the phrase 'anti-pattern' is more common on LW than 'anti-concept', but they seem overall the same and to have similar usages. (37 Ways That Words Can Be Wrong feels like a good similar example from LW.)

That said, there's a somewhat complicated point here that was hammered home for me by thinking about causal reasoning. Specifically, humans are pretty good at intuitive causal reasoning, and so philosophers discussing the differences between causal decision theory and evidential decision theory found it easy to compute 'what CDT would say' for a particular situation, by checking what their intuitive causal sense of the situation was.

But some situations are very complicated; see figure 1 of this paper, for example. In order to do causal reasoning in an environment like that, it helps if it's 'math so simple a computer could do it,' which involves really getting to the heart of the thing and finding the simple core.

From what I can tell, the Objectivists are going after the right sort of thing (the point of concepts is to help with reasoning to achieve practical ends in the real world, i.e. rationalists should win and beliefs should pay rent in anticipated experience), and so I'm unlikely to actually uncover any fundamental disagreements in goals. [Even on the probabilistic front, you could go from Peikoff's "knowledge is contextual" to a set-theoretic definition of probability and end up Bayesian-ish.]

But it feels to me like... it should be easy to summarize, in some way? Or, like, the 'LW view' has a lot of "things it's against" (the whole focus on heuristics and biases seems important here), and "things it's for" ('beliefs should pay rent' feels like potentially a decent summary here), and it feels like it has a clear view of both of them. I feel like the Objectivist epistemology is less clear about the "things it's for", potentially obscured by being motivated mostly by the "things it's against." Like, I think LW gets a lot of its value by trying to get down to the level of "we could program a computer to do it," in a way that requires peering inside some cognitive modules and algorithms that Rand could assume her audience had.

> Compare with examples in biology, where there is no confusion over whether, say, a duck-billed platypus is a bird or a mammal.

Tho I do think the history of taxonomy in biology includes many examples of confused concepts and uncertain boundaries. Even the question of "what things are alive?" runs into perverse corner cases.

Irrational Resistance to Business Success
> Merely because they are complaining or for some other reason?

The complaints are the smoke, but there are some hints elsewhere as well. Again, the letter-writer's telling of the first big meeting about it:

> The chief concern when we told the big wigs we were going to this, was that the cost of freight would go up because our transfer batch sizes would get small. I told them correct but we would stop shipping product back and forth between distribution centers and repacking of product would be almost non-existant. YTD: Our total freight dollars spent is 10% less than the previous year but they look at $/ lb of freight which has gone up. I know this is wrong, they state they know it is wrong, but it still gets measured and used for evaluations.

Whose evaluations? My guess would be the people who are in charge of moving things around--the distribution centers--instead of the people who are in charge of making things--the plant manager, who is the writer of this letter.

From my reading of this, it sounds like when they made the tradeoff they didn't get agreement that the new methodology required new standards, and that total freight dollars would be the judge instead of $/lb of freight.

If they changed the metric to something like $ spent on freight / $ of product delivered to customers, they should have enough data to backcalculate the metric (so they can fairly use it going forward) and it should focus the relevant managers on something more relevant to the overall business. [A ratio is still not quite right--now that manager is opposed to many small transactions or heavy items, because the revenue ratio will go down even if total profit goes up--but it's still better than baking volume discounts into someone's KPI!]
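A sketch with made-up dollar figures (the letter only gives percentages) showing how the old and proposed metrics can move in opposite directions:

```python
# Illustrative numbers only -- chosen so total freight spend falls 10% while
# pounds shipped fall more, since the back-and-forth shipping between DCs stopped.
before = {"freight_dollars": 1_000_000, "lbs_shipped": 2_500_000, "revenue": 10_000_000}
after  = {"freight_dollars":   900_000, "lbs_shipped": 1_800_000, "revenue": 10_000_000}

def per_lb(x):   # the old KPI: $ of freight per lb shipped
    return x["freight_dollars"] / x["lbs_shipped"]

def per_rev(x):  # the proposed metric: $ of freight per $ of product delivered
    return x["freight_dollars"] / x["revenue"]

# Total freight spend fell 10%...
assert after["freight_dollars"] == 0.9 * before["freight_dollars"]
# ...but the old KPI got *worse*, because fewer pounds were moved:
print(per_lb(before), per_lb(after))    # 0.4 vs 0.5
# The ratio metric correctly registers the improvement:
print(per_rev(before), per_rev(after))  # 0.1 vs 0.09
```

The point is just that a denominator the change was designed to shrink (pounds moved) guarantees the metric will punish the change, regardless of what happens to total cost.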


At this point, I read Eli's response, which takes a different tack on the tactical level (he knows more about how DCs get judged) but seems broadly the same to me on the strategic level. The DCs are complaining, and they might not know how to solve their problem or what it even is, but they are damn sure there is a problem. And so you apply ToC at the higher level and say "ok, time to invest my attention into the constraint that is now tight."

A major component of ToC is that you need system-level thinking in order to solve system-level problems, because most of the effects of decisions are invisible instead of visible. How, then, could it possibly make sense to say "well, it's not in my area; I'm hitting my metrics!" or "well, they can't explain it, so clearly it's not real"?

> The DCs were unable to substantively identify any problem that was created for them. And they spent 9 months refusing to use measurements or evidence to address this matter, in addition to failing to explain any cause-and-effect logic about what the problem they're now facing is and how it was caused by the change in production.

The letter-writer has an ability to see things other people in the business can't, because he has this mode of system-level thinking and they don't. That they can't articulate the problems they're facing or solve them shouldn't be a surprise, and is not a reasonable expectation to have of them now. Hence Eli's suggestion, that the company spend to get the DCs access to this new mode of thinking so that they can solve these problems, and others.


Sometimes there are just obnoxious people who are in the way, who insist that the company spend a million dollars to save ten thousand dollars. If they're below you, correct them or fire them. If those people are sideways to you, you need to solve their problems for them, or get your management to support you over them (which requires focusing on the numbers that management cares about, instead of just the numbers you care about). If these people are above you in the chain, like a CEO who cares a lot about a metric despite the more detailed causal model suggesting the metric is mispriced or confused, the first step is to convince them, and then when that fails, one should move to a company that is interested in using the right metrics, or ignore it as a cost of doing business.

> What insight can LW bring to this problem of negative response to rational improvement?

Some statements carry with them implied worldviews, and then there will be disagreement over the worldview that unhelpfully masquerades as disagreement about the statement. It takes some clever seeing to surface those subterranean issues and handle them. For example, presumably the people causing trouble for our letter-writer aren't against "rational improvement" because it is "rational improvement"; they're against it because it makes them worse off in some way, and that's the battle to try to fight. [Simply knowing that something is a Kaldor-Hicks improvement doesn't mean you should do it, or that you should expect no opposition to it!]

In this situation, it sounds like the problem is that improvement for the plant came at cost for the DCs--not necessarily financial, but perhaps in terms of them now having to do more work than they used to, or their KPIs (key performance indicators) being harder to hit.

One solution is to apply ToC one level up. In the old mode of operation, the plant is the tight constraint, and the DC is slack; in the new mode, the plant is now a slack constraint, and the DC is tight. Maybe upper management needs to give the DCs additional resources in order to handle the new demand; maybe the DCs need to apply ToC to be able to better handle the new demand; maybe the DCs are actually operating fine, but need their environment shifted (the level of the KPIs adjusted to the new normal, or which variables are measured adjusted, or so on).

My guess is that their best bet would have been to somehow take on the risk when the concern was raised in the first meeting ("when we told the big wigs") or get an explicit agreement to shift the measurement criterion to the one that made sense.


Thanks for the links!

> Basically -- the element of Objectivist philosophy that is by far and away the most useful is the epistemology.

I find this interesting, since I think epistemology is one of the most well-developed parts of the "LW view." If Objectivism has something to add, we should definitely incorporate it; if Objectivism has a major challenge for it, we should definitely address it.

I have memories of reading IToE, or at least leafing through it, in my college days when I hung out with a bunch of Objectivists, but I think this was before I read The Sequences, and so I don't think I ever made an explicit comparison between them. I do remember being less impressed with it than I was with Epistemology and the Psychology of Human Judgment, which I definitely had read then.

Reading the linked first chapter of IToE, and bouncing around the Epistemology section of the Ayn Rand Lexicon, I haven't yet found something that stood out as "beyond" LW but did find a few things that seemed to me to be "behind" LW. As an example of something that didn't seem 'beyond' LW, Rand's Razor seems like a useful habit for dispensing with perverse concepts like grue and bleen, but Follow the Improbability feels like the better version of it. Like, how is 'necessity' measured? How is the multiplication of concepts measured? The LW-style Bayesianism has quantitative answers, there, but my sense is the Objectivist writers mostly don't.

As for things that seem behind LW, the section on Chance is more of a rejection of the field than a study of it, and the section on certainty reminded me of this Slate Star Codex article, which I think lays out why Aristotelianism is a really unsatisfactory base to build off of, instead of a Bayesian base.

I'm curious what you think about that, or where I should be looking further.
