Rafael Harth


Understanding Machine Learning


Existential Risk is a single category
To the extent that climate change might lead to increasing other x-risks, it's because it destroys capital (buildings next to the sea and fertile land) and society has to deal with that loss of capital.
The more economic growth we have, the easier it is for society to deal with loss of capital. An intervention that pays for a lower carbon footprint with lower economic growth might very well make things worse instead of better.

That seems to me to be another argument against the standard framing. If you look at "x-risk from climate change," you could accurately conclude that your intervention decreases x-risk from climate change – without realizing that it increases existential risk overall.

If you ask instead, "By how much does my intervention on climate change affect existential risk?" (I agree that using 'mitigate' was bad for the reasons you say), you could conclude that it leads to an increase because it stifles economic growth. Once again, the standard framing doesn't ask the right question.

In general, the new framing does not prevent you from isolating factors, it only prevents you from ignoring part of the effect of a factor.
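
To make the difference concrete, here is a toy calculation. Everything in it is an assumption made for illustration: the probabilities are invented, the two "channels" are treated as independent, and the names `total_risk`, `baseline`, and `with_intervention` are hypothetical. It only shows that an intervention can lower the climate-attributed number while raising the overall one.

```python
# Toy model: total existential risk as the chance that at least one
# independent "channel" ends in catastrophe. All numbers are invented
# purely for illustration; independence is a simplifying assumption.

def total_risk(risk_by_channel):
    """Probability that at least one independent channel ends in catastrophe."""
    survive = 1.0
    for p in risk_by_channel.values():
        survive *= 1.0 - p
    return 1.0 - survive

# Hypothetical baseline world.
baseline = {"climate": 0.010, "other": 0.100}

# Hypothetical intervention: halves the climate channel, but slower growth
# leaves society slightly worse at handling everything else.
with_intervention = {"climate": 0.005, "other": 0.110}

climate_delta = with_intervention["climate"] - baseline["climate"]
total_delta = total_risk(with_intervention) - total_risk(baseline)

print(f"change in climate channel: {climate_delta:+.4f}")
print(f"change in total risk:      {total_delta:+.4f}")
```

With these made-up inputs the climate channel goes down while the total goes up, which is exactly the case the "x-risk from climate change" question would miss.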

Existential Risk is a single category

Yes, in the sense that I think what you said describes how the views differ. It's not how I would justify the view; I think the fundamental reason why the classical view is inaccurate is that

Existential risk refers to the probability of a set of outcomes, and those outcomes are not defined in terms of their cause.

I.e., there is nothing in the definition of existential risk that Bostrom or anyone else gives that references the cause.

Proofs, Implications, and Models
What is a valid proof in algebra? It's a proof where, in each step, we do something that is universally allowed, something which can only produce true equations from true equations, and so the proof gradually transforms the starting equation into a final equation which must be true if the starting equation was true. Each step should also - this is part of what makes proofs useful in reasoning - be locally verifiable as allowed, by looking at only a small number of previous points, not the entire past history of the proof.

Note that this doesn't really solve the same problem that the quotes in the introduction are trying to solve. Proofs need to be valid and convincing. The 'valid' part is explained above. But the definition of the 'convincing' part is hidden in the phrase 'locally verifiable'. It is not clear what locally verifiable means, so this simply leaves that part of the definition open.

Put differently, you could still define

locally verifiable :⟺ the mathematicians who read it can correctly assess whether this step is valid

which is the social approach; or

locally verifiable :⟺ the mathematician who wrote it is capable of transforming the step into a sequence of steps that are valid according to a formal proof system

which is arguably the proper version of the symbolic approach. Both would fit with the explanation above.
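
As a rough sketch of what the symbolic reading could look like, here is a tiny checker for a made-up proof system. The representation (tuples for implications, a `cites` field listing earlier lines) and the names `check_step` and `check_proof` are assumptions for illustration, not any standard formalism. The point is only that each step is verified by looking at the few lines it cites, never at the whole proof.

```python
# Hypothetical mini proof system with two rules: "premise" and "mp"
# (modus ponens). A formula is a string atom or ("->", antecedent, consequent).
# A proof line is (formula, rule, cites), where cites lists earlier line indices.

def check_step(proof, i, premises):
    """Verify line i locally: only the lines it cites are inspected."""
    formula, rule, cites = proof[i]
    if any(j < 0 or j >= i for j in cites):
        return False  # a step may only cite strictly earlier lines
    if rule == "premise":
        return len(cites) == 0 and formula in premises
    if rule == "mp":
        if len(cites) != 2:
            return False
        antecedent = proof[cites[0]][0]
        implication = proof[cites[1]][0]
        # From A and A -> B, conclude B.
        return implication == ("->", antecedent, formula)
    return False  # unknown rule


def check_proof(proof, premises):
    """A proof is valid if every step passes its local check."""
    return all(check_step(proof, i, premises) for i in range(len(proof)))


premises = ["p", ("->", "p", "q"), ("->", "q", "r")]
good_proof = [
    ("p", "premise", ()),
    (("->", "p", "q"), "premise", ()),
    ("q", "mp", (0, 1)),
    (("->", "q", "r"), "premise", ()),
    ("r", "mp", (2, 3)),
]
# Same start, but the last step jumps to "r" without the needed implication.
bad_proof = good_proof[:2] + [("r", "mp", (0, 1))]

print(check_proof(good_proof, premises))
print(check_proof(bad_proof, premises))
```

Under the social reading, by contrast, no such checker exists; "locally verifiable" would be a fact about readers rather than about rules.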

(I'm not claiming that leaving this part open makes it a worthless definition, just that it can't be fairly compared to the opening quotes, as those are, in fact, trying to define the convincing part. They're attempts (not necessarily good ones) at solving a strictly harder problem.)

Existential Risk is a single category

With the "existential risk from X" framing, I've heard people say things like "climate change is not an existential risk, but it might contribute to other existential risks." Other people have talked about things like "second-order existential risks." This strikes me as fairly confused. In particular, to assess the expected impact of some intervention, you don't care about whether effects are first-order, second-order, or even less direct, but the "classical" view pushes you to regard them as qualitatively different things. Conversely, the framing "how does climate change contribute to existential risk" subsumes n-th order effects for all n.

Less abstractly, suppose you work on mitigating climate change and want to assess how much this influences existential risk. The question you care about is

  • By how much does my intervention on climate change mitigate existential risk?

This immediately leads to the follow-up question

  • How much does climate change contribute to existential risk?

Which is precisely the framing I suggest. Thus, it perfectly captures the thing we care about. Conversely, the classical framing "existential risk from climate change" would ask something analogous to

  • How likely are we to end up in a world where climate change is the easily recognized primary cause for the end of the world?

And this is simply not the right question.

Inner Alignment: Explain like I'm 12 Edition

This doesn’t seem like the bottleneck in many situations in practice. For example, a lot of young men feel like they want to have as much sex as possible, but not father as many kids as possible. I’m not sure exactly what the reason is, but I don’t think it’s the computational difficulty of representing having kids vs. having sex, because humans already build a world model containing the concept of “my kids”.

In this case, I would speculate that the kids objective wouldn't work that well because the reward is substantially delayed. The sex happens immediately, the kids only after 9 months. Humans tend to discount their future.

Also, how exactly would the kids objective even be implemented?

What else have people said on this subject? 

I believe that MIRI was aware of this problem for a long time, but that it didn't have the nice, comparatively non-confused and precise handle of "Inner Alignment" until Evan published the 'Risks from Learned Optimization' paper. But I'm not the right person to say anything else about this.

Do folks think that scenarios where we solve outer alignment most likely involve us not having to struggle much with inner alignment? Because fully solving outer alignment implies a lot of deep progress in alignment.

Probably not. I think Inner Alignment is, if anything, the harder problem. It strikes me as reasonably plausible that Debate is a proposal which solves outer alignment, but as very unlikely that it automatically solves Inner Alignment.

Conclusion to 'Reframing Impact'

Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

Attainable Utility Preservation: Scaling to Superhuman

(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting […] seems to sacrifice quite a lot of performance. Is this real or am I missing something?

Namely, whenever there's an action a which doesn't change the state and leads to 1 reward, and a sequence of actions a_1, …, a_n such that a_n has reward […] with […] (and all other a_i have 0 reward), then it's conceivable that […] would choose the sequence while […] would just stubbornly repeat a, even if the a_i represent something very tailored to […] that doesn't involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where […]. This feels like a rather big deal since we arguably want an agent to think long-term as long as it doesn't involve gaining power. I guess the scaling step might help here?

Separately and very speculatively, I'm wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn't be caught by your penalty since it's about an internal change. If so, that might be a sign that it'll be difficult to fix. More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

The "AI Dungeons" Dragon Model is heavily path dependent (testing GPT-3 on ethics)

Someone else said in a comment on LW that they think "custom" uses GPT-2, whereas using another setting and then editing the opening post will use GPT-3. I wanted to give them credit in response to your comment, but I can't find where they said it. (They still wouldn't get full points since they didn't realize custom would use GPT-3 after the first prompt.) I initially totally rejected the comment since it implies that all of the custom responses use GPT-2, which seemed quite hard to believe given how good some of them are.

Some of the Twitter responses sound quite annoyed with this, which is a sentiment I share. I thought that getting the AI to generate good responses was important at every step, but (if this is true and I understand it correctly) it doesn't matter at all after the first reply. That's a non-negligible amount of wasted effort.

Inner Alignment: Explain like I'm 12 Edition

Many thanks for taking the time to find errors.

I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.

I'm hesitant to change #4 before I fully understand why.

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

So, there are these two channels, input data and SGD. If the model's objective can only be modified by SGD, then (since SGD doesn't want to do super complex modifications), it is easier for SGD to create a pointer rather than duplicate the [model of the base objective] explicitly.

But the bolded part seemed like a necessary condition, and that's what I'm trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don't think I quite understand why this isn't plausible. If the [model of Base objective] and the [Mesa Objective] get modified simultaneously, I don't see any one step where this is harder than creating a pointer. You seem to need an argument for why [the model of the base objective] gets represented in full before the Mesa Objective is modified.

Edit: I slightly rephrased it to say

If we further assume that processing input data doesn't directly modify the model's objective (the Mesa Objective), or that its model of the Base Objective is created first,
