TurnTrout

Alex Turner, Oregon State University PhD student working on AI alignment.

Sequences

Reframing Impact

Becoming Stronger

Comments

Conclusion to 'Reframing Impact'

I'm very glad you enjoyed it! 

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

I'd say so, yes. 

Attainable Utility Preservation: Scaling to Superhuman

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. 

For optimal policies, yes. In practice, not always - in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!

it seems to penalize reasonable long-term thinking more than the formulas where .

Yeah. I'm also pretty sympathetic to arguments by Rohin and others that the  variant isn't quite right in general; maybe there's a better way to formalize "do the thing without gaining power to do it" wrt the agent's own goal.
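
For readers without the context: the AUP penalty under discussion has roughly this shape (glossing over the scaling and baseline details that this thread is actually about; see the sequence posts for the exact formulas):

$$\text{Penalty}(s, a) \;\propto\; \big|\, Q_{\text{aux}}(s, a) - Q_{\text{aux}}(s, \varnothing) \,\big|,$$

where $Q_{\text{aux}}$ is the attainable utility for an auxiliary reward function and $\varnothing$ is the no-op action.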

whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.

I think this is plausible, yep. This is why I think it's somewhat more likely than not there's no clean way to solve this; however, I haven't even thought very hard about how to solve the problem yet.

More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Depends on how that shows up in the non-embedded formalization, if at all. If it doesn't show up, then the optimal policy won't be able to predict any benefit and won't do it. If it does... I don't know. It might. I'd need to think about it more, because I feel confused about how exactly that would work - what its model of itself is, exactly, and so on. 

Three mental images from thinking about AGI debate & corrigibility

Maybe. What I was arguing was: just because all of the partial derivatives are 0 at a point, doesn't mean it isn't a saddle point. You have to check all of the directional derivatives; in two dimensions, there are uncountably infinitely many.
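
To make the saddle-point caveat concrete, here is a standard textbook example (mine, not quoted from anyone in this thread):

$$f(x, y) = x^2 - y^2, \qquad \frac{\partial f}{\partial x}(0,0) = \frac{\partial f}{\partial y}(0,0) = 0,$$

yet the origin is a saddle, not a valley floor: $f(x, 0) = x^2$ has a local minimum there, while $f(0, y) = -y^2$ has a local maximum.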

 Thus, I can prove to you that we are extremely unlikely to ever encounter a valley in real life:

  1. A valley must have a lowest point.
  2. For this point to be a local minimum, all of its directional derivatives must be 0:
    1. Direction N (north), AND
    2. Direction NE (north-east), AND
    3. Direction NNE, AND
    4. Direction NNNE, AND
    5. ...

This doesn't work because the directional derivatives aren't probabilistically independent in real life; you have to condition on the underlying geological processes, instead of supposing you're randomly drawing a topographic function from $\mathbb{R}^2$ to $\mathbb{R}$.
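
As a toy illustration of that non-independence point (a sketch I'm making up here, not a claim about actual geology): terrain sampled from any smooth generating process has local minima all over the place, even though the "uncountably many independent conditions" framing makes them sound vanishingly rare.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_terrain(n=100, k=8):
    """Random terrain built from a few low-frequency sinusoids -- a stand-in
    for 'underlying geological processes'."""
    x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    z = np.zeros_like(x)
    for _ in range(k):
        fx, fy = rng.uniform(1, 4, size=2)          # low spatial frequencies
        px, py = rng.uniform(0, 2 * np.pi, size=2)  # random phases
        z += rng.normal() * np.sin(2 * np.pi * fx * x + px) * np.sin(2 * np.pi * fy * y + py)
    return z

def count_local_minima(z):
    """Count interior grid points strictly lower than all 8 neighbours."""
    c = z[1:-1, 1:-1]
    neighbours = [z[:-2, 1:-1], z[2:, 1:-1], z[1:-1, :-2], z[1:-1, 2:],
                  z[:-2, :-2], z[:-2, 2:], z[2:, :-2], z[2:, 2:]]
    return int(np.all([c < nb for nb in neighbours], axis=0).sum())

print([count_local_minima(smooth_terrain()) for _ in range(10)])  # typically several per sample
```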

For the corrigibility argument to go through, I claim we need to consider more information about corrigibility in particular.

Three mental images from thinking about AGI debate & corrigibility

If S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can't measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn't about calculus, as far as I understand.

I'm not saying anything about an explicit representation of corrigibility. I'm saying the space of likely updates for an intent-corrigible system might form a "basin" with respect to our intuitive notion of corrigibility. 

I'm also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?

I said relatively low-dimensional! I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have. This doesn't necessarily mitigate your argument, but it seemed like an important refinement - we aren't considering corrigibility along all dimensions - just those along which updates are likely to take place.

"value drift" feels unusually natural from my perspective

I agree value drift might happen, but I'm somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal. 

Three mental images from thinking about AGI debate & corrigibility

With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)

I think the argument might be misleading in that local stability isn't that rare in practice, because we aren't drawing local stability independently across all possible directional derivatives around the proposed local minimum.

From my post ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations'.

Gradient updates or self-modification will probably fall into a few (relatively) low-dimensional subspaces (because most possible updates are bad, which is part of why learning is hard). A basin of corrigibility is then just this: for already-intent-corrigible agents, the space of likely gradient updates has local stability with respect to corrigibility.
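
One made-up way to operationalize "updates concentrate in a (relatively) low-dimensional subspace": stack a batch of update vectors and look at how quickly their singular-value spectrum decays.

```python
import numpy as np

def effective_rank(updates, energy=0.95):
    """Smallest number of singular directions capturing `energy` of the total
    squared norm of the stacked update vectors."""
    s = np.linalg.svd(updates, compute_uv=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(frac, energy) + 1)

# Toy check: 200 update vectors for a 10,000-parameter model, generated to lie
# near a 5-dimensional subspace plus a little noise.
rng = np.random.default_rng(0)
basis = rng.standard_normal((5, 10_000))
updates = rng.standard_normal((200, 5)) @ basis + 0.01 * rng.standard_normal((200, 10_000))
print(effective_rank(updates))  # ~5, despite nominally living in 10,000 dimensions
```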

Separately, I think the informal reasoning goes: you probably wouldn't take a pill that makes you slightly more willing to murder people. You should be particularly wary if you expect to be presented with even more pill-ingestion opportunities (a.k.a. algorithm modifications) later: after taking the first pill, you will be more willing to take further pills, because you will be more okay with the prospect of wanting to murder people. So, even offered a large immediate benefit, you should not take the pill. 

I think this argument is sound, for a wide range of goal-directed agents which can properly reason about their embedded agency. So, for your intuitive argument to survive this reductio ad absurdum, what is the disanalogy with corrigibility in this situation?

Perhaps the AI might not reason properly about embedded agency and accidentally jump out of the basin. Or, perhaps the basin is small and the AI won't land in it - corrigibility won't be important enough to avoid being traded away for other benefits.

Dealing with Curiosity-Stoppers

I really like this post. Before, I just knew that sometimes I "didn't feel like studying", and that was that. Silly, but that's the nature of a thoughtless mistake. Now, I have a specific concept and taxonomy for these failure modes, and you suggested good ways of combating them. Thanks for writing this!

Power as Easily Exploitable Opportunities

I mean, we already know about epilepsy. I would be surprised if there did not exist some way to disable a given person's brain, just by having them look at you. 

TurnTrout's shortform feed

If you measure death-badness from behind the veil of ignorance, you’d naively prioritize well-liked, famous people with large families.

What are you looking for in a Less Wrong post?

Usually I strong-upvote when I feel like a post made something click for me, or that it's very important and deserves more eyeballs. I weak-upvote well-written posts which taught me something new in a non-boring way. 

As an author, my model of this is also impoverished. I'm frequently surprised by posts getting more or less attention than I expected.

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

we already see that; we're constantly amazed by it, despite the generated texts having little meaning

But GPT-3 is only trained to minimize prediction loss, not to maximize reader response. GPT-N may be able to crowd-please if it's trained on approval, but I don't think that's what's currently happening.
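
To make the distinction concrete, a minimal sketch with hypothetical function names (not a description of anyone's actual training code):

```python
import torch.nn.functional as F

def prediction_loss(logits, target_tokens):
    """Next-token cross-entropy: the kind of objective GPT-3 is trained on."""
    # logits: (batch, seq_len, vocab_size); target_tokens: (batch, seq_len)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_tokens.reshape(-1))

def approval_loss(approval_scores):
    """Hypothetical approval training: score generated text by how much humans
    (or a learned approval model) like it, negated so that minimizing the loss
    maximizes approval."""
    return -approval_scores.mean()
```

The first objective only rewards matching the training distribution; only something like the second would directly select for crowd-pleasing.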
