PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.

rohinmshah's Comments

How can Interpretability help Alignment?

Planned summary for the Alignment Newsletter:

Interpretability seems to be useful for a wide variety of AI alignment proposals. Presumably, different proposals require different kinds of interpretability. This post analyzes this question to allow researchers to prioritize across different kinds of interpretability research.

At a high level, interpretability can either make our current experiments more informative to help us answer _research questions_ (e.g. “when I set up a <@debate@>(@AI safety via debate@) in this particular way, does honesty win?”), or it could be used as part of an alignment technique to train AI systems. The former only have to be done once (to answer the question), and so we can spend a lot of effort on them, while the latter must be efficient in order to be competitive with other AI algorithms.

They then analyze how interpretability could apply to several alignment techniques, and come to several tentative conclusions. For example, they suggest that for recursive techniques like iterated amplification, we may want comparative interpretability that can explain the changes between models (e.g. between distillation steps, in iterated amplification). They also suggest that by having interpretability techniques that can be used by other ML models, we can regularize a trained model to be aligned, without requiring a human in the loop.

Planned opinion:

I like this general direction of thought, and hope that people continue to pursue it, especially since I think interpretability will be necessary for inner alignment. I think it would be easier to build on the ideas in this post if they were made more concrete.

Conclusion to 'Reframing Impact'

I don't know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me and I hadn't read your sequence (and so didn't already know what you were trying to say), I'd have given you a blank stare -- the closest thing I have to an interpretation is "be myopic / greedy", but that limits your AI system to the point of uselessness.

Like, "optimize for X" means "do stuff over a period of time such that X goes up as much as possible". "Becoming more able to optimize for X" means "do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have". The only difference between these two is actions that you can do for immediate reward.

(This is just saying in English what I was arguing for in the math comment.)
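One way to make the distinction concrete, assuming a standard discounted MDP. The symbols R, T, γ, V* and Q* are notation introduced here for illustration and may not match the math comment:

```latex
Q^*(s, a) \;=\;
  \underbrace{R(s, a)}_{\text{immediate reward}}
  \;+\;
  \gamma \, \underbrace{\mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[ V^*(s') \big]}_{\text{ability to get reward later}}
```

Under this reading, picking actions that raise the second term is "becoming more able to optimize for X". If that term is held fixed or penalized, the only thing left to distinguish actions is R(s, a), which gives the myopic / greedy policy.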

Conclusion to 'Reframing Impact'

Specific proposal.

If the conceptual version is "we keep A's power low", then that probably works.

If the conceptual version is "tell A to optimize R without becoming more able to optimize R", then I have the same objection.

Conclusion to 'Reframing Impact'

Some thoughts on this discussion:

1. Here's the conceptual comment and the math comment where I'm pessimistic about replacing the auxiliary set with the agent's own reward.

However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.


You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it's insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do what we want!

Hmm, I think you're misunderstanding Vika's point here (or at least, I think there is a different point, whether Vika was saying it or not). Here's the argument, spelled out in more detail:

1. Impact to an arbitrary agent is change in their AU.

2. Therefore, to prevent catastrophe via regularizing impact, we need to have an AI system that is penalized for changing a human's AU.

3. By assumption, the utility function U_A of the AI (A) is different from the utility function U_H of the human (H) (otherwise there wouldn't be any problem).

4. We need to ensure H can pursue U_H, but we're regularizing A pursuing U_A. Why should we expect the latter to cause the former to happen?

One possible reason is there's an underlying factor which is how much power A has, and as long as this is low it implies that any agent (including H) can pursue their own reward about as much as they could in A's absence (this is basically CCC). Then, if we believe that regularizing A pursuing U_A keeps A's power low, we would expect it also means that H remains able to pursue U_H. I don't really believe the premise there (unless you regularize so strongly that the agent does nothing).

Coherence arguments do not imply goal-directed behavior

I finally read Rational preference: Decision theory as a theory of practical rationality, and it basically has all of the technical content of this post; I'd recommend it as a more in-depth version of this post. (Unfortunately I don't remember who recommended it to me; whoever you are, thanks!) Some notable highlights:

It is, I think, very misleading to think of decision theory as telling you to maximize your expected utility. If you don't obey its axioms, then there is no utility function constructable for you to maximize the expected value of. If you do obey the axioms, then your expected utility is always maximized, so the advice is unnecessary. The advice, 'Maximize Expected Utility' misleadingly suggests that there is some quantity, definable and discoverable independent of the formal construction of your utility function, that you are supposed to be maximizing. That is why I am not going to dwell on the rational norm, Maximize Expected Utility! Instead, I will dwell on the rational norm, Attend to the Axioms!

Very much in the spirit of the parent comment.

Unfortunately, the Fine Individuation solution raises another problem, one that looks deeper than the original problems. The problem is that Fine Individuation threatens to trivialize the axioms.

(Fine Individuation is basically the same thing as moving from preferences-over-snapshots to preferences-over-universe-histories.)

All it means is that a person could not be convicted of intransitive preferences merely by discovering things about her practical preferences. [...] There is no possible behavior that could reveal an impractical preference

His solution is to ask people whether they were finely individuating, and if they weren't, then you can conclude they are inconsistent. This is kinda sorta acknowledging that you can't notice inconsistency from behavior ("practical preferences" aka "choices that could actually be made"), though that's a somewhat inaccurate summary.

There is no way that anyone could reveal intransitive preferences through her behavior. Suppose on one occasion she chooses X when the alternative was Y, on another she chooses Y when the alternative was Z, and on a third she chooses Z when the alternative was X. But that is nonsense; there is no saying that the Y she faced in the first occasion was the same as the Y she faced on the second. Those alternatives could not have been just the same, even leaving aside the possibility of individuating them by reference to what else could have been chosen. They will be alternatives at different times, and they will have other potentially significant differentia.

Basically making the same point with the same sort of construction as the OP.
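A minimal sketch of that construction, under assumptions that are not in the paper or the post: choices are recorded as hypothetical (occasion, chosen, rejected) triples, and "intransitive" is taken to mean a directed cycle in the strict revealed-preference graph. With coarse individuation the X/Y/Z example is a cycle; once each alternative is tagged with the occasion on which it was faced, the same observations cannot form one.

```python
def has_cycle(choices):
    """Return True if the strict-preference graph implied by
    (chosen, rejected) pairs contains a directed cycle."""
    graph = {}
    for chosen, rejected in choices:
        graph.setdefault(chosen, set()).add(rejected)
        graph.setdefault(rejected, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Hypothetical observations: (occasion, chosen, rejected).
observations = [(1, "X", "Y"), (2, "Y", "Z"), (3, "Z", "X")]

# Coarse individuation: "Y on occasion 1" and "Y on occasion 2" are one option.
coarse = [(chosen, rejected) for _, chosen, rejected in observations]

# Fine individuation: every alternative is tagged with its occasion.
fine = [((chosen, t), (rejected, t)) for t, chosen, rejected in observations]

coarse_cycle = has_cycle(coarse)
fine_cycle = has_cycle(fine)
print(coarse_cycle, fine_cycle)
```

Tagging by occasion is only one possible fine individuation; any tagging that makes the alternatives on different occasions distinct gives the same result.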

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel).

Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it's not clear how to apply that for humans.

(I only have this objection when trying to explain what "impact" means to humans; it seems fine in the RL setting. I do think we'll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.)

Also, under this inaction baseline, the roads are perpetually empty, and so you're always feeling impact from the fact that you can't zoom down the road at 120 mph, which seems wrong.

I agree that counterfactuals are hard, but I'm not sure that difficulty can be avoided. Your baseline of "what the human expected the agent to do" is also a counterfactual, since you need to model what would have happened if the world unfolded as expected.

Sorry, what I meant to imply was "baselines are counterfactuals, and counterfactuals are hard, so maybe no 'natural' baseline exists". I certainly agree that my baseline is a counterfactual.

On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn't need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.

Yes, that's my main point. I agree that there's no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don't always apply (even when interpreted by humans).

[AN #100]: What might go wrong if you learn a reward function while acting

Thank you for reading closely enough to notice the 5 characters used to mark the occasion :)

Reward functions and updating assumptions can hide a multitude of sins

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

2 Rigging learning or reward-maximisation?

I agree that you can cast any behavior as reward maximization with a complicated enough reward function. This does imply that you have to be careful with your prior / update rule when you specify an assistance game / CIRL game.
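A minimal sketch of why this holds, assuming rewards may depend on the full action history. The names `arbitrary_policy`, `make_reward` and `greedy_maximizer` are hypothetical and only for illustration: given any deterministic policy, define a reward of 1 for the action the policy would take and 0 otherwise. Every step can then earn the maximum reward of 1, so following the policy maximizes total reward.

```python
ACTIONS = ["left", "right", "stay"]


def arbitrary_policy(history):
    """Stand-in for 'any behavior': an arbitrary deterministic
    function from the action history to the next action."""
    index = (7 * len(history) + sum(len(a) for a in history)) % len(ACTIONS)
    return ACTIONS[index]


def make_reward(policy):
    """Build a history-dependent reward that the given policy maximizes."""
    def reward(history, action):
        return 1.0 if action == policy(history) else 0.0
    return reward


def greedy_maximizer(reward, history):
    """Pick the action with the highest reward at this step. Since a
    reward of 1 is available after every history, this is also optimal
    for the total reward over any horizon."""
    return max(ACTIONS, key=lambda a: reward(history, a))


reward = make_reward(arbitrary_policy)
history = ()
matches = []
for _ in range(10):
    chosen = greedy_maximizer(reward, history)
    matches.append(chosen == arbitrary_policy(history))
    history += (chosen,)

all_match = all(matches)
print(all_match)
```

The reward here is "complicated" in the sense that it encodes the whole policy, which is why the choice of prior and update rule over rewards matters.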

I'm not arguing "if you write down an assistance game you automatically get safety"; I'm arguing "if you have an optimal policy for some assistance game you shouldn't be worried about it rigging the learning process relative to the assistance game's prior". Of course, if the prior + update rule themselves lead to bad behavior, you're in trouble; but it doesn't seem like I should expect that to be via rigging as opposed to all the other ways reward maximization can go wrong.

3 AI believing false facts is bad

Tbc I agree with this and was never trying to argue against it.

4 Changing preferences or satisfying them

Thus there is no distinction between "compound" and "non-compound" rewards; we can't just exclude the first type. So saying a reward is "fixed" doesn't mean much.

I agree that updating on all reward functions under the assumption that humans are rational is going to be very strange and probably unsafe.

5 Humans learning new preferences

I agree this is a challenge that assistance games don't even come close to addressing.

6 The AI doesn't trick any part of itself

Your explanation in this section involves a compound reward function, instead of rigged learning process. I agree that these are problems; I was really just trying to make a point about rigged learning processes.

Reasons for Excitement about Impact of Impact Measure Research

The key point is that AUP_conceptual relaxes the problem:

If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?

This is not trivial.

Probably I'm just missing something, but I don't see why you couldn't say something similar about:

"preserve human autonomy", "be nice", "follow norms", "do what I mean", "be corrigible", "don't do anything I wouldn't do", "be obedient"


If we could robustly reward the agent for intuitively perceived nice actions (whatever that means), would that solve the problem?

It seems like the main difference for power in particular is that there's more hope that we could formalize power without reference to humans (which seems harder to do for e.g. "niceness"), but then my original point applies.

Reasons for Excitement about Impact of Impact Measure Research

has to do with easily exploitable opportunities in a given situation

Sorry, I don't understand what you mean here.

However, still note that this solution doesn't have anything to do with human values in particular.

I feel like I can still generate lots of solutions that have that property. For example, "preserve human autonomy", "be nice", "follow norms", "do what I mean", "be corrigible", "don't do anything I wouldn't do", "be obedient".

All of these depend on the AI having some knowledge about humans, but so does penalizing power.
