G Gordon Worley III

Director of Research at PAISRI


Zen and Rationality


Formal Alignment

Map and Territory Cross-Posts

Phenomenological AI Alignment


Social Interfaces

This sort of framing has been useful to me. In particular, I often think in terms of what kind of interface I am offering for people to interact with me via the clothes I wear, the way I talk, etc., that is, what sort of options are salient to them. Although there can be issues with changing this interface if you are identified with it (the way you look is part of your personal identity rather than a fact of the world that you control), even without the ability to fluidly change your interface, just knowing it's there can be helpful for making sense of, say, why people treat you the way they do.

G Gordon Worley III's Shortform

Personality quizzes are fake frameworks that help us understand ourselves.

What-character-from-show-X-are-you quizzes, astrology, and personality categorization instruments (think Big-5, Myers-Briggs, Magic the Gathering colors, etc.) are perennially popular. I think a good question to ask is, why do humans seem to like this stuff so much that even fairly skeptical folks tend to object not to categorization itself but only to the categorization of a particular system being bad?

My stab at an answer: humans are really confused about themselves, and are interested in things that seem to have even a little explanatory power to help them become less confused about who they are. Metaphorically, this is like if we lived in a world without proper mirrors, and people got really excited about anything moderately reflective because it let them see themselves, if only a little.

On this view, these kinds of things, while perhaps not very scientific, are useful to folks because they help them understand themselves. This is not to say we can totally rehabilitate all such systems, since often they perform their categorization by mechanisms with very weak causal links that may not even rise above the level of noise (*cough* astrology *cough*), nor that we should be satisfied with personality assessments that involve lots of conflation and don't resolve much confusion, but on the whole we should be happy that these things exist because they help us see our psyches in the absence of proper mental mirrors.

(FWIW, I do think there is a way to polish your mind into a mirror that can see itself and that I have managed to do this to some extent, but that's a bit beside the point I want to make here.)

Dealing with Curiosity-Stoppers

Oh, I definitely experience this even with things that are not professional. For example, I'm not very musically talented, and haven't done much other than sing badly in the shower and at karaoke since I left high school. I have little things around I could play with to get better, like a harmonica, but I just don't have fun engaging in the play of exploring the instrument, and I think part of this is because I'm not already good enough to feel like I'm having fun. There's this kind of subtle way curiosity and play get stopped for me when it feels too hard or like my play will only pay off far down the road.

Maybe this is the right choice ultimately, but it's hard to know since curiosity and the related notion of play seem to be valuable in their own right much of the time.

AllAmericanBreakfast's Shortform

Yep. Having worked both as a mathematician and a programmer, the idea of objectivity and clear feedback loops starts to disappear as the complexity amps up and you move away from the learning environment. It's not unusual to discover incorrect proofs out on the fringes of mathematical research that have not yet become part of the canon, nor is it uncommon (in fact, it's very common) to find running production systems where the code works by accident due to some strange unexpected confluence of events.

The Fusion Power Generator Scenario

I'm somewhat hopeful that this is right, but I'm also not so confident that I feel like we can ignore the risks of GPT-N.

For example, this post makes the argument that, because of GPT's design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans because it's optimizing for imitating existing human writing, not saying true things. On the other hand, it's managing to do powerful things it wasn't trained for, like solve math equations we have no reason to believe it saw in the training set or write code it hasn't seen before. So even if GPT-N isn't trained to say true things and isn't really capable of more than humans are, it might still function like a Hansonian em and be dangerous by simply doing what humans can do, only much faster.

The Fusion Power Generator Scenario

I really like your examples in this post, and it made me think of a tangential but ultimately related issue.

I feel like there's long been something like two camps in the AI safety space: the people who think it's very hard to make AI safe and the people who think it's very very hard, like threading a needle from 10 miles away using hand-made binoculars and a long stick (yes, there's a third camp that thinks it will be easy, but they aren't really in the AI safety conversation due to selection effects). And I suspect some of this difference is in how much proposed example failure scenarios feel likely and realistic to them. Being myself in the latter camp, I sometimes find it hard to articulate why I think this, and often want better, more evocative examples. Thus I was happy to read your examples because I think they achieve a level of evocativeness that I at least often find hard to create.

Refining the Evolutionary Analogy to AI

For me your modified argument still hits home, because I think of the evolutionary argument as one for plausibility without saying much about likelihood beyond its being more than epsilon. That the causal mechanism for the discontinuity may be different than originally thought doesn't make the discontinuity go away, nor the possibility that a discontinuity might still arise.

Dealing with Curiosity-Stoppers

Perhaps a subclass of "fear of pain", I sometimes find my curiosity is stopped when I think about how much work it would be to get good at something. For example, let's say it's some new programming language or database I might work with, but I start looking and realize I'm going to have to spend what feels like a long time before I'm going to be able to do much with it, and even longer before I'll be as expert at it as I am at the things I'm already expert at.

It's much more tempting to learn something more about the things I'm already good at so I can get even better at them than to learn something totally new so I can still not be very good at it, but at least now slightly less bad.

Mati_Roy's Shortform

IANAL, but this sounds right to me. It's fine if, say, the police hide out at a shop that is tempting and easy to rob and encourage the owner not to make their shop less tempting or easy to rob so that it can function as a "honeypot" that lets them nab people in the act of committing crimes. On the other hand, although the courts often decide that it's not entrapment, undercover cops soliciting prostitutes or illegal drugs are much closer to being entrapment, because then the police are actively creating the demand for crime to supply.

Depending on how you feel about it, I'd say this suggests the main flaw in your idea, which is that it will be abused on the margin to catch people who otherwise would not have committed crimes, even if you try to circumscribe it such that the traps you can create are far from causing more marginal crime, because incentives will push for expansion of this power. At least, that would be the case in the US, because it already is.
