Beth Barnes


Writeup: Progress on AI Safety via Debate

That's correct about simultaneity.

Yeah, the questions and answers can be arbitrary, doesn't have to be X and ¬X.

I'm not completely sure whether Scott's method would work given how we're defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner, taken when they wrote the question, would answer questions about what they meant. So in this case, if you asked the questioner 'is your question equivalent to "should I eat potatoes tonight?"', they wouldn't know. On the other hand, you could ask them 'if I think you should eat potatoes tonight, is your question equivalent to "should I eat potatoes tonight?"'. I think this would work as long as you were referring only to what one debater believed you should eat tonight.

I feel fairly ok about this as a way to define the meaning of questions written by debaters within the debate. I'm less sure about how to define the top-level question. It seems like there's only really one question, which is 'what should I do?', and it's going to have to be defined by how the human asker clarifies their meaning. I'm not sure whether the meaning of the question should be allowed to include things the questioner doesn't know at the time of asking.

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

Yeah, I also thought this might just be true already, for similar reasons.

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

Of course GPT-3 isn't aligned: its objective is to output the most likely next word, i.e. to imitate text on the internet. It seems pretty certain that if you give it a prompt indicating that it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt indicating that it's imitating something where someone says something smart, it will "try" to say something smart. This question seems weird to me. Am I missing something?

Using vector fields to visualise preferences and make them consistent

You might find this paper interesting. It does a similar decomposition with the dynamics of differentiable games (where the 'preferences' for how to change your strategy may not be the gradient of any function).

"The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems."
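To make the quoted decomposition concrete, here is a rough sketch, assuming it amounts to splitting the Jacobian of the vector field into its symmetric and antisymmetric parts. The symmetric part behaves like the gradient of some function (the 'consistent' or potential-like component), and the antisymmetric part is the rotational component that can make preferences cycle. The function names and the example field are made up for illustration and are not taken from the paper.

```python
def jacobian(field, point, eps=1e-6):
    """Central-difference Jacobian: J[i][j] = d field_i / d x_j at `point`."""
    n = len(point)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(point), list(point)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = field(plus), field(minus)
        for i in range(n):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def decompose(jac):
    """Split J into a symmetric part S and an antisymmetric part A, with J = S + A."""
    n = len(jac)
    sym = [[(jac[i][j] + jac[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(jac[i][j] - jac[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


def example_field(p):
    """Hypothetical preference field: a pull towards the origin plus a rotation."""
    x, y = p
    return [-x - y, x - y]


J = jacobian(example_field, [0.3, -0.7])
S, A = decompose(J)
print("symmetric part (potential-like):", S)
print("antisymmetric part (rotational):", A)
```

For this example field the symmetric part is approximately minus the identity, which is the Hessian of the function -(x² + y²)/2, and the antisymmetric part is a pure rotation. If the antisymmetric part were zero everywhere, the field would be the gradient of a single function; if the symmetric part were zero, only the rotational component would remain.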