I'm Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at:


Draft report on AI timelines

General feedback: my belief is that brain algorithms and today's deep learning models are different types of algorithms, and therefore regardless of whether TAI winds up looking like the former or the latter (or something else entirely), this type of exercise (i.e. where you match the two up along some axis) is not likely to be all that meaningful.

Having said that, I don't think the information value is literally zero, I see why someone pretty much has to do this kind of analysis, and so, might as well do the best job possible. This is a very impressive effort and I applaud it, even though I'm not personally updating on it to any appreciable extent.

Draft report on AI timelines

Let me try again. Maybe this will be clearer.

The paradigm of the brain is online learning. There are a "small" number of adjustable parameters on how the process is set up, and then each run is long—a billion subjective seconds. And during the run there are a "large" number of adjustable parameters that get adjusted. Almost all the information content comes within a single run.

The paradigm of today's popular ML approaches is train-then-infer. There are a "large" number of adjustable parameters, which are adjusted over the course of an extremely large number of extremely short runs. Almost all the information content comes from the training process, not within the run. Meanwhile, sometimes people do multiple model-training runs with different hyperparameters—hyperparameters are a "small" number of adjustable parameters that sit outside the gradient-descent training loop.

I think the appropriate analogy is:

  • (A) Brain: One (billion-subjective-second) run ↔ ML: One gradient-descent model training
  • (B) Brain: Adjustable parameters on the genome ↔ ML: Hyperparameters
  • (C) Brain: Settings of synapses (or potential synapses) in a particular adult ↔ ML: parameter settings of a fully-trained model

This seems to work reasonably well all around: (A) takes a long time and involves a lot of information content in the developed "intelligence", (B) is a handful of (perhaps human-interpretable) parameters, (C) is the final "intelligence" that you wind up wanting to deploy.
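A minimal sketch of the proposed analogy, on a toy problem. Everything here is an illustrative assumption (the quadratic loss, the function names `train_once` and `outer_search`, the candidate learning rates); it only shows the shape of the nesting: a handful of hyperparameters in an outer loop (B), one long gradient-descent run per setting (A), and the trained weights as the thing you would deploy (C).

```python
import random


def train_once(learning_rate, steps=200, seed=0):
    """Level (A): one whole training run that adjusts many learnable parameters.

    Returns the trained weights (level C) and the final loss.
    """
    rng = random.Random(seed)
    n_params = 50  # the "large" set of adjustable parameters (toy-sized here)
    target = [rng.uniform(-1.0, 1.0) for _ in range(n_params)]
    weights = [0.0] * n_params
    for _ in range(steps):
        # Gradient of 0.5 * sum((w - t)^2) with respect to each weight is (w - t).
        weights = [w - learning_rate * (w - t) for w, t in zip(weights, target)]
    loss = 0.5 * sum((w - t) ** 2 for w, t in zip(weights, target))
    return weights, loss


def outer_search(candidate_learning_rates):
    """Level (B): a "small" number of hyperparameters, tried over a few whole runs."""
    results = []
    for lr in candidate_learning_rates:
        weights, loss = train_once(lr)
        results.append((loss, lr, weights))
    best_loss, best_lr, best_weights = min(results, key=lambda r: r[0])
    return best_lr, best_loss, best_weights, len(results)


best_lr, best_loss, best_weights, n_runs = outer_search([0.001, 0.01, 0.1])
print(f"{n_runs} runs, best learning rate {best_lr}, final loss {best_loss:.2e}")
```

In this toy version the outer loop runs three times while each inner run takes 200 gradient steps on 50 parameters; the analogy puts the genome in the role of the outer loop's arguments and a lifetime of learning in the role of `train_once`.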

So again I would analogize one run of the online-learning paradigm with one training of today's popular ML approaches. Then I would try to guess how many runs of online-learning you need, and I would guess 10-100, not based on anything in particular, but you can get a better number by looking into the extent to which people need to play with hyperparameters in their ML training, which is "not much if it's very important not to".

Sure, you can do a boil-the-oceans automated hyperparameter search, but the biggest projects have no compute to spare, so they can't do that. Instead, you sit and think about the hyperparameters, you do smaller-scale studies, you try to carefully diagnose the results of each training run, etc. etc. Like, I believe the GPT-3 team only did one training run of their largest model; they worked hard to figure out good hyperparameter settings by extrapolating from smaller studies.

...Whereas it seems that the report is doing a different analogy:

  • (A) Brain: One (billion-subjective-second) run ↔ ML: One run during training (one play of an Atari game etc.)
  • (B) Brain: Adjustable parameters on the genome ↔ ML: Learnable parameters in the model
  • (C) Brain: Many (billion-subjective-second) runs ↔ ML: One model-training session

I think that analogy is much worse than the one I proposed. You're mixing short tests with long-calculations-that-involve-a-ton-of-learning, you're mixing human tweaking of understandable parameters with gradient descent, etc.

To be clear, I don't think my proposed analogy is perfect, because I think that brain algorithms are rather different than today's ML algorithms. But I think it's a lot better than what's there now, and maybe it's the best you can do without getting into highly speculative and controversial inside-view-about-brain-algorithms stuff.

I could be wrong or confused :-)

Draft report on AI timelines

I'm not seeing the merit of the genome anchor. I see how it would make sense if humans didn't learn anything over the course of their lifetime. Then all the inference-time algorithmic complexity would come from the genome, and you would need your ML process to search over a space of models that can express that complexity. However, needless to say, humans do learn things over the course of their lifetime! I feel even more strongly about that than most, but I imagine we can all agree that the inference-time algorithmic complexity of an adult brain is not limited by what's in the genome, but rather also incorporates information from self-supervised learning etc.

The opposite perspective would say: the analogy isn't between the ML trained model and the genome, but rather between the ML learning algorithm and the genome on one level, and between the ML trained model and the synapses at the other level. So, something like ML parameter count = synapse count, and meanwhile the genome size would correspond to "how complicated is the architecture and learning algorithm?" (like, add up the algorithmic complexity of backprop plus dropout regularization plus BatchNorm plus data augmentation plus Xavier initialization etc. etc.). Or something like that.

I think the truth is somewhere in between, but a lot closer to the synapse-anchor side (that ignores instincts) than the genome-anchor side (that ignores learning).

Sorry if I'm misunderstanding or missing something, or confused.

UPDATE: Or are we supposed to imagine an RNN wherein the genomic information corresponds to the weights, and the synapse information corresponds to the hidden state activations? If so, I didn't think you could design an RNN (of the type typically used today) where the hidden state activations have many orders of magnitude more information content than the weights. Usually there are more weights than hidden state activations, right?
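A quick back-of-the-envelope check of that last point, as a sketch. It assumes a plain (vanilla) RNN cell with one hidden-to-hidden matrix, one input-to-hidden matrix, and a bias vector; the function name and the sizes are made up for illustration, and gated cells like LSTMs would only add more weights per hidden unit.

```python
def vanilla_rnn_counts(hidden_size, input_size):
    """Count learnable weights vs. hidden-state activations in one vanilla RNN cell."""
    weights = (
        hidden_size * hidden_size    # hidden-to-hidden matrix
        + hidden_size * input_size   # input-to-hidden matrix
        + hidden_size                # bias vector
    )
    hidden_state = hidden_size       # one activation per hidden unit
    return weights, hidden_state


weights, state = vanilla_rnn_counts(hidden_size=1024, input_size=256)
print(weights, state, weights // state)
```

Under these assumptions the ratio of weights to hidden-state activations is `hidden_size + input_size + 1`, so the weights outnumber the activations by roughly the width of the layer, which is the opposite of the genome-to-synapse ratio.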

UPDATE 2: See my reply to this comment.

Why GPT wants to mesa-optimize & how we might change this

I think the Transformer is successful in part because it tends to solve problems by considering multiple possibilities, processing them in parallel, and picking the one that looks best. (Selection-type optimization.) If you train it on text prediction, that's part of how it will do text prediction. If you train it on a different domain, that's part of how it will solve problems in that domain too.

I don't think GPT builds a "mesa-optimization infrastructure" and then applies that infrastructure to language modeling. I don't think it needs to. I think the Transformer architecture is already raring to go forth and mesa-optimize, as soon as you give it any optimization pressure to do so.

So anyway your question is: can it display foresight / planning in a different domain without being trained in that domain? I would say, "yeah probably, because practically every domain is instrumentally useful for text prediction". So somewhere in GPT-3's billions of parameters I think there's code to consider multiple possibilities, process them in parallel, and pick the best answer, in response to the question of "What will happen next when you put a sock in a blender?" or "What is the best way to fix an oil leak?" (not just those literal words as a question, but the concepts behind them, however they're invoked).

(Having said that, I don't think GPT-3 specifically will do side-channel attacks, but for other, unrelated reasons that are off-topic. Namely, I don't think it is capable of making the series of new insights required to develop an understanding of itself and its situation and then take appropriate actions. That's based on my speculations here.)

Why GPT wants to mesa-optimize & how we might change this

Suppose I said (and I actually believe something like this is true):

"GPT often considers multiple possibilities in parallel for where the text is heading—including both where it's heading in the short-term (is this sentence going to end with a prepositional phrase or is it going to turn into a question?) and where it's heading in the long-term (will the story have a happy ending or a sad ending?)—and it calculates which of those possibilities are most likely in light of the text so far. It chooses the most likely next word in light of this larger context it figured out about where the text is heading."

If that's correct, would you call GPT a mesa-optimizer?

Why GPT wants to mesa-optimize & how we might change this

In this instance, GPT has an incentive to do internal lookahead. But it's unclear how frequently these situations actually arise

I'm going with "very frequently, perhaps universally". An example I came up with here was choosing "a" vs "an" which depends on the next word.

I think writing many, maybe most, sentences, requires some idea of how the sentence structure is going to be laid out, and that "idea" extends beyond the next token. Ditto at the paragraph level etc.

So I think it already does lookahead in effect, but I don't think it does it by "beam search" per se. I think it's more like "using concepts that extend over many tokens", concepts like "this sentence has the following overall cadence..." and "this sentence conveys the following overall idea..." and "we're in the middle of writing out this particular idiomatic phrase". The training simultaneously incentivizes both finding the right extended concepts for where you're at in the text, and choosing a good word in light of that context.

My computational framework for the brain

Where is "human values" in this model

Well, all the models in the frontal lobe get, let's call it, reward-prediction points (see my comment here), which feels like positive vibes or something.

If the generative model "I eat a cookie" has lots of reward-prediction points (including the model itself and the downstream models that get activated by it in turn), we describe that as "I want to eat a cookie".

Likewise, if the generative model "Michael Jackson" has lots of reward-prediction points, we describe that as "I like Michael Jackson. He's a great guy.".

If somebody says that justice is one of their values, I think it's at least partly (and maybe primarily) up a level in meta-cognition. It's not just that there's a generative model "justice" and it has lots of reward-prediction points ("justice is good"), but there's also a generative model of yourself valuing justice, and that has lots of reward-prediction points too. That feels like "When I think of myself as the kind of person who values justice, it's a pleasing thought", and "When I imagine other people saying that I'm a person who values justice, it's a pleasing thought".

This isn't really answering your question of what human values are or should be—this is me saying a little bit about what happens behind the scenes when you ask someone "What are your values?". Maybe they're related, or maybe not. This is a philosophy question. I don't know.

If cortical algorithm will be replaced with GPT-N in some human mind model, will the whole system work?

My belief (see post here) is that GPT-N is running a different kind of algorithm, but learning to imitate some steps of the brain algorithm (including neocortex and subcortex and the models that result from a lifetime of experience, and even hormones, body, etc.; after all, the next-token-prediction task is the whole input-output profile, not just the neocortex) in a deep but limited way. I can't think of a way to do what you suggest, but who knows.

My computational framework for the brain

Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year.

Thanks so much, that really means a lot!!

...ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

I agree with "theories/frameworks relatively scarce". I don't feel like I have multiple gears-level models of how the brain might work, and I'm trying to figure out which one is right. I feel like I have zero, and I'm trying to grope my way towards one. It's almost more like deconfusion.

I mean, what are the alternatives?

Alternative 1: The brain is modular and super-complicated

Let's take all those papers that say: "Let's just pick some task and try to explain how adult brains do it based on fMRI and lesion studies", and it ends up being some complicated vague story like "region 37 breaks down the sounds into phonemes and region 93 helps with semantics but oh it's also involved in memory and ...". It's not a gears-level model at all!

So maybe the implicit story is "the brain is doing a complicated calculation, and it is impossible with the tools we have to figure out how it works in a way that really bridges from neurons to algorithms to behavior". I mean, a priori, that could be the answer! In which case, people proposing simple-ish gears-level models would all be wrong, because no such model exists!

Going back to the analogy from my comment yesterday...

In a parallel universe without ML, the aliens drop a mysterious package from the sky with a fully-trained ImageNet classifier. Scientists around the world try to answer the question: How does this thing work?

90% of the scientists would immediately start doing the obvious thing, which is the OpenAI Microscope Project. This part of the code looks for corners, this thing combines those other things to look for red circles on green backgrounds, etc. etc. It's a great field of research for academics—there's an endless amount of work, you keep discovering new things. You never wind up with any overarching theory, just more and more complicated machinery the deeper you dive. Steven Pinker and Gary Marcus would be in this group, writing popular books about the wondrous variety of modules in the aliens' code.

Then the other 10% of scientists come up with a radical, complementary answer: the "way this thing works" is it was built by gradient descent on a labeled dataset. These scientists still have a lot of stuff to figure out, but it's totally different stuff from what the first group is learning about. This group is not learning about corner-detecting modules and red-circle-on-green-background modules, but they are learning about BatchNorm, Xavier initialization, Adam optimizers, etc. etc. And while the first group toils forever, the second group finds that everything snaps into place, and there's an end in sight.

(I think this analogy is a bit unfair to the "the brain is modular and super-complicated" crowd, because the "wiring diagram" does create some degree of domain-specificity, modularity, etc. But I think there's a kernel of truth...)

Anyway, someone in the second group tells their story, and someone says: "Hey, you should explain why the 'gradient descent on a labeled dataset' description of what's going on is more promising than the 'OpenAI microscope' description of what's going on".

Umm, that's a hard question to answer! In this thought experiment, both groups are sorta right, but in different ways... More specifically, if you want to argue that the second group is right, it does not involve arguing that the first group is wrong!

So that's one thing...

Alternative 2: Predictive Processing / Free Energy Principle

I've had a hard time putting myself in their shoes and seeing things from their perspective. Part of it is that I don't find it gears-level-y enough, or at least I can't figure out how to see it that way. Speaking of which...

Are you sure PP deemphasizes the "multiple simultaneous generative models" frame?

No I'm not sure. I can say that, in what I've read, if that's part of the story, it wasn't stated clearly enough to get through my thick skull. :-)

I do think that a (singular) prior is supposed to be mathematically a probability distribution, and a probability distribution in a high-dimensional space can look like, for example, a weighted average of 17 totally different scenarios. So in that sense I suppose you can say that it's at most a difference of emphasis & intuition.
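To make the "one prior that is really 17 scenarios" point concrete, here is a small sketch. It uses a one-dimensional mixture of Gaussians purely for illustration (the equal weights, the means, and the function names are all assumptions, and a real prior would live in a much higher-dimensional space). The mixture is still a single probability distribution: its density integrates to 1.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, std=1.0):
    """A single prior expressed as a weighted average of separate scenarios."""
    return sum(w * normal_pdf(x, m, std) for w, m in zip(weights, means))


n_scenarios = 17
weights = [1.0 / n_scenarios] * n_scenarios        # weights sum to 1
means = [10.0 * k for k in range(n_scenarios)]     # well-separated scenarios

# Crude Riemann sum over [-10, 170) to check that the mixture is normalized.
step = 0.01
n_steps = 18000
total = sum(mixture_pdf(-10.0 + i * step, weights, means) * step for i in range(n_steps))
print(round(total, 4))
```

Whether you call this "one prior" or "17 competing generative models with weights" is then a choice of description, not a difference in the math.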

My quick, ~90 min investigation into whether neuroscience as a field buys the neocortical uniformity hypothesis suggested it's fairly controversial. Do you know why?

Nope! Please let me know if you discover anything yourself!

Do you just mean you suspect there is something in the general vicinity of a belief propagation algorithm going on here, or is your intuition more specific? If the latter, is the Dileep George paper the main thing motivating that intuition?

It's not literally just belief propagation ... Belief propagation (as far as I know) involves a graph of binary probabilistic variables that depend on each other, whereas here we're talking about a graph of "generative models" that depend on each other. A generative model is more complicated than a binary variable—for one thing, it can be a function of time.

Dileep George put the idea of PGMs in my head, or at least solidified my vague intuitions by using the standard terminology. But I mostly like it for the usual reason that if it's true then everything snaps into place and makes sense, and I don't know any alternative with that property. Examples like "purple jar" (or Eliezer's triangular light bulb) seem to me to require some component that comes with a set of probabilistic predictions about the presence/absence/features of other components ... and bam, you pretty much have "belief propagation in a probabilistic graphical model" right there. Or "stationary dancing" is another good example: as you try to imagine it, you can just feel the mutually-incompatible predictions fighting it out :-) Or Scott Alexander's "ethnic tensions" post: it's all about manipulating connections among a graph of concepts, and watching the reward prediction (= good vibes or bad vibes) travel along the edges of the graph. He even describes it as nodes and edges and weights!

If you explain it as genes having the ability to tweak hyperparameters or the gross wiring diagram in order to degrade or improve certain circuits' ability to run algorithms this domain-specific, is it still explanatorily useful to describe the neocortex as uniform?

I dunno, it depends on what question you're trying to answer.

One interesting question would be: If a scientist discovers the exact algorithm for one part of the neocortex subsystem, how far are we from superhuman AGI? I guess my answer would be "years but not decades" (not based on terribly much—things like how people who lose parts of the brain early in childhood can sometimes make substitutions; how we can "cheat" by looking at neurodevelopmental textbooks; etc.). Whereas if I were an enthusiastic proponent of modular-complicated-brain-theory, I would give a very different answer, which assumed that we have to re-do that whole discovery process over and over for each different part of the neocortex.

Another question would be: "How does the neocortex do task X in an adult brain?" Then knowing the base algorithm is just the tiny first step. Most of the work is figuring out the space of generative models, which are learned over the course of the person's life. Subcortex, wiring diagram, hyperparameters, a lifetime's worth of input data and memes—everything is involved. What models do you wind up with? How did they get there? What do they do? How do they interact? It can be almost arbitrarily complicated.

Say there exist genes that confer advantage in math-ey reasoning. By what mechanism is this advantage mediated

Well my working assumption is that it's one or more of the three possibilities of hyperparameters, wiring diagram, and something in the subcortex that motivates some (lucky) people to want to spend time thinking about math. Like I'll be eating dinner talking with my wife about whatever, and my 5yo kid will just jump in and interrupt the conversation to tell me that 9×9=81. Not trying to impress us, that's just what he's thinking about! He loves it! Lucky kid. I have no idea how that motivational drive is implemented. (In fact I haven't thought about how curiosity works in general.) Thanks for the good question, I'll comment again if I think of anything.

Dehaene has a book about math-and-neuroscience I've been meaning to read. He takes a different perspective from me but brings an encyclopedic knowledge of the literature.

Do you have the intuition that aspects of the neocortical algorithm itself (or the subcortical algorithms themselves) might be safety-relevant? 

I interpret your question as saying: let's say people publish on GitHub how to make brain-like AGIs, so we're stuck with that, and we're scrambling to mitigate their safety issues as best as we can. Do we just work on the subcortical steering mechanism, or do we try to change other things too? Well, I don't know. I think the subcortical steering mechanism would be an especially important thing to work on, but everything's on the table. Maybe you should box the thing, maybe you should sanitize the information going into it, maybe you should strategically gate information flow between different areas, etc. etc. I don't know of any big ways to wholesale change the neocortical algorithm and have it continue to work at least as effectively as before, although I'm open to that being a possibility.

how credit assignment is implemented

I've been saying "generative models make predictions about reward just like they make predictions about everything else", and the algorithm figures it out just like everything else. But maybe that's not exactly right. Instead we have the nice "TD learning" story. If I understand it right, it's something like: All generative models (in the frontal lobe) have a certain number of reward-prediction points. You predict reward by adding it up over the active generative models. When the reward is higher than you expected, all the active generative models get some extra reward-prediction points. When it's lower than expected, all the active generative models lose reward-prediction points. I think this is actually implemented in the basal ganglia, which has a ton of connections all around the frontal lobe, and memorizes the reward-associations of arbitrary patterns, or something like that. Also, when there are multiple active models in the same category, the basal ganglia makes the one with higher reward-prediction points more prominent, and/or squashes the one with lower reward-prediction points.
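The bookkeeping in that story can be sketched in a few lines. This is a toy under loud assumptions: the dictionary of "reward-prediction points", the learning rate, and the function names are all made up for illustration, and the update is a simple prediction-error rule (closer to Rescorla-Wagner than to full TD learning, since it has no bootstrapping from the next state's prediction). It is not a claim about how the basal ganglia actually implements any of this.

```python
def predict_reward(points, active):
    """Predicted reward is the sum of points over the active generative models."""
    return sum(points[name] for name in active)


def update_points(points, active, reward, learning_rate=0.1):
    """Share the reward-prediction error among all currently active models."""
    error = reward - predict_reward(points, active)
    for name in active:
        # Better than expected: active models gain points. Worse: they lose points.
        points[name] += learning_rate * error
    return error


def pick_winner(points, competing):
    """Among competing models in the same category, favor the one with more points."""
    return max(competing, key=lambda name: points[name])


points = {"eat cookie": 0.0, "open jar": 0.0, "do taxes": 0.0}
error = update_points(points, ["eat cookie", "open jar"], reward=1.0)
winner = pick_winner(points, ["eat cookie", "do taxes"])
print(error, points, winner)
```

Only the models that were active at the time of the surprise get their points changed; "do taxes" is untouched, so it loses the competition against "eat cookie" afterwards.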

In a sense, I think credit assignment might work a bit better in the neocortex than in a typical ML model, because the neocortex already has hierarchical planning. So, for example, in chess, you could plan a sequence of six moves that leads to an advantage. When it works better than expected, there's a generative model representing the entire sequence, and that model is still active, so that model gets more reward-prediction points, and now you'll repeat that whole sequence in the future. You don't need to do six TD iterations to figure out that that set of six moves was a good idea. Better yet, all the snippets of ideas that contributed to the concept of this sequence of six moves are also active at the time of the surprising success, and they also get credit. So you'll be more likely to do moves in the future that are related in an abstract way to the sequence of moves you just did.

Something like that, but I haven't thought about it much.

My computational framework for the brain

Have you thought much about whether there are parts of this research you shouldn't publish?

Yeah, sure. I have some ideas about the gory details of the neocortical algorithm that I haven't seen in the literature. They might or might not be correct and novel, but at any rate, I'm not planning to post them, and I don't particularly care to pursue them, under the circumstances, for the reasons you mention.

Also, there was one post that I sent for feedback to a couple people in the community before posting, out of an abundance of caution. Neither person saw it as remotely problematic, in that case.

Generally I think I'm contributing "epsilon" to the project of reverse-engineering neocortical algorithms, compared to the community of people who work on that project full-time and have been at it for decades. Whereas I'd like to think that I'm contributing more than epsilon to the project of safe & beneficial AGI. (Unless I'm contributing negatively by spreading wrong ideas!) I dunno, but I think my predispositions are on the side of an overabundance of caution.

I guess I was also taking solace from the fact that nobody here said anything to me, until your comment just now. I suppose that's weak evidence: maybe nobody feels it's their place, or nobody's thinking about it, or whatever.

If you or anyone wants to form an IRB that offers a second opinion on my possibly-capabilities-relevant posts, I'm all for it. :-)

By the way, full disclosure, I notice feeling uncomfortable even talking about whether my posts are info-hazard-y or not, since it feels quite arrogant to even be considering the possibility that my poorly-researched free-time blog posts are so insightful that they materially advance the field. In reality, I'm super uncertain about how much I'm on a new right track, vs right but reinventing wheels, vs wrong, when I'm not directly parroting people (which at least rules out the first possibility). Oh well. :-P

My computational framework for the brain

Good questions!!!

Where are qualia and consciousness in this model?

See my Book Review: Rethinking Consciousness.

Is this model address difference between two hemispheres?

Insofar as there are differences between the two hemispheres—and I don't know much about that—I would treat it like any other difference between different parts of the cortex (Section 2), i.e. stemming from (1) the innate large-scale initial wiring diagram, and/or (2) differences in "hyperparameters".

There's a lot that can be said about how an adult neocortex represents and processes information—the dorsal and ventral streams, how do Wernicke's area and Broca's area interact in speech processing, etc. etc. ad infinitum. You could spend your life reading papers about this kind of stuff!! It's one of the main activities of modern cognitive neuroscience. And you'll notice that I said nothing whatsoever about that. Why not?

I guess there's a spectrum of how to think about this whole field of inquiry:

  • On one end of the spectrum (the Gary Marcus / Steven Pinker end), this line of inquiry is directly attacking how the brain works, so obviously the way to understand the brain is to work out all these different representations and mechanisms and data flows etc.
  • On the opposite end of the spectrum (maybe the "cartoonish connectionist" end?), this whole field is just like the OpenAI Microscope project. There is a simple, generic learning algorithm, and all this rich structure—dorsal and ventral streams, phoneme processing in such-and-such area, etc.—just naturally pops out of the generic learning algorithm. So if your goal is just to make artificial intelligence, this whole field of inquiry is entirely unnecessary—in the same way that you don't need to study the OpenAI Microscope project in order to train and use a ConvNet image classifier. (Of course maybe your goal is something else, like understanding adult human cognition, in which case this field is still worth studying.)

I'm not all the way at the "cartoonish connectionist" end of the spectrum, because I appreciate the importance of the initial large-scale wiring diagram and the hyperparameters. But I think I'm quite a bit farther in that direction than is the median cognitive neuroscientist. (I'm not alone out here ... just in the minority.) So I get more excited than mainstream neuroscientists by low-level learning algorithm details, and less excited than mainstream neuroscientists about things like hemispherical specialization, phoneme processing chains, dorsal and ventral streams, and all that kind of stuff. And yeah, I didn't talk about it at all in this blog post.

What about long term-memory? Is it part of neocortex?

There's a lot about how the neocortex learning algorithm works that I didn't talk about, and indeed a lot that is unknown, and certainly a lot that I don't know! For example, the generative models need to come from somewhere!

My impression is that the hippocampus is optimized to rapidly memorize arbitrary high-level patterns, but it only holds on to those memories for like a couple years, during which time it recalls them when appropriate to help the neocortex deeply embed that new knowledge into its world model, with appropriate connections and relationships to other knowledge. So the final storage space for long-term memory is the neocortex.

I'm not too sure about any of this.

This video about the hippocampus is pretty cool. Note that I count the hippocampus as part of the "neocortex subsystem", following Jeff Hawkins.

How this model explain the phenomenon of night dreams?

I don't know. I assume it somehow helps optimize the set of generative models and their connections.

I guess dreaming could also have a biological purpose but not a computational purpose (e.g., some homeostatic neuron-maintenance process, that makes the neurons fire incidentally). I don't think that's particularly likely, but it's possible. Beats me.
