I'm Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at:

steve2152's Comments

Human instincts, symbol grounding, and the blank-slate neocortex

Thanks for the comment! When I think about it now (8 months later), I have three reasons for continuing to think CCA is broadly right:

  1. Cytoarchitectural (quasi-) uniformity. I agree that this doesn't definitively prove anything by itself, but it's highly suggestive. If different parts of the cortex were doing systematically very different computations, well maybe they would start out looking similar when the differentiation first started to arise millions of years ago, but over evolutionary time you would expect them to gradually diverge into superficially-obviously-different endpoints that are more appropriate to their different functions.

  2. Narrowness of the target, sorta. Let's say there's a module that takes specific categories of inputs (feedforward, feedback, reward, prediction-error flags) and has certain types of outputs, and it systematically learns to predict the feedforward input and control the outputs according to generative models following this kind of selection criterion (or something like that). This is a very specific and very useful thing. Whatever the reward signal is, this module will construct a theory about what causes that reward signal and make plans to increase it. And this kind of module automatically tiles—you can connect multiple modules and they'll be able to work together to build more complex composite generative models integrating more inputs to make better reward predictions and better plans. I feel like you can't just shove some other computation into this system and have it work—it's either part of this coordinated prediction-and-action mechanism, or not (in which case the coordination prediction-and-action mechanism will learn to predict it and/or control it, just like it does for the motor plant etc.). Anyway, it's possible that some part of the neocortex is doing a different sort of computation, and not part of the prediction-and-action mechanism. But if so, I would just shrug and say "maybe it's technically part of the neocortex, but when I say "neocortex", I'm using the term loosely and excluding that particular part." After all, I am not an anatomical purist; I am already including part of the thalamus when I say "neocortex" for example (I have a footnote in the article apologizing for that). Sorry if this description is a bit incoherent, I need to think about how to articulate this better.

  3. Although it's probably just the Dunning-Kruger talking, I do think I at least vaguely understand what the algorithm is doing and how it works, and I feel like I can concretely see how it explains everything about human intelligence including causality, counterfactuals, hierarchical planning, task-switching, deliberation, analogies, concepts, etc. etc.

Building brain-inspired AGI is infinitely easier than understanding the brain

The human neocortical algorithm probably wouldn't work very well if it were applied in a brain 100x smaller

I disagree, as I discussed here, I think the neocortex is uniform-ish and that a cortical column in humans is doing a similar calculation as a cortical column in rats or the equivalent bundle of cells (arranged not as a column) in a bird pallium or lizard pallium. I do think you need lots and lots of cortical columns, initialized with appropriate region-to-region connections, to get human intelligence. Well, maybe that's what you meant by "human neocortical algorithm", in which case I agree. You also need appropriate subcortical signals guiding the neocortex, for example to flag human speech sounds as being important to attend to.

human intelligence minus rat intelligence is probably easier to understand and implement than rat intelligence alone..

Well, I do think that there's a lot of non-neocortical innovations between humans and rats, particularly to build our complex suite of social instincts, see here. I don't think understanding those innovations is necessary for AGI, although I do think it would be awfully helpful to understand them if we want aligned AGI. And I think they are going to be hard to understand, compared to the neocortex.

I don't think we can learn arbitrarily domains, not even close

Sure. A good example is temporal sequence learning. If a sequence of things happens, we expect the same sequence to recur in the future. In principle, we can imagine an anti-inductive universe where, if a sequence of things happens, then it's especially unlikely to recur in the future, at all levels of abstraction. Our learning algorithm would crash and burn in such a universe. This is a particular example of the no-free-lunch theorem, and I think it illustrates that, while there are domains that the neocortical learning algorithm can't learn, they may be awfully weird and unlikely to come up.

Building brain-inspired AGI is infinitely easier than understanding the brain

For example, I mentioned 1, 2, 3 earlier in the article... That should get you started but I'm happy to discuss more. :-)

[Link] Lex Fridman Interviews Karl Friston

Friston was also on Sean Carroll's podcast recently - link. I found it slightly helpful. I may listen to this one too; if I do I'll comment on how they compare.

On the construction of the self

I'm really enjoying all these posts, thanks a lot!

something about the argument brought unpleasant emotions into your mind. A subsystem activated with the goal of making those emotions go away, and an effective way of doing so was focusing your attention on everything that could be said to be wrong with your spouse.

Wouldn't it be simpler to say that righteous indignation is a rewarding feeling (in the moment) and we're motivated to think thoughts that bring about that feeling?

the “regretful subsystem” cannot directly influence the “nasty subsystem”

Agreed, and this is one of the reasons that I think normal intuitions about how agents behave don't necessarily carry over to self-modifying agents whose subagents can launch direct attacks against each other, see here.

it looks like craving subsystems run on cached models

Yeah, just like every other subsystem right? Whenever any subsystem (a.k.a. model a.k.a. hypothesis) gets activated, it turns on a set of associated predictions. If it's a model that says "that thing in my field of view is a cat", it activates some predictions about parts of the visual field. If it's a model that says "I am going to brush my hair in a particular way", it activates a bunch of motor control commands and related sensory predictions. If it's a model that says "I am going to get angry at them", it activates, um, hormones or something, to bring about the corresponding arousal and valence. All these examples seem like the same type of thing to me, and all of them seem like "cached models".

From self to craving (three characteristics series)


I don't really see the idea of hypotheses trying to prove themselves true. Take the example of saccades that you mention. I think there's some inherent (or learned) negative reward associated with having multiple active hypotheses (a.k.a. subagents a.k.a. generative models) that clash with each other by producing confident mutually-inconsistent predictions about the same things. So if model A says that the person coming behind you is your friend and model B says it's a stranger, then that summons model C which strongly predicts that we are about to turn around and look at the person. This resolves the inconsistency, and hence model C is rewarded, making it ever more likely to be summoned in similar circumstances in the future.

You sorta need multiple inconsistent models for it to make sense for one to prove one of them true. How else would you figure out which part of the model to probe? If a model were trying to prevent itself from being falsified, that would predict that we look away from things that we're not sure about rather than towards them.

OK, so here's (how I think of) a typical craving situation. There are two active models.

Model A: I will eat a cookie and this will lead to an immediate reward associated with the sweet taste

Model B: I won't eat the cookie, instead I'll meditate on gratitude and this will make me very happy

Now in my perspective, this is great evidence that valence and reward are two different things. If becoming happy is the same as reward, why haven't I meditated in the last 5 years even though I know it makes me happy? And why do I want to eat that cookie even though I totally understand that it won't make me smile even while I'm eating it, or make me less hungry, or anything?

When you say "mangling the input quite severely to make it fit the filter", I guess I'm imagining a scenario like, the cookie belongs to Sally, but I wind up thinking "She probably wants me to eat it", even if that's objectively far-fetched. Is that Model A mangling the evidence to fit the filter? I wouldn't really put it that way...

The thing is, Model A is totally correct; eating the cookie would lead to an immediate reward! It doesn't need to distort anything, as far as it goes.

So now there's a Model A+D that says "I will eat the cookie and this will lead to an immediate reward, and later Sally will find out and be happy that I ate the cookie, which will be rewarding as well". So model A+D predicts a double reward! That's a strong selective pressure helping advance that model at the expense of other models, and thus we expect this model to be adopted, unless it's being weighed down by a sufficiently strong negative prior, e.g. if this model has been repeatedly falsified in the past, or if it contradicts a different model which has been repeatedly successful and rewarded in the past.

(This discussion / brainstorming is really helpful for me, thanks for your patience.)

Wrist Issues

Here's my weird story from 15 years ago where I had a year of increasingly awful debilitating RSI and then eventually I read a book and a couple days later it was gone forever.

My dopey little webpage there has helped at least a few people over the years, including apparently the co-founder of, so I continue to share it, even though I find it mildly embarrassing. :-P

Needless to say, but YMMV; I can speak to my own experience, but I don't pretend to know how to cure anyone else.

Anyway, that sucks and I hope you feel better soon!

Source code size vs learned model size in ML and in humans?

OK, I think that helps.

It sounds like your question should really be more like how many programmer-hours go into putting domain-specific content / capabilities into an AI. (You can disagree.) If it's very high, then it's the Robin-Hanson-world where different companies make AI-for-domain-X, AI-for-domain-Y, etc., and they trade and collaborate. If it's very low, then it's more plausible that someone will have a good idea and Bam, they have an AGI. (Although it might still require huge amounts of compute.)

If so, I don't think the information content of the weights of a trained model is relevant. The weights are learned automatically. Changing the code from num_hidden_layers = 10 to num_hidden_layers = 100 is not 10× the programmer effort. (It may or may not require more compute, and it may or may not require more labeled examples, and it may or may not require more hyperparameter tuning, but those are all different things, and in no case is there any reason to think it's a factor of 10, except maybe some aspects of compute.)

I don't think the size of the PyTorch codebase is relevant either.

I agree that the size of the human genome is relevant, as long as we all keep in mind that it's a massive upper bound, because perhaps a vanishingly small fraction of that is "domain-specific content / capabilities". Even within the brain, you have to synthesize tons of different proteins, control the concentrations of tons of chemicals, etc. etc.

I think the core of your question is generalizability. If you have AlphaStar but want to control a robot instead, how much extra code do you need to write? Do insights in computer vision help with NLP and vice-versa? That kind of stuff. I think generalizability has been pretty high in AI, although maybe that statement is so vague as to be vacuous. I'm thinking, for example, it's not like we have "BatchNorm for machine translation" and "BatchNorm for image segmentation" etc. It's the same BatchNorm.

On the brain side, I'm a big believer in the theory that the neocortex has one algorithm which simultaneously does planning, action, classification, prediction, etc. (The merging of action and understanding in particular is explained in my post here, see also Planning By Probabilistic Inference.) So that helps with generalizability. And I already mentioned my post on cortical uniformity. I think a programmer who knows the core neocortical algorithm and wants to then imitate the whole neocortex would mainly need (1) a database of "innate" region-to-region connections, organized by connection type (feedforward, feedback, hormone receptors) and structure (2D array of connections vs 1D, etc.), (2) a database of region-specific hyperparameters, especially when the region should lock itself down to prevent further learning ("sensitive periods"). Assuming that's the right starting point, I don't have a great sense for how many bits of data this is, but I think the information is out there in the developmental neuroscience literature. My wild guess right now would be on the order of a few KB, but with very low confidence. It's something I want to look into more when I get a chance. Note also that the would-be AGI engineer can potentially just figure out those few KB from the neuroscience literature, rather than discovering it in a more laborious way.

Oh, you also probably need code for certain non-neocortex functions like flagging human speech sounds as important to attend to etc. I suspect that that particular example is about as straightforward as it sounds, but there might be other things that are hard to do, or where it's not clear what needs to be done. Of course, for an aligned AGI, there could potentially be a lot of work required to sculpt the reward function.

Just thinking out loud :)

From self to craving (three characteristics series)


My running theory so far (a bit different from yours) would be:

  • Motivation = prediction of reward
  • Craving = unhealthily strong motivation—so strong that it breaks out of the normal checks and balances that prevent wishful thinking etc.
  • When empathetically simulating someone's mental state, we evoke the corresponding generative model in ourselves (this is "simulation theory"), but it shows up in attenuated form, i.e. with weaker (less confident) predictions (I've already been thinking that, see here).
  • By meditative practice, you can evoke a similar effect in yourself, sorta distancing yourself from your feelings and experiencing them in a quasi-empathetic way.
  • ...Therefore, this is a way to turn cravings (unhealthily strong motivations) into mild, healthy, controllable motivations

Incidentally, why is the third bullet point true, and how is it implemented? I was thinking about that this morning and came up with the following ... If you have generative model A that in turn evokes (implies) generative model B, the model B inherits the confidence of A, attenuated by a measure of how confident you are that A leads to B (basically, P(A) × P(B|A)). So if you indirectly evoke a generative model, it's guaranteed to appear with a lower confidence value than if you directly evoke it.

In empathetic simulation, A would be the model of the person you're simulating, and B would be the model that you think that person is thinking / feeling.

Sorry if that's stupid, just thinking out loud :)

Get It Done Now

it includes detailed advice that approximately no one will follow

Hey, I read the book in 2012, and I still have a GTD-ish alphabetical file, GTD-ish desk "inbox", and GTD-ish to-do list. Of course they've all gotten watered down a bit over the years from the religious fervor of the book, but it's still something.

If you decide to eventually do a task that requires less than two minutes to do, that can efficiently be done right now, do it right now.

Robert Pozen Extreme Productivity has a closely-related principle he calls "OHIO"—Only Handle It Once. If you have all the decision-relevant information that you're likely to get, then just decide right away. He gives an example of getting an email invitation to something, checking his calendar, and immediately booking a flight and hotel. I can't say I follow that one very well, but at least I acknowledge it as a goal to aspire to.

Load More