Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and a sorted list of my writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Mastodon, Threads, GitHub, Wikipedia, Physics-StackExchange, LinkedIn

Sequences

Intuitive Self-Models
Valence
Intro to Brain-Like-AGI Safety

Comments

(COI: I’m a lesswrong power-user)

Instead of the hands-on experimentation I expected, what I see is a culture heavily focused on long-form theoretical posts.

FWIW if you personally want to see more of those you can adjust the frontpage settings to boost the posts with a "practical" tag. Or for a dedicated list: https://www.lesswrong.com/tag/practical?sortedBy=magic. I agree that such posts are currently a pretty small fraction of the total, for better or worse. But maybe the absolute number is a more important metric than the fraction?

I’ve written a few “practical” posts on LW, and I generally get very useful comments on them.

Consensus-Building Tools

I think these mostly have yet to be invented.

Consensus can be SUPER HARD. In my AGI safety work, on ~4 occasions I’ve tried to reconcile my beliefs with someone else’s, and it wound up being the main thing I was doing for about an entire month, just to get to the point where I could clearly articulate what the other person believed and why I disagreed with it. As for actually reaching consensus, I gave up before getting that far! (See e.g. here, here, here)

I don't really know what would make that kind of thing easier but I hope someone figures it out!

“It is more valuable to provide accurate forecasts than add new, relevant, carefully-written considerations to an argument”

On what margin? In what context? I hope we can all think of examples where the forecast is the more valuable contribution, and other examples where the carefully-written consideration is. If Einstein had predicted the Eddington experiment results without explaining the model underlying his prediction, I don’t think anyone would have gotten much out of it; indeed, probably nobody would have bothered doing the Eddington experiment in the first place.

Manifold, Polymarket, etc. already exist, and I'm very happy they do!! I think lesswrong is filling a different niche, and that's fine.

High status could be tied to demonstrated good judgment through special user flair for accurate forecasters or annual prediction competitions.

As for reputation, I think the idea is that you should judge a comment or post by its content and not by the karma of the person who wrote it. Comment karma and post karma are on display, whereas user karma is hidden behind a hover or click. That seems good to me. I myself write posts and comments of widely varying quality, and other people sure do too.

An important part of learning is feeling free and safe to be an amateur messing around with half-baked ideas in a new area—overly-central “status” systems can sometimes discourage that kind of thing, which is bad. (Cf. academia.)

(I think there’s a mild anticorrelation between my own posts’ karma and how objectively good and important they are (see here), so it’s a good thing that I don’t care too much about karma!) (Of course the anticorrelation doesn’t mean high karma is bad; rather, it comes from conditioning on a collider, as in the toy simulation below.)
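For readers unfamiliar with collider bias (a.k.a. Berkson’s paradox), here’s a minimal toy simulation of the mechanism. Everything in it is made up for illustration (the variable names, the publication threshold, the assumption that karma mostly tracks broad appeal), but it shows how selecting on a collider can manufacture an anticorrelation between quality and karma even when there’s none in the underlying population:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent traits of a potential post (hypothetical model):
importance = rng.normal(size=n)  # how objectively good/important it is
appeal = rng.normal(size=n)      # how fun/accessible/broadly appealing it is

# Collider: the author only bothers writing up and publishing ideas
# whose traits jointly clear some bar.
published = importance + appeal > 1.0

# In this toy model, karma mostly tracks broad appeal, not importance.
karma = appeal + 0.3 * rng.normal(size=n)

print(np.corrcoef(importance, karma)[0, 1])                        # ~0 overall
print(np.corrcoef(importance[published], karma[published])[0, 1])  # clearly negative
```

In the full population the importance–karma correlation is ≈0; among the “published” posts it comes out clearly negative, even though nothing in the model makes importance lower karma.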

As a long-time power user, I benefit from the best possible “reputation system”: actually knowing most of the commenters. That’s great because I don’t just know them as “good” or “bad”, but rather as “coming from such-and-such perspective”, or “everything they say sounds nuts, and usually is, but sometimes they have some extraordinary insight, and I should be especially open-minded to anything they say in such-and-such domain”.

If there were a lesswrong prediction competition, I expect that I wouldn’t participate, because it would be too time-consuming. There are some people whom I would like to take my ideas seriously, but such people EITHER (1) already take my ideas seriously (e.g. people really into AGI safety) OR (2) would not care whether or not I have a strong forecasting track record (e.g. Yann LeCun).

There’s also a question about cross-domain transferability of good takes. If we want discourse about near-term geopolitical forecasting, then of course we should platform people with a strong track record of near-term geopolitical forecasting. And if we want discourse about the next ML innovation, then we should platform people with a strong track record of coming up with ML innovations. I’m most interested in neither of those, but rather AGI / ASI, which doesn’t exist yet. Empirically, in my opinion, “ability to come up with ML innovations” transfers quite poorly to “ability to have reasonable expectations about AGI / ASI”. I’m thinking of Yann LeCun for example. What about near-term geopolitical forecasting? Does that transfer? Time will tell—mostly when it’s already too late. At the very least, there are skilled forecasters who strongly disagree with each other about AGI / ASI, so at least some of them are wrong.

(If someone in 1400 AD were quite good at predicting the next coup or war or famine, I wouldn’t expect them to be particularly good at predicting how the industrial revolution would go down. Right? And I think AGI / ASI is kinda like the latter.)

So anyway, it’s probably best to say that we can’t predict a priori who is going to have good takes on AGI just based on their track record in some different domain. So that’s yet another reason not to have a super central and visible personal reputation system, IMO.

The main insight of the post (as I understand it) is this:

  • In the context of a discussion of whether we should be worried about AGI x-risk, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—hooray, why were people so worried about AGI risk?”
  • In the context of a discussion among tech people and VCs about how we haven't yet made an AGI that can found and run companies as well as Jeff Bezos, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—alas, let's try to fix that problem.”

One sounds good and the other sounds bad, but there’s a duality connecting them. They’re the same observation. You can’t get one without the other.

This is an important insight because it helps us recognize the fact that people are trying to solve the second-bullet-point problem (and making nonzero progress), and to the extent that they succeed, they’ll make things worse from the perspective of the people in the first bullet point.

This insight is not remotely novel! (And OP doesn’t claim otherwise.) …But that’s fine, nothing wrong with saying things that many readers will find obvious.

(This “duality” thing is a useful formula! Another related example that I often bring up is the duality between positive-coded “the AI is able to come up with out-of-the-box solutions to problems” versus the negative-coded “the AI sometimes engages in reward hacking”. I think another duality connects positive-coded “it avoids catastrophic forgetting” to negative-coded “it’s hard to train away scheming”, at least in certain scenarios.)

(…and as comedian Mitch Hedberg sagely noted, there’s a duality between positive-coded “cheese shredder” and negative-coded “sponge ruiner”.)

The post also chats about two other (equally “obvious”) topics:

  • Instrumental convergence: “the AI seems like it's trying hard to autonomously accomplish long-horizon goals” involves the AI routing around obstacles, and one might expect that to generalize to “obstacles” like programmers trying to shut it down
  • Goal (mis)generalization: If “the AI seems like it's trying hard to autonomously accomplish long-horizon goal X”, then the AI might actually “want” some different Y which partly overlaps with X, or is downstream from X, etc.

But the question on everyone’s mind is: Are we doomed?

In and of itself, nothing in this post proves that we’re doomed. I don’t think OP ever explicitly claimed it did? In my opinion, there’s nothing in this post that should constitute an update for the many readers who are already familiar with instrumental convergence, and goal misgeneralization, and the fact that people are trying to build autonomous agents. But OP at least gives a vibe of being an argument for doom going beyond those things, which I think was confusing people in the comments.

Why aren’t we necessarily doomed? Now this is my opinion, not OP’s, but here are three pretty-well-known outs (at least in principle):

  • The AI can “want” to autonomously accomplish a long-horizon goal, but also simultaneously “want” to act with integrity, helpfulness, etc. Just like it’s possible for humans to do. And if the latter “want” is strong enough, it can outvote the former “want” in cases where they conflict. See my post Consequentialism & corrigibility.
  • The AI can behaviorist-“want” to autonomously accomplish a long-horizon goal, but where the “want” is internally built in such a way as to not generalize OOD to make treacherous turns seem good to the AI. See e.g. my post Thoughts on “Process-Based Supervision”, which is skeptical about the practicalities, but I think the idea is sound in principle.
  • We can in principle simply avoid building AIs that autonomously accomplish long-horizon goals, notwithstanding the economic and other pressures—for example, by keeping humans in the loop (e.g. oracle AIs). This one came up multiple times in the comments section.

There are plenty of challenges in these approaches, and interesting discussions to be had, but the post doesn’t engage with any of these topics.

Anyway, I’m voting strongly against including this post in the 2023 review. It’s not crisp about what it’s arguing for and against (and many commenters seem to have gotten the wrong idea about what it’s arguing for), it’s saying obvious things in a meandering way, and it’s not refuting or even mentioning any of the real counterarguments / reasons for hope. It’s not “best of” material.

How is it that some tiny number of man made mirror life forms would be such a threat to the millions of naturally occurring life forms, but those millions of naturally occurring life forms would not be an absolutely overwhelming symmetrical threat to those few man made mirror forms?

Can’t you ask the same question for any invasive species? Yet invasive species exist. “How is it that some people putting a few Nile perch into Lake Victoria in the 1950s would cause ‘the extinction or near-extinction of several hundred native species’, but the native species of Lake Victoria would not be an absolutely overwhelming symmetrical threat to those Nile perch?”

If I'm not mistaking, you've already changed the wording

No, I haven’t changed anything in this post since Dec 11, three days before your first comment.

valid EA response … EA forum … EA principles …

This isn’t the EA Forum. Also, you shouldn’t equate “EA” with “concerned about AGI extinction”. There are plenty of self-described EAs who think that AGI extinction is astronomically unlikely and a pointless thing to worry about. (And also plenty of self-described EAs who think the opposite.)

prevent spam/limit stupid comments without causing distracting emotions

If Hypothetical Person X tends to write what you call “stupid comments”, and if they want to be participating on Website Y, and if Website Y wants to prevent Hypothetical Person X from doing that, then there’s an irreconcilable conflict here, and it seems almost inevitable that Hypothetical Person X is going to wind up feeling annoyed by this interaction. Like, Website Y can do things on the margin to make the transaction less unpleasant, but it’s surely going to be somewhat unpleasant under the best of circumstances.

(Pick any popular forum on the internet, and I bet that either (1) there’s no moderation process and thus there’s a ton of crap, or (2) there is a moderation process, and many of the people who get warned or blocked by that process are loudly and angrily complaining about how terrible and unjust and cruel and unpleasant the process was.)

Anyway, I don’t know why you’re saying that here-in-particular. I’m not a moderator, I have no special knowledge about running forums, and it’s way off-topic. (But if it helps, here’s a popular-on-this-site post related to this topic.)

[EDIT: reworded this part a bit.] 

what would be a valid EA response to the arguments coming from people fitting these bullets:

  • Some are over-optimistic based on mistaken assumptions about the behavior of humans;
  • Some are over-optimistic based on mistaken assumptions about the behavior of human institutions;

That’s off-topic for this post so I’m probably not going to chat about it, but see this other comment too.

I think of myself as having high ability and willingness to respond to detailed object-level AGI-optimist arguments, for example:

…and more.

I don’t think this OP involves “picturing AI optimists as stubborn simpletons not being able to get persuaded finally that AI is a terrible existential risk”. (I do think AGI optimists are wrong, but that’s different!) At least, I didn’t intend to do that. I can potentially edit the post if you help me understand how you think I’m implying that, and/or you can suggest concrete wording changes etc.; I’m open-minded.

Yeah, the word “consummatory” isn’t great in general (see here); maybe I shouldn’t have used it. But I do think walking is an “innate behavior”, just as sneezing and laughing and flinching and swallowing are. E.g., decorticate rats can walk. As for human babies, they’re decorticate-ish in effect for the first few months, but I think they still have a “walking / stepping reflex” from day 1.

There can be an innate behavior, but also voluntary cortex control over when and whether it starts—those aren’t contradictory, IMO. This is always true to some extent—e.g. I can voluntarily suppress a sneeze. Intuitively, yeah, I do feel like I have more voluntary control over walking than I do over sneezing or vomiting. (Swallowing is maybe in the same category as walking?) I still want to say that all these “innate behaviors” (including walking) are orchestrated by the hypothalamus and brainstem, but that there’s also voluntary control coming via cortex→hypothalamus and/or cortex→brainstem motor-type output channels.

I’m just chatting about my general beliefs.  :)  I don’t know much about walking in particular, and I haven’t read that particular paper (paywall & I don’t have easy access).

Oh I forgot, you’re one of the people who seem to think that the only conceivable reason anyone would ever talk about AGI x-risk is that they are trying to argue in favor of, or against, whatever AI government regulation was most recently in the news. (Your comment was one of the examples that I mockingly linked in the intro here.)

If I think AGI x-risk is >>10%, and you think AGI x-risk is 1-in-a-gazillion, then it seems self-evident to me that we should be hashing out that giant disagreement first; and discussing what if any government regulations would be appropriate in light of AGI x-risk second. We’re obviously not going to make progress on the latter debate if our views are so wildly far apart on the former debate!! Right?

So that’s why I think you’re making a mistake whenever you redirect arguments about the general nature & magnitude & existence of the AGI x-risk problem into arguments about certain specific government policies that you evidently feel very strongly about.

(If it makes you feel any better, I have always been mildly opposed to the six-month pause plan.)

I’ve long had a tentative rule-of-thumb that:

  • medial hypothalamus neuron groups are mostly “tracking a state variable”;
  • lateral hypothalamus neuron groups are mostly “turning on a behavior” (especially a “consummatory behavior”).

(…apart from the mammillary areas way at the posterior end of the hypothalamus. They’re their own thing.)

State variables are things like hunger, temperature, immune system status, fertility, horniness, etc.

I don’t have a great proof of that, just some indirect suggestive evidence. (Orexin, contiguity between lateral hypothalamus and PAG, various specific examples of people studying particular hypothalamus neurons.) Anyway, it’s hard to prove directly because changing a state variable can lead to taking immediate actions. And it’s really just a rule of thumb; I’m sure there’s exceptions, and it’s not really a bright-line distinction anyway.

The literature on the lateral hypothalamus is pretty bad. The main problem IIUC is that LH is “reticular”, i.e. when you look at it under the microscope you just see a giant mess of undifferentiated cells. That appearance is probably deceptive—appropriate stains can reveal nice little nuclei hiding inside the otherwise-undifferentiated mess. But I think only one or a few such hidden nuclei are known (the example I’m familiar with is “parvafox”).

Yup! I think discourse with you would probably be better focused on the 2nd or 3rd or 4th bullet points in the OP—i.e., not “we should expect such-and-such algorithm to do X”, but rather “we should expect people / institutions / competitive dynamics to do X”.

I suppose we can still come up with “demos” related to the latter, but it’s a different sort of “demo” than the algorithmic demos I was talking about in this post. As some examples:

  • Here is a “demo” that a leader of a large active AGI project can declare that he has a solution to the alignment problem, specific to his technical approach, but where the plan doesn’t stand up to a moment’s scrutiny.
  • Here is a “demo” that a different AGI project leader can declare that even trying to solve the alignment problem is already overkill, because misalignment is absurd and AGIs will just be nice, again for reasons that don’t stand up to a moment’s scrutiny.
  • (And here’s a “demo” that at least one powerful tech company executive might be fine with AGI wiping out humanity anyway.)
  • Here is a “demo” that if you give random people access to an AI, one of them might ask it to destroy humanity, just to see what would happen. Granted, I think this person had justified confidence that this particular AI would fail to destroy humanity …
  • … but here is a “demo” that people will in fact do experiments that threaten the whole world, even despite a long track record of rock-solid statistical evidence that the exact thing they’re doing is indeed a threat to the whole world, far out of proportion to its benefit, and that governments won’t stop them, and indeed that governments might even fund them.
  • Here is a “demo” that, given a tradeoff between AI transparency (English-language chain-of-thought) and AI capability (inscrutable chain-of-thought but the results are better), many people will choose the latter, and pat themselves on the back for a job well done.
  • Every week we get more “demos” that, if next-token prediction is insufficient to make a powerful autonomous AI agent that can successfully pursue long-term goals via out-of-the-box strategies, then many people will say “well, so much the worse for next-token prediction”, and they’ll try to find some other approach that is sufficient for that.
  • Here is a “demo” that companies are capable of ignoring or suppressing potential future problems when they would interfere with immediate profits.
  • Here is a “demo” that it’s possible for there to be a global catastrophe causing millions of deaths and trillions of dollars of damage, and then immediately afterwards everyone goes back to not even taking trivial measures to prevent similar or worse catastrophes from recurring.
  • Here is a “demo” that the arrival of highly competent agents with the capacity to invent technology and to self-reproduce is a big friggin’ deal.
  • Here is a “demo” that even small numbers of such highly competent agents can maneuver their way into dictatorial control over a much much larger population of humans.

I could go on and on. I’m not sure of your exact views, so it’s quite possible that none of these are crux-y for you, and your crux lies elsewhere.  :)

Thanks!

I feel like the actual crux between you and OP is with the claim in post #2 that the brain operates outside the neuron doctrine to a significant extent.

I don’t think that’s quite right. The neuron doctrine is pretty specific, IIUC. I want to say: when the brain does systematic things, it’s because the brain is running a legible algorithm that relates to those things. And then there’s a legible explanation of how biochemistry is running that algorithm. But the latter doesn’t need to be neuron-doctrine; it can involve dendritic spikes and gene expression and astrocytes, etc.

All the examples here are real and important, and would impact the algorithms of an “adequate” WBE, but are mostly not “neuron doctrine”, IIUC.

Basically, it’s the thing I wrote a long time ago here: “If some [part of] the brain is doing something useful, then it's humanly feasible to understand what that thing is and why it's useful, and to write our own CPU code that does the same useful thing.” And I think “doing something useful” includes as a special case everything that makes me me.

I don't get what you mean when you say stuff like "would be conscious (to the extent that I am), and it would be my consciousness (to a similar extent that I am)," since afaik you don't actually believe that there is a fact of the matter as to the answers to these questions…

Just, it’s a can of worms that I’m trying not to get into right here. I don’t have a super well-formed opinion, and I have a hunch that the question of whether consciousness is a coherent thing is itself a (meta-level) incoherent question (because of the (A) versus (B) thing here). Yeah, just didn’t want to get into it, and I haven’t thought too hard about it anyway.  :)
