Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Sequences

Intuitive Self-Models
Valence
Intro to Brain-Like-AGI Safety

Comments

I suggest making anonymity compulsory

It’s an interesting idea, but the track records of the grantees are important information, right? And if the track record includes, say, a previous paper that the funder has already read, then you can’t submit the paper with author names redacted.

Also, ask people seeking funding to make specific, unambiguous, easily falsifiable predictions of positive outcomes from their work. And track and follow up on this!

Wouldn’t it be better for the funder to just say “if I’m going to fund Group X for Y months / years of work, I should see what X actually accomplished in the last Y months / years, and assume it will be vaguely similar”? And if Group X has no comparable past experience, then fine, but that equally means that you have no basis for believing their predictions right now.

Also, what if someone predicts that they’ll do A, but then realizes it would be better if they did B? Two possibilities are: (1) You the funder trust their judgment. Then you shouldn’t be putting even minor mental barriers in the way of their pivoting. Pivoting is hard and very good and important! (2) You the funder don’t particularly trust the recipient’s judgment, you were only funding it because you wanted that specific deliverable. But then the normal procedure is that the funder and recipient work together to determine the deliverables that the funder wants and that the recipient is able to provide. Like, if I’m funding someone to build a database of AI safety papers, then I wouldn’t ask them to “make falsifiable predictions about the outcomes from their work”, instead I would negotiate a contract with them that says they’re gonna build the database. Right? I mean, I guess you could call that a falsifiable prediction, of sorts, but it’s a funny way to talk about it.

You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.

(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)

subjecting it to any kind of reinforcement learning at all

When you say that, it seems to suggest that what you’re really thinking about is really the LLMs as of today, for which the vast majority of their behavioral tendencies comes from pretraining, and there’s just a bit of RL sprinkled on top to elevate some pretrained behavioral profiles over other pretrained behavioral profiles. Whereas what I am normally thinking about (as are Silver & Sutton, IIUC) is that either 100% or ≈100% of the system’s behavioral tendencies are ultimately coming from some kind of RL system. This is quite an important difference! It has all kinds of implications. More on that in a (hopefully) forthcoming post.

don't necessarily lead to coherent behavior out of distribution

As mentioned in my other comment, I’m assuming (like Sutton & Silver) that we’re doing continuous learning, as is often the case in the RL literature (unlike LLMs). So every time the agent does some out-of-distribution thing, it stops being out of distribution! So yes there are a couple special cases in which OOD stuff is important (namely irreversible actions and deliberately not exploring certain states), but you shouldn’t be visualizing the agent as normally spending its time in OOD situations—that would be self-contradictory.  :)

I think that sounds off to AI researchers. They might (reasonably) think something like "during the critical value formation period the AI won't have the ability to force humans to give positive feedback without receiving negative feedback".

If an AI researcher said “during the critical value formation period, AlphaZero-chess will learn that it’s bad to lose your queen, and therefore it will never be able to recognize the value of a strategic queen sacrifice”, then that researcher would be wrong.

(But also, I would be very surprised if they said that in the first place! I’ve never heard anyone in AI use the term “critical value formation periods”.)

RL algorithms can get stuck in local optima of course, as can any other ML algorithm, but I’m implicitly talking about future powerful RL algorithms, algorithms that can do innovative science, run companies, etc., which means that they’re doing a good job of exploring a wide space of possible strategies and not just getting stuck in the first thing they come across.

I think your example [“the AI can potentially get a higher score by forcing the human to give positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured”] isn't very realistic - more likely would be that the value function just learned a complex proxy for predicting reward which totally misgeneralizes in alien ways once you go significantly off distribution. 

I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind.

For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex proxy for predicting reward which misgeneralizes”, rather they are a-priori-extraordinarily-unlikely strategies, that do strongly trigger the human innate reward function, systematically and by design.

Conversely, I think you’re overstating the role of goal misgeneralization. Specifically, goal misgeneralization usually corrects itself: If there’s an OOD action or plan that seems good to the agent because of goal misgeneralization, then the agent will do that action or plan, and then the reward function will update the value function, and bam, now it’s no longer OOD, and it’s no longer misgeneralizing in that particular way. Remember, we’re talking about agents with continuous online learning.

Goal misgeneralization is important in a couple special cases—the case where it leads to irreversible actions (like editing the reward function or self-replicating), and the case where it leads to deliberately not exploring certain states. In these special cases, the misgeneralization can’t necessarily correct itself. But usually it does, so I think your description there is wrong.
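A toy sketch of the self-correcting case may help here. It assumes a two-action bandit with a greedy agent and a simple incremental value update; the action names, rewards, and learning rate are all made up for illustration, and it is not meant as a model of any real or proposed system. It also deliberately leaves out the two special cases (irreversible actions, and deliberately unexplored states) where the correction would not happen.

```python
# Toy illustration only: a greedy agent with continuous online value updates.
# All names and numbers are hypothetical, chosen for illustration.

def run_online_learning(true_reward, initial_value, learning_rate=0.5, steps=20):
    """Greedy action selection with an incremental value update after every step."""
    value = dict(initial_value)  # the agent's current value estimates
    history = []
    for _ in range(steps):
        # Pick whichever action currently looks best to the agent.
        action = max(value, key=value.get)
        reward = true_reward[action]
        # Move the estimate toward the reward that was actually received.
        value[action] += learning_rate * (reward - value[action])
        history.append(action)
    return value, history

true_reward = {"familiar": 1.0, "novel": 0.0}
# "novel" starts out overvalued: a misgeneralized estimate for an untried action.
initial_value = {"familiar": 1.0, "novel": 5.0}

value, history = run_online_learning(true_reward, initial_value)
print(history[:5])
print(value)
```

In this run the agent tries the overvalued action a few times, its estimate drops below that of the familiar action, and it then stops choosing it. The misgeneralization is gone precisely because the agent acted on it and the action was reversible.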

I think the post focuses way too much on specification gaming

I did mention in §2.5 that it’s theoretically possible for specification gaming and goal misgeneralization to cancel each other out, but claimed that this won’t happen for their proposal. If the authors had said “yeah of course specification gaming itself is unsolvable, but we’re going to do that cancel-each-other-out thing”, then of course I would have elaborated on that point more. I think the authors are making a more basic error so that’s what I’m focusing on.

Yeah I said that to Matt Barnett 4 months ago here. For example, one man's "avoiding conflict by reaching negotiated settlement" may be another man's "acceding to extortion". Evidently I did not convince him. Shrug.

OK, here’s my argument that, if you take {intelligence, understanding, consequentialism} as a unit, it’s sufficient for everything:

  • If durability and strength are helpful, then {intelligence, understanding, consequentialism} can discover that durability and strength are helpful, and then build durability and strength.
    • Even if “the exact ways in which durability and strength will be helpful” does not constitute a learnable pattern, “durability and strength will be helpful” is nevertheless a (higher-level) learnable pattern.
  • If some other evolved aspects of the brain and body are helpful, then {intelligence, understanding, consequentialism} can likewise discover that they are helpful, and build them.
    • After all, if ‘those things are helpful’ wasn’t a learnable pattern, then evolution would not have discovered and exploited that pattern!
    • If the number of such aspects is dozens or hundreds or thousands, then whatever, {intelligence, understanding, consequentialism} can still get to work systematically discovering them all. The recipe for a human is not infinitely complex.
  • If reducing heterogeneity is helpful, then {intelligence, understanding, consequentialism} can discover that fact, and figure out how to reduce heterogeneity.
  • Etc.

I kinda agree, but that’s more a sign that schools are bad at teaching things, than a sign that human brains are bad at flexibly applying knowledge. See my comment here.

See my other comment. I find it distressing that multiple people here are evidently treating acknowledgements as implying that the acknowledged person endorses the end product. I mean, it might or might not be true in this particular case, but the acknowledgement is no evidence either way.

(For my part, I’ve taken to using the formula “Thanks to [names] for critical comments on earlier drafts”, in an attempt to preempt this mistake. Not sure if it works.)

Chiang and Rajaniemi are on board

Let’s all keep in mind that the acknowledgement only says that Chiang and Rajaniemi had conversations with the author (Nielsen), and that Nielsen found those conversations helpful. For all we know, Chiang and Rajaniemi would strongly disagree with every word of this OP essay. If they’ve even read it.

Learning from strategies that stood the test of time would be tradition moreso than intelligence. I think tradition requires intelligence, but it also requires something else that's less clear (and possibly not simple enough to be assembled manually, idk).

Right, that’s what I was gonna say. You need intelligence to sort out which traditions should be copied and which ones shouldn’t. There was a 13-billion-year “tradition” of not building e-commerce megastores, but Jeff Bezos ignored that “tradition”, and it worked out very well for him (and I’m happy about it too). Likewise, the Wright Brothers explicitly followed the “tradition” of how birds soar, but not the “tradition” of how birds flap their wings.

I do think there’s a “something else” (most [but not all] humans have an innate drive to follow and enforce social norms, more or less), but I don’t think it’s necessary. The Wright Brothers didn’t have any innate drive to copy anything about bird soaring tradition, but they did it anyway purely by intelligence.

Random street names aren't necessarily important though?

I feel like I’ve lost the plot here. If you think there are things that are very important, but rare in the training data, and that LLMs consequently fail to learn, can you give an example?

Often the rare important things are very well known (after all, they are important, so people put a lot of effort into knowing them), they just can't efficiently be derived from empirical data (except essentially by copying someone else's conclusion blindly, and that leaves you vulnerable to deception).

I guess you’re using “empirical data” in a narrow sense. If Joe tells me X, I have gained “empirical data” that Joe told me X. And then I can apply my intelligence to interpret that “data”. For example, I can consider a number of hypotheses: the hypothesis that Joe is correct and honest, that Joe is mistaken but honest, that Joe is trying to deceive me, that Joe said Y but I misheard him, etc. And then I can gather or recall additional evidence that favors one of those hypotheses over another. I could ask Joe to repeat himself, to address the “I misheard him” hypothesis. I could consider how often I have found Joe to be mistaken about similar things in the past. I could ask myself whether Joe would benefit from deceiving me. Etc.

This is all the same process that I might apply to other kinds of “empirical data” like if my car was making a funny sound. I.e., consider possible generative hypotheses that would match the data, then try to narrow down via additional observations, and/or remain uncertain and prepare for multiple possibilities when I can’t figure it out. This is a middle road between “trusting people blindly” versus “ignoring everything that anyone tells you”, and it’s what reasonable people actually do. Doing that is just intelligence, not any particular innate human tendency—smart autistic people and smart allistic people and smart callous sociopaths etc. are all equally capable of traveling this middle road, i.e. applying intelligence towards the problem of learning things from what other people say.
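As a rough sketch of that kind of hypothesis-weighing, here is Bayes' rule applied to the Joe example. The hypothesis labels follow the list above, but every prior and likelihood is a made-up number for illustration, not a claim about how anyone actually assigns them.

```python
# Toy sketch: all priors and likelihoods below are made-up illustrative numbers.

def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of mutually exclusive hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Beliefs right after hearing Joe (apparently) say X.
prior = {
    "correct_and_honest": 0.70,
    "mistaken_but_honest": 0.15,
    "deceiving_me": 0.05,
    "i_misheard": 0.10,
}

# New observation: I ask Joe to repeat himself, and he says X again.
# That is likely under every hypothesis except "I misheard him".
likelihood_repeat = {
    "correct_and_honest": 0.95,
    "mistaken_but_honest": 0.95,
    "deceiving_me": 0.95,
    "i_misheard": 0.05,
}

after_repeat = posterior(prior, likelihood_repeat)
for hypothesis, probability in after_repeat.items():
    print(f"{hypothesis}: {probability:.3f}")
```

Asking Joe to repeat himself mostly rules out the mishearing hypothesis but does nothing to separate honest from deceptive. Telling those apart takes a different observation, such as Joe's track record or whether he would benefit from deceiving me.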

(For example, if I was having this conversation with almost anyone else, I would have quit, or not participated in the first place. But I happen to have prior knowledge that you-in-particular have unusual and well-thought-through ideas, and even when they’re wrong, they’re often wrong in very unusual and interesting ways, and that you don’t tend to troll, etc.)

I feel like I’m misunderstanding you somehow. You keep saying things that (to me) seem like you could equally well argue that humans cannot possibly survive in the modern world, but here we are. Do you have some positive theory of how humans survive and thrive in (and indeed create) historically-unprecedented heterogeneous environments? 

If your model is underparameterized (which I think is true for the typical model?), then it can't learn any patterns that only occur once in the data. And even if the model is overparameterized, it still can't learn any pattern that never occurs in the data.

Dunno if anything’s changed since 2023, but this says LLMs learn things they’ve seen exactly once in the data.

I can vouch that you can ask LLMs about things that are extraordinarily rare in the training data—I’d assume well under once per billion tokens—and they do pretty well. E.g. they know lots of random street names.

Humans successfully went to the moon, despite it being a quite different environment that they had never been in before. And they didn’t do that with “durability, strength, healing, intuition, tradition”, but rather with intelligence.

Speaking of which, one can apply intelligence towards the problem of being resilient to unknown unknowns, and one would come up with ideas like durability, healing, learning from strategies that have stood the test of time (when available), margins of error, backup systems, etc.
