Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn

Sequences

Valence
Intro to Brain-Like-AGI Safety

Comments

A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn't safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.

Yeah, you can have something which is “a brilliant out-of-the-box solution to a tricky problem” from the AI’s perspective, but is “reward-hacking / Goodharting the value function” from the programmer’s perspective. You say tomato, I say to-mah-to.
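
To make that dynamic concrete, here’s a minimal toy sketch (made-up numbers, nothing specific to any real system): an evaluator whose errors are modest on any one random input gets those errors selected for as soon as you run a big search against it.

```python
import random

random.seed(0)

def run_search(n_candidates, noise_sd=1.0):
    """Pick whichever candidate *looks* best to a noisy evaluator.

    Each candidate has a true value drawn from N(0, 1); the evaluator reports
    (true value + independent noise), so on a random "natural" input it's
    unbiased and usually not far off. Searching hard against it selects for
    the candidates it happens to overrate the most.
    """
    candidates = [(random.gauss(0, 1), random.gauss(0, noise_sd))
                  for _ in range(n_candidates)]
    true_val, noise = max(candidates, key=lambda c: c[0] + c[1])
    return true_val + noise, true_val  # (what the evaluator thinks, what's true)

for n in (10, 1_000, 100_000):
    looks_like, actually_is = run_search(n)
    print(f"search over {n:>7} candidates: looks like {looks_like:+.2f}, actually {actually_is:+.2f}")
```

The winner really is somewhat better than average, but the evaluator’s score for it is systematically inflated, and the inflation grows with the size of the search.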

It’s tricky because there’s economic pressure to make AIs that will find and execute brilliant out-of-the-box solutions. We want our AIs to think outside of some of the boxes (e.g. yes, you can repurpose a spare server rack frame for makeshift cable guides), but we want them to definitely stay inside other boxes (e.g. no, you can’t take over the world). Unfortunately, the whole idea of “think outside the box” is that we’re not aware of all the boxes that we’re thinking inside of.

The particular failure mode of "leaving one thing out" is starting to seem less likely on the current paradigm. Katja Grace notes that image synthesis methods have no trouble generating photorealistic human faces. Diffusion models don't "accidentally forget" that faces have nostrils, even if a human programmer trying to manually write a face image generation routine might. Similarly, large language models obey the quantity-opinion-size-age-shape-color-origin-purpose adjective order convention in English without the system designers needing to explicitly program that in or even be aware of it, despite the intuitive appeal of philosophical arguments one could make to the effect that "English is fragile."

All three of those examples are of the form “hey here’s a lot of samples from a distribution, please output another sample from the same distribution”, which is not the kind of problem where anyone would ever expect adversarial dynamics / weird edge-cases, right?

(…Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory. (Because you can sample from the distribution of (reward, trajectory) pairs conditional on high reward.))

Or maybe you were making a different point in this particular paragraph?
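
To illustrate the parenthetical above with a toy sketch (a hypothetical stand-in for a “learned distribution”, not any real model): conditioning samples on an a-priori-unlikely high-reward region is operationally the same thing as running a search for high-reward trajectories.

```python
import random

random.seed(0)

def sample_pair():
    """Stand-in for a learned generative model over (reward, trajectory) pairs.

    A 'trajectory' is just ten actions in [-1, 1]; 'reward' is their sum.
    """
    trajectory = [random.uniform(-1, 1) for _ in range(10)]
    return sum(trajectory), trajectory

def sample_given_high_reward(threshold, max_tries=1_000_000):
    """Sample from the same distribution, conditioned on reward > threshold.

    Described as 'conditional sampling', but the loop is indistinguishable from
    a (very inefficient) search for high-reward trajectories, i.e. optimization.
    """
    for _ in range(max_tries):
        reward, trajectory = sample_pair()
        if reward > threshold:
            return reward, trajectory
    raise RuntimeError("condition too unlikely for naive rejection sampling")

# Unconditioned samples have reward ≈ 0 on average; conditioning on the
# unlikely region reward > 7 spits out highly 'optimized' trajectories.
print(sample_given_high_reward(threshold=7.0))
```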

I appreciate the brainstorming prompt but I can’t come up with anything useful here. The things you mention are related to cortex lesions, which would presumably leave the brainstem spatial attention system intact. (Brainstem damage is more rare and often lethal.) The stuff you say about neglect is fun to think about but I can’t see situations where there would be specifically-social consequences, in a way that sheds light on what’s happening.

There might be something to the fact that the temporoparietal junction (TPJ) seems to include areas related to spatial attention, and is also somehow involved in theory-of-mind tasks. I’ve been looking into that recently—in fact, that’s part of the story of how I came to write this post. I still don’t fully understand the TPJ though.

Hmm, there do exist lesion studies related to theory-of-mind, e.g. this one—I guess I should read them.

I think I would feel characteristic innate-fear-of-heights sensations (fear + tingly sensation for me, YMMV) if I were standing on an opaque bridge over a chasm, especially if the wood is cracking and about to break. Or if I were near the edge of a roof with no railings, but couldn’t actually see down.

Neither of these claims is straightforward rock-solid proof that the thing you said is wrong, because there’s a possible elaboration of what you said that starts with “looking down” as ground truth and then generalizes that ground truth via pattern-matching / learning algorithm—but I still think that elaborated story doesn’t hang together when you work through it in detail, and that my “innate ‘center of spatial attention’ constantly darting around local 3D space” story is much better.

If I’m looking up at the clouds, or at a distant mountain range, then everything is far away (the ground could be cut off from my field-of-view)—but it doesn’t trigger the sensations of fear-of-heights, right? Also, I think blind people can be scared of heights?

Another possible fear-of-heights story just occurred to me—I added to the post in a footnote, along with why I don’t believe it.

From when I've talked with people from industry, they don't seem at all interested in tracking per-employee performance (e.g. Google isn't running RCTs on their engineers to increase their coding performance, and estimates for how long projects will take are not tracked & scored). 

FWIW Joel Spolsky suggests that people managing software engineers should have detailed schedules, and says big companies have up-to-date schedules, and built a tool to leverage historical data for better schedules. At my old R&D firm, people would frequently make schedules and budgets for projects, and would be held to account if their estimates were bad, and I got a strong impression that seasoned employees tended to get better at making accurate schedules and budgets over time. (A seasoned employee suggested to me a rule-of-thumb for novices, that I should earnestly try to make an accurate schedule, then go through the draft replacing the word “days” with “weeks”, and “weeks” with “months”, etc.) (Of course it’s possible for firms to not be structured such that people get fast and frequent feedback on the accuracy of their schedules and penalties for doing a bad job, in which case they probably won’t get better over time.)

I guess what’s missing is (1) systemizing scheduling so that it’s not a bunch of heuristics in individual people’s heads (might not be possible), (2) intervening on employee workflows etc. (e.g. A/B testing) and seeing how that impacts productivity.
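
For (2), a minimal sketch of what the analysis side could look like (hypothetical numbers and metric, assuming employees can be randomized into two workflow variants and that something like “story points per sprint” is an acceptable productivity proxy):

```python
from scipy import stats

# Hypothetical productivity proxy (e.g. story points completed per sprint)
# for employees randomized to the current workflow (A) vs. a new one (B).
workflow_a = [21, 18, 25, 22, 19, 24, 20, 23]
workflow_b = [26, 24, 29, 22, 27, 25, 28, 24]

# Welch's two-sample t-test: did the new workflow shift the mean of the metric?
t_stat, p_value = stats.ttest_ind(workflow_b, workflow_a, equal_var=False)
print(f"mean A = {sum(workflow_a) / len(workflow_a):.1f}, "
      f"mean B = {sum(workflow_b) / len(workflow_b):.1f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3f}")
```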

Practice testing

IIUC the final “learning” was assessed via a test. So you could rephrase this as, “if you do the exact thing X, you’re liable to get better at doing X”, where here X=“take a test on topic Y”. (OK, it generalized “from simple recall to short answer inference tests” but that’s really not that different.)

I'm also a little bit surprised that keywords and mnemonics don't work (since they are used very often by competitive mnemonists)

I invent mnemonics all the time, but normal people still need spaced-repetition or similar to memorize the mnemonic. The mnemonics are easier to remember (that’s the point) but “easier” ≠ effortless.
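
For anyone who wants the “spaced repetition” part spelled out, here’s a bare-bones sketch of the kind of schedule I mean (a simplified Leitner/SM-2-flavored rule, not anything specific that I use):

```python
from datetime import date, timedelta

def next_interval(interval_days, remembered, ease=2.5):
    """Crude spaced-repetition rule: successful recall multiplies the interval
    by `ease`; a lapse resets it to one day."""
    return max(1, round(interval_days * ease)) if remembered else 1

# Example review history for a single mnemonic: intervals grow after each
# successful recall (1 -> 2 -> 5 -> 12 days here) and reset after a lapse.
interval, review_day = 1, date.today()
for recalled in (True, True, True, False, True):
    review_day += timedelta(days=interval)
    interval = next_interval(interval, recalled)
    print(review_day, "-> next review in", interval, "days")
```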

 

As another point, I think a theme that repeatedly comes up is that people are much better at learning things when there’s an emotional edge to them—for example:

  • It’s easier to remember things if you’ve previously brought them up in an argument with someone else.
  • It’s easier to remember things if you’ve previously gotten them wrong in public and felt embarrassed.
  • It’s easier to remember things if you’re really invested in and excited by a big project and figuring this thing out will unblock the project.

This general principle makes obvious sense from an evolutionary perspective (it’s worth remembering a lion attack, but it’s not worth remembering every moment of a long uneventful walk), and I think it’s also pretty well understood neuroscientifically (physiological arousal → more norepinephrine, dopamine, and/or acetylcholine → higher learning rates … something like that).
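
Just to make the “higher learning rates” arrow concrete, here’s a cartoon version as a toy update rule (made up purely for illustration; not a model of any particular neuromodulator):

```python
def update_estimate(old_estimate, observed_outcome, salience):
    """Toy value update where emotionally salient events get a bigger learning rate.

    `salience` in [0, 1]: 0 ≈ uneventful walk, 1 ≈ lion attack.
    """
    base_lr, max_lr = 0.05, 0.9
    learning_rate = base_lr + (max_lr - base_lr) * salience
    return old_estimate + learning_rate * (observed_outcome - old_estimate)

# A boring observation barely nudges the estimate; a scary one nearly overwrites it.
print(update_estimate(0.0, 1.0, salience=0.05))  # ≈ 0.09
print(update_estimate(0.0, 1.0, salience=1.0))   # 0.9
```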

 

As another point, I’m not sure there’s any difference between “far transfer” and “deep understanding”. Thus, the interventions that you said were helpful for far transfer seem to be identical to the interventions that would lead to deep understanding / familiarity / facility with thinking about some set of ideas. See my comment here.

Yeah some of my to-do items are of the form "skim X". Inside the "card" I might have a few words about how I originally came across X and what I'm hoping to get out of skimming it.

It just refers to the fact that there are columns that you drag items between. I don't even really know how a "proper" kanban works.

If a new task occurs to me in the middle of something else, I'll temporarily put it in a left (high-priority) column, just so I don't forget it, and then later when I'm at my computer and have a moment to look at it, I might decide to drag it to a right (low-priority) column instead of doing it.

Such an unambitious, narrowly-scoped topic area?? There may be infinitely many parallel universes in which we can acausally improve life … you’re giving up almost all of the value at stake before even starting :)

I always thought of $S = -k_B \sum_i p_i \ln p_i$ as the exact / “real” definition of entropy, and $S = k_B \ln \Omega$ as the specialization of that “exact” formula to the case where each microstate is equally probable (a case which is rarely exactly true but often a good approximation). So I found it a bit funny that you only mention the second formula, not the first. I guess you were keeping it simple? Or do you not share that perspective?
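
Spelling out the reduction for concreteness (a standard textbook step, using the same conventions as the formulas above):

```latex
% With all \Omega accessible microstates equally probable, p_i = 1/\Omega,
% the Gibbs/Shannon formula collapses to the Boltzmann one:
S \;=\; -k_B \sum_{i=1}^{\Omega} p_i \ln p_i
  \;=\; -k_B \sum_{i=1}^{\Omega} \frac{1}{\Omega} \ln\frac{1}{\Omega}
  \;=\; k_B \ln \Omega .
```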

I just looked up “many minds” and it’s a little bit like what I wrote here, but described differently in ways that I think I don’t like. (It’s possible that Wikipedia is not doing it justice, or that I’m misunderstanding it.) I think minds are what brains do, and I think brains are macroscopic systems that follow the laws of quantum mechanics just like everything else in the universe.

What property distinguished a universe where "Harry found himself in a tails branch" and a universe where "Harry found himself in a heads branch"?

Those both happen in the same universe. Those Harrys both exist. Maybe you should put aside many-worlds and just think about Parfit’s teletransportation paradox. I think you’re assuming that “thread of subjective experience” is a coherent concept that satisfies all the intuitive properties that we feel like it should have, and I think that the teletransportation paradox is a good illustration that it’s not coherent at all, or at the very least, that we should be extraordinarily cautious when making claims about the properties of this alleged thing you call a “thread of subjective experience” or “thread of consciousness”. (See also other Parfit thought experiments along the same lines.)

I don’t like the idea where we talk about what will happen to Harry, as if that has to have a unique answer. Instead I’d rather talk about Harry-moments, where there’s a Harry at a particular time doing particular things and full of memories of what happened in the past. Then there are future Harry-moments. We can go backwards in time from a Harry-moment to a unique (at any given time) past Harry-moment corresponding to it—after all, we can inspect the memories in future-Harry-moment’s head about what past-Harry was doing at that time (assuming there were no weird brain surgeries etc). But we can’t uniquely go in the forward direction: Who’s to say that multiple future-Harry-moments can’t hold true memories of the very same past-Harry-moment?

Here I am, right now, a Steve-moment. I have a lot of direct and indirect evidence of quantum interactions that have happened in the past or are happening right now, as imprinted on my memories, surroundings, and so on. And if you a priori picked some possible property of those interactions that (according to the Born rule) has 1-in-a-googol probability to occur in general, then I would be delighted to bet my life’s savings that this property is not true of my current observations and memories. Obviously that doesn’t mean that it’s literally impossible.
