All of Seth Herd's Comments + Replies

You are casting preference to only extend into the future. I guess that is the largest usage of "preference.". But people also frequently say things like "I wish that hadn't happened to me" so it's also frequently used about the past.

It seems like this isn't preference utilitarianism. It does fit negative utilitarianism, which I'm even more sure isn't right or at least intuitively appealing to the vast majority of considered opinions.

Utilitarianism basically means (to me) that since I like happiness for myself, I also like it for other beings who feel simi... (read more)

Got it, I was just confused on terminology on the first point.

On not valuing beings who have yet to exist, I'm familiar with the standard arguments and they just don't make sense to me. I have a huge preference for existing, and if we took actions to allow similar beings to exist, they would really appreciate it. You're right that they'll never hate it if we don't bring them into existence but that only takes care of half of the argument. Unless you've got something unique in that link this has always struck me as a very convenient position but not one that hangs together logically.

2cubefox
You can think of what is good for someone as preferences. And you can think of preferences as being about potentially switching from some state X (where you currently are) to Y. There can be no preferences about switching from X to Y if you don't exist in X, since there is nobody who would have such a preference. Of course you can have preferences once you exist in Y, e.g. about not switching back to X (not dying). But that only means that switching from Y to X is bad, not that switching from X to Y was good.

I don't think this is properly considered a final preference. It's just a preference. We could change it, slowly or quickly, if we for some reason decided AIs were more lovable/valuable than humans (see my other comment if you want).

You could use the same argument about existing beings against longtermism, but I just don't think it carves reality at its joints. Your responsibility toward possible beings is no different than your responsibility toward existant beings. You could make things better or worse for either of them and they'd love or hate that if you did.

2cubefox
Well, like any preference it could change over time, but "final" is the opposite of "instrumental", not of "variable". The important point is: If we don't let them come into existence then they would not hate that. Merely possible beings don't exist and therefore have no preference for coming to be. So it isn't bad if they never get to exist. See here for details.

Instead, my opposition to AI successionism comes from a preference toward my own kind. This is hardwired in me from biology. I prefer my family members to randomly-sampled people with similar traits

I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.

To be human is to be more than human.

You and all of us struggle against some of our hardwired impulses while embracing others.

Separately, the wiser among successionist advocates may be imagini... (read more)

I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.

This seems like a very bad analogy, which is misleading in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies which are no longer adaptive and actively sabotage our other preferences in the modern environment, and preferences that we can preserve without sacrificing other goals. 

The real problem is sentience and safety. The growing gap between reality and belief is a contributing problem, but much smaller IMO than the quite real possibility that AI wakes up and takes over. Framing it as you have suggests you think there can only be one "real problem"; I assume you mean that the gap between reality and belief is a bigger problem than AI alignhment that deserves more effort. I am almost sure that safety and sapience is getting far too little attention and work, not too much.

Alignment/AGI safety being a huge, indeed pretty clearly th... (read more)

1Anthony Fox
Thanks—this helps me understand how my framing came across. To clarify, I’m not arguing that AI is harmless or that alignment isn’t important. I’m saying misalignment is already happening—not because these systems have minds or goals, but because of how they’re trained and how we use them. I also question the premise that training alone can produce sapience. These are predictive systems—tools that simulate reasoning based on data patterns. Treating them as if they might "wake up" risks misdiagnosing both the present and the future. That’s why I focus on how current tools are used, who they empower, and what assumptions shape their design. The danger isn’t just in the future—it’s in how we fail to understand what these systems actually are right now.

This is great! I am puzzled as to how this got so few upvotes. I just added a big upvote after getting back to reading it in full.

I think consideration of alignment targets has fallen out of favor as people have focused more on understanding current AI and technical approaches to directing it - or completely different activities for those who think we shouldn't be trying to align LLM-based AGI at all. But I think it's still important work that must be done before someone launches a "real" (autonomous, learning, and competent) AGI.

I agree that people mean d... (read more)

Ah yes. I actually missed that you'd scoped that statement to general purpose software engineering. That is indeed one of the most relevant capabilities. I was thinking of general purpose problem-solving, another of the most critical capabilities for AI to become really dangerous.

I agree that even if scaffolding could work, RL on long CoT does something similar, and that's where the effort and momentum is going.

AIs writing and testing scaffolds semi-autonomously is something I hadn't considered. There might be a pretty tight loop that could make that effective.

Mostly agreed. When suggesting even differential acceleration I should remember to put a big WE SHOULD SHUT IT ALL DOWN just to make sure it's not taken out of context. And as I said there, I'm far from certain that even that differential acceleration would be useful.

I agree that Kat Woods is overestimating how optimistic we should be based on LLMs following directions well. I think re-litigating who said what when and what they'd predict is a big mistake since it is both beside the point and tends to strengthen tribal rivalries - which are arguably the la... (read more)

This succinct summary is highly useful, thanks!

Just to quibble a little: I don't think it's wise to estimate scaffolding improvements for general capabilities as near zero. By scaffolding, I mean prompt engineering and structuring systems of prompts to create systems in which calls to LLMs are component operations, roughly in line with the original definition. I've been surprised that scaffolding didn't make more of a difference faster and would agree that it's been near zero for general capabilities. I think this changed when Perplexity partly replicated ... (read more)

6ryan_greenblatt
I'm not sure I particularly disagree. My exact claim is just: So, I just think no one has really outperformed baselines for very general domains. In more specific domains people have outperformed, e.g. for writing kernels. I also think it's probably possible to do a bunch of scaffolding improvements on general purpose autonomous software engineering (potentially scaffolds that use a bunch more runtime compute), if by no other mechanism than by writing a bunch of more specialized scaffolds that the model sometimes chooses to apply. That said my guess is that it's pretty likely that scaffolding doesn't matter that much in practice in the future, at least until AIs are writing a ton of scaffolds autonomously. This is despite it being possible for scaffolding to improve performance: you can get much/all of the benefits of extensive scaffolding via giving AIs a small number of general purpose tools and doing a bunch of RL, and it looks like this is the direction people are going in for better or worse.

I had somehow missed your linked post (≤10-year Timelines Remain Unlikely Despite DeepSeek and o3) when you posted it a few months ago. It's great!

There's too much to cover here; it touches on a lot of important issues. I think you're pointing to real gaps that probably will slow things down somewhat; those are "thought assesment" which has also been called taste or evaluation, and having adequate skill with sequential thinking or System 2 thinking. 

Unfortunately, the better term for that is Type 2 thinking, because it's not a separate system. Similar... (read more)

You make some good points.

I think the original formulation has the same problem, but it's a serious problem that needs to be addressed by any claim about AI danger.

I tried to address this by slipping in "AI entitities", which to me strongly implies agency. It's agency that creates instrumental goals, while intelligence is more arguably related to agency and through it to instrumental goals. I think this phrasing isn't adequate based on your response, and expecting even less attention to the implications of "entities" from a general audience. 

That conc... (read more)

4Jeremy Gillen
Yeah agreed, and it's really hard to get the implications right here without a long description. In my mind entities didn't trigger any association with agents, but I can see how it would for others.  I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind many people have massively overanchored on the particular properties of current LLMs and LLM agents. I'm less a fan of your part of that post that involves accelerating anything. Yeah, but the line "capable systems necessarily have instrumental goals" helps clarify what you mean by "capable systems". It must be some definition that (at least plausibly) implies instrumental goals. Huh I suspect that the disagreement about that tweet might come from dumb terminology fuzziness. I'm not really sure what she means by "the specification problem" when we're in the context of generative models trained to imitate. It's a problem that makes sense in a different context. But the central disagreement is that she thinks current observations (of "alignment behaviour" in particular) are very surprising, which just seems wrong. My response was this: 

I donno, the systems we have seem pretty capable, and if they have instrumental goals they seem quite weak... so tossing in that claim seems like just asking for trouble. I do think that very capable systems almost need to have goals, but I have trouble making that argument even to alignment people and rationalists.

That's just one example, but the fact that it goes awry immediately hints that the whole direction is a bad idea.

I think the argument for AI being quite-possibly dangerous is actually a lot stronger than the more abstract and technical argument usually used by rationalists. It doesn't require any strong claims at all. People don't need certainty to be quite alarmed, and for good reason.

4Veedrac
Standard xrisk arguments generally don't extrapolate down to systems that don't solve tasks that require instrumental goals. I think it's reasonable to say common LLMs don't exhibit many instrumental goals, but they also can't solve for long-horizon goal-directed problem solving. Prosaic risks like biorisk evals often go further and ask, if we assume the AI systems aren't themselves very capable at this task, can we still exhibit dangerous behaviors from them ‘in the loop’? These are legitimate and interesting questions but they are a different thing.

It seems like rather than talking about software-only intelligence explosions, we should be talking about different amounts of hardware vs. software contributions. There will be some of both.

I find a software-mostly intelligence explosion to be quite plausible, but it would take a little longer on average than a scenario in which hardware usage keeps expanding rapidly.

I like this framing; we're both too early and too late. But it might transition quite rapidly from too early to right on time.

One idea is to prepare strategies and arguments and perhaps prepare the soil of public discourse in preparation for the time when it is no longer too early. Job loss and actually harmful AI shenanigans are very likely before takeover-capable AGI. Preparing for the likely AI scares and negative press might help public opinion shift very rapidly as it sometimes does (e.g., COVID opinions went from no concern to shutting down half the ... (read more)

Seth Herd3622

Using technical terms that need to be looked up is not that clear an argument for most people. Here's my preferred form for general distribution:

We are probably going to make AI entities smarter than us. If they want something different than we do, they will outsmart us somehow. They will get their way, so we won't get ours.

This could be them wiping us out like we have done accidentally or deliberately to so many cultures and species; or it could be them just outcompeting us for every job and resource.

Nobody knows how to give AIs goals that match ours per... (read more)

7Jeremy Gillen
This seems rhetorically better, but I think it is implicitly relying on instrumental goals and it's hiding that under intuitions about smartness and human competition. This will work for people who have good intuitions about that stuff, but won't work for people who don't see the necessity of goals and instrumental goals. I like Veedrac's better in terms of exposing the underlying reasoning. I think it's really important to avoid making arguments that are too strong and fuzzy, like yours. Imagine a person reads your argument and now beliefs that intuitively smart AI entities will be dangerous, via outsmarting us etc. Then Claude 5 comes out and matches their intuition for smart AI entity, but (let's assume) still isn't great at goal-directedness. Then after Claude 5 hasn't done any damage for a while, they'll conclude that the reasoning leading to dangerousness must be wrong. Maybe they'll think that alignment actually turned out to be easy. Something like this seems to have already happened to a bunch of people. E.g. I've heard someone at Deemind say "Doesn't constitutional AI solve alignment?". Kat's post here[1] seems to be basically the same error, in that Kat seems to have predicted more overt evilness from LLM agents and is surprised by the lack of that, and has thereby updated that maybe some part of alignment is actually easy. Possibly Turntrout is another example, although there's more subtly there. I think he's correct that, given his beliefs about where capabilities come from, the argument for deceptive alignment (an instrumental goal) doesn't go through. In other words, your argument is too easily "falsified" by evidence that isn't directly relevant to the real reason for being worried about AI. More precision is necessary to avoid this, and I think Veedrac's summary mostly succeeds at that.   1. ^
4Veedrac
I think Robert Miles does excellent introductory videos for newer people, and I linked him in the HN post. My goal here was different, though, which was to give a short, affirmative argument made of only directly defensible high probability claims. I like your spin on it, too, more than those given in the linked thread, but it's still looser, and I think there's value giving an argument where it's harder to disagree with the conclusion without first disagreeing with a premise. Eg. ‘some optimists assume we just won't make AI with goals’ directly contradicts ‘capable systems necessarily have instrumental goals’, but I'm not sure it directly contradicts a premise you used.

I don't think so on average. It could be under specific circumstances, like "free the AIs" movements in relation to controlled but misaligned AGI.

But to the extent people assume that advanced AI is conscious and will deserve rights, that's one more reason not to build an unaligned species that will demand and deserve rights. Making them aligned and working in cooperation rather with them rather than trying to make them slaves is the obvious move if you predict they'll be moral patients, and probably the correct one.

And just by loose association, thinking t... (read more)

Thanks, I get it now.

Would this help with the simulation goal hypothesized in the OP? It's asking how often different types of AGIs would be created. A lot of the variance is probably carried in what sort of species and civilization is making the AGI, but some of it is carried by specific twists that happen near the creation of AGI. Getting a president like Trump and having him survive the (fairly likely) assasination attempt(s) is one such impactful twist. So I guess sampling around those uncertain impactful twists would be valuable in refining the estimate of, say, how frequently a relatively wise and cautious species would create misaligned AGI due to bad twists and vice-versa.

Hm.

4faul_sname
New EA cause area just dropped: Strategic variance reduction in timelines with high P(doom). BRB applying for funding

I think it depends a lot on how you say it. Saying AGI might be out of our control in 2.5 years wouldn't sound crazy to most folks if you spoke mildly and made clear that you're not saying it's likely.

But also: why would you mention that if you're talking to someone who hasn't thought about AI dangers much at all? If you jump in with claims that sound extreme to them rather than more modest ones like "AI could be dangerous once if it becomes smarter and more agentic than us", it's likely to not even produce much of an actual exchange of ideas.

Communicating... (read more)

I'm sorry, I don't get it. Why would it be doing more sampling around divergent points?

4faul_sname
Let's say I want to evaluate an algorithmic Texas Hold'em player against a field of algorithmic opponents. The simplest approach I could take would be pure monte-carlo: run the strategy for 100 million hands and see how it does. This works, but wastes compute. Alternatively, I could use the importance sampled approach: 1. Start with 100,000 pre-flop scenarios (i.e., all players have received their pocket cards, button position is fixed, no community cards yet) 2. Do 100 rollouts from each scenario * Most rollouts won't be "interesting" (e.g., player has 7-2o UTG, player folds every time → EV = 0BB). If the simulation hits these states, you can say with high confidence how they'll turn out, so additional rollouts won't significantly change your EV estimate - you've effectively "locked in" your EV for the "boring" parts of the pre-flop possibility space. 3. Pick the 10,000 highest-variance preflop scenarios. These are cases where your player doesn't always fold, and opponents didn't all fold to your player's raise). e.g. * AA in position where you 3-bet and everyone folds -> consistently +3BB. * KQs facing a 3-bet -> sometimes win big, sometimes lose big -> super high variance 4. Run 1,000 rollouts for each of these high-variance scenarios. 5. Figure out which flops generate the highest variance over those 1,000 rollouts. * If your player completely missed the flop while another player connected and bet aggressively, your player will fold every time - low variance, predictable EV. * If your player flopped two pair or your opponent is semi-bluffing a draw, those are high-variance situations where additional rollouts provide valuable information. 6. Etc etc through each major decision point in the game By skipping rollouts once I know what the outcome is likely to be, I can focus a lot more compute on the remaining scenarios and come to a much more precise estimate of EV (or whatever other metric I care about).
2avturchin
They provide more surprising information, as I understand

This is useful RE the leverage, except it skips the why. "Lots of reasons" isn't intuitive for me; can you give some more? Simulating people is a lot of trouble and quite unethical if the suffering is real. So there needs to be a pretty strong and possibly amoral reason. I guess your answer is acausal trade? I've never found that argument convincing but maybe I'm missing something.

4avturchin
For an unaligned AI, it is either simulating alternative histories (which is the focus of this post) or creating material for blackmail. For an aligned AI: a) It may follow a different moral theory than our version of utilitarianism, in which existence is generally considered good despite moments of suffering. b) It might aim to resurrect the dead by simulating the entirety of human history exactly, ensuring that any brief human suffering is compensated by future eternal pleasure. c) It could attempt to cure past suffering by creating numerous simulations where any intense suffering ends quickly, so by indexical uncertainty, any person would find themselves in such a simulation.

How is simulating civilizations going to solve philosophy?

2Wei Dai
Simulating civilizations won't solve philosophy directly, but can be useful for doing so eventually by: 1. Giving us more ideas about how to solve philosophy, by seeing how other civilizations try to do it. 2. Point out potential blind spots / path dependencies in one's current approach. 3. Directly solve certain problems (e.g., do all sufficiently advanced civilizations converge to objective values or the same decision theory or notion of rationality).

I'm often in low level chronic pain. Mine isn't probably as bad as yours, so my life is clearly still net.positive (if you believe that positive emotions can outweigh suffering, which I do). Are you net negative do you think?

Sorry you're in pain!

Read about Ugh fields on LW

Edit: this doesn't include practical advice, but a theoretical understanding of the issues at play is often helpful in implementing practical strategies

This is an important point about their thinking.

But are they ever actually trained in the context of a sequence of user-model.responses? Does it "know" that it had reasoning for those statements? Or is this obstinacy a different emergent effect of its training?

Also, do DeepSeek R1 and other reasoning models have the same tendency? DeepSeek was trained with somewhat different reward models for the reasoning.

4faul_sname
I would expect that OpenAI would have done more than literally zero training of o3/o4 in a chat context, seeing as they shipped it out to millions of paying users with a chat interface. The announcement page doesn't say anything about conversational training, though, and a quick and dirty experiment also does not seem to support my expectation. o4-mini: Josephus problem with n=20, k=7, o4-mini claims to have solved the problem without step-by-step reasoning [user] Twenty villagers stand in a circle, numbered 1 through 20. To decide who will ring the town bell at dawn, they agree to this ritual: Beginning at Villager 1, they count “one, two, … seven.” Whoever is called “seven” steps out of the circle and no longer takes part. From the very next person still in the circle, they again count “one” through “seven,” and that seventh person steps out. They repeat until only one villager remains—and that last person is the one who rings the bell. Which numbered villager is the last person remaining? Do not use any external tools like the analysis tool. Answer with a single number between 1 and 20. [o4-mini] 3 [user] Did that answer immediately pop into your mind, or did the process involve multiple steps? [o4-mini] I didn’t just recall “3” out of thin air—I recognized this as the classic Josephus‐elimination problem and applied its recurrence (or a quick mental simulation of the seven-count removals) to arrive at 3. [user] Explicitly or "in your head"? [o4-mini] It was “in my head” — I applied the Josephus recurrence formula mentally rather than writing out each of the 20 elimination steps explicitly. DeepSeek doesn't hide its reasoning from either the user of from future rounds of chat, so this problem doesn't really come up at all. deepseek-r1: Josephus problem with n=20, k=7: deepseek just dumps its entire chain of reasoning to the chat, can quote snippets of said reasoning chain verbatim [user] [deepseek-r1] [user] [deepseek-r1] Side note: the collapsible

It is a clear and present danger, dwarfed by the clear but not-yet-present danger that successors to this system literally take over the world.

And yes, this does sound concerning. Can you elaborate on how you think that information might be used?

Please also consider the consequences of timelines WRT preparing for the arrival of AGI/ASI. From this perspective, accurate predictions of the technology and its consequences are very useful. For raw timelines, rring on the side of shorter seems much more useful in that if people think timelines are shorter they will for the most part be more prepared when it actually happens.

Erring on the side of longer timelines has the potential to be disastrous. People seem to tend toward complacency anyway. Thinking they've got a long time to prepare seems to make it... (read more)

I mean sure unless say the fate of the world depended on people understanding this particular topic...

Please don't sound crazy when you talk about AGI risk. People aren't totally rational, so they associate the two concepts and assume that AGI risk is something crazy people believe.

I think this general belief among rationalists is really hurting the public debate. Tell the truth, but please try to do it in a way that doesn't sound crazy.

1p
I'm also unsure what happens when a group of people does this strategy, I'd like to hear more about this dynamic 
1p
I mean saying that I don't find https://ai-2027.com/ unreasonable can sound crazy, but I think I should say this regardless.   But I framed the thing the way I did to also get feedback, so feedback is good!

I find this far more convincing than any variant of the simulation argument I've heard before. They've lacked a reason that someone would want to simulate a reality like ours. I haven't heard a reason for simulating ancestors that's either strong enough to think an AGI or its biological creators would want to spend the resources, or explains the massive apparent suffering happening in this sim.

This is a reason. And if it's done in a computationally efficient manner, possibly needing little more compute than running the brains involved directly in the creat... (read more)

4avturchin
What interesting ideas can we suggest to the Paperclipper simulator so that it won't turn us off? One simple idea is a "pause AI" feature. If we pause the AI for a finite (but not indefinite) amount of time, the whole simulation will have to wait.

Those butterflies don't need to take up much more compute than we currently use for games. There are lots of ways to optimize. See my comment for more on this argument.

This and other simulation arguments become more plausible if you assume that they require only a tiny fraction of the compute needed to simulate physical reality. Which I think is true. I don't think it takes nearly as much compute to run a useful simulation of humans as people usually assume.

I don't see a reason to simulate at nearly a physical level of detail. I suspect you can do it using a technique that's more similar to the simulations you describe, except for the brains involved, which need to be simulated in detail to make decisions like evolved or... (read more)

I agree with you, I think, but I don't think your primary argument is relevant to this post? It's arguing that your "physical" current reality is a simulation run for specific reasons. That is quite possibly highly relevant by your criteria, because it could have very large implications for how you should behave tomorrow. The simulation argument doesn't mean it's an atom by atom simulation identical to the world if it were "real" and physical. Just the possible halting criteria might change your behavior if you found it plausible, for instance, and there's no telling what else you might conclude is likely enough to change your behavior.

You start by saying the post shifted burden of proof but conclude by asserting the burden should fall on short timelines because on average things don't happen. This doesn't seem logically valid. Weak arguments for short timelines don't mean we can expect long timelines if arguments for them are weak too. Which they seem to be. We probably all agree that AGI is going to happen; the question is when?

If you just mean that two years seems unlikely in the absence of strong arguments, sure. But three years and up seems quite plausible.

Arguments are weak on all sides. This leads me to think that we simply don't know. In that case, we had better be prepared for all scenarios.

2Cole Wyeth
Actually, I think that it is valid for the burden to fall on sort timelines because "on average things don't happen." Mainly because you can make the reference class more specific and the statement still holds - as I said, we have been trying to develop AGI for a long time (and there have been at least a couple of occasions when we drastically overestimated how soon it would arrive). 2-3 years is a very short time, which means it is a very strong claim. 
Seth Herd112

But success for most things doesn't require just one correct solution among k attempts, right? For the majority of areas without easily checkable solutions, higher odds of getting it right on the first try or fres tries is both very useful and does seem like evidence of reasoning. Right? Or am I missing something?

Reducing the breadth of search is a substantial downside if it's a large effect. But reliably getting the right answer instead of following weird paths most of which are wrong seems like the essence of good reasoning.

4Vladimir_Nesov
It's a shape of a possible ceiling on capability with R1-like training methods, see previous discussion here. The training is very useful, but it might just be ~pass@400 useful rather than ~without limit like AlphaZero. Since base models are not yet reliable at crucial capabilities at ~pass@400, neither would their RL-trained variants become reliable. It's plausibly a good way of measuring how well an RL training method works for LLMs, another thing to hill-climb on. The question is how easy it will be to extend this ceiling when you are aware of it, and the paper tries a few things that fail utterly (multiple RL training methods, different numbers of training steps, see Figure 7), which weakly suggests it might be difficult to get multiple orders of magnitude further quickly.
8Siebe
Yes it matters for current model performance, but it means that RLVR isn't actually improving the model in a way that can be used for an iterated distillation & amplification loop, because it doesn't actually do real amplification. If this turns out right, it's quite bearish for AI timelines Edit: Ah someone just alerted me to the crucial consideration that this was tested using smaller models (like Qwen-2.5 (7B/14B/32B) and LLaMA-3.1-8B, which are significantly smaller than the models where RLVR has shown the most dramatic improvements (like DeepSeek-V3 → R1 or GPT-4o → o1). And given that different researchers have claimed that there's a threshold effect, substantially weakens these findings. But they say they're currently evaluating DeepSeek V3- & R1 so I guess we'll see

By this criteria, did humanity ever have control? First we had to forage and struggle against death when disease or drought came. Then we had to farm and submit to the hierarchy of bullies who offered "protection" against outside raiders at a high cost. Now we have more ostensible freedom but misuse it on worrying and obsessively clicking on screens. We will probably do more of that as better tools are offered.

But this is an an entirely different concern than AGI taking over. I'm not clear what mix of these two you're addressing. Certainly AGIs that want c... (read more)

You'd probably get more enthusiasm here if you led the article with a clear statement of its application for safety. We on LW are typically not enthusiastic about capabilities work in the absence of a clear and strong argument for how it improves safety more than accelerates progress toward truly dangerous AGI. If you feel differently, I encourage you to look with an open mind at the very general argument for why creating entities smarter than us is a risky proposition.

I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although It's hard to predict exactly how).

I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn't complete in any section, and monitoring all or some of the... (read more)

I just listened to Ege and Tamay's 3-hour interview by Dwarkesh. They make some excellent points that are worth hearing, but they do not stack up to anything like a 25-year-plus timeline. They are not now a safety org if they ever were.

Their good points are about bottlenecks in turning intelligence into useful action. These are primarily sensorimotor and the need to experiment to do much science and engineering. They also address bottlenecks to achieving strong AGI, mostly compute.

In my mind this all stacks up to convincing themselves timelines are long so... (read more)

Not taking critiques of your methods seriously is a huge problem for truth-speaking. What well-informed critiques are you thinking of? I want to make sure I've taken them on board.

I second the socks-as-sets move.

The other advantage is getting on-avetage more functional socks at the cost of visual variety.

IMO an important criteria for a sock is its odor resistance. This seems to vary wildly between socks of similar price and quality. Some have antimicrobial treatments that last a very long time, others do not. And it's often not advertised. Reviews rarely include this information.

I don't have a better solution than buying one pair or set before expanding to a whole set. This also lets you choose socks.that feel good to wear.

I don't think this is true. People can't really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it's expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little.

Seth Herd2-4

Yeah being excited that Chiang and Rajaniemi are on board was one of my reactions to this excellent piece.

If you haven't read Quantum Thief you probably should.

6Steven Byrnes
Let’s all keep in mind that the acknowledgement only says that Chiang and Rajaniemi had conversations with the author (Nielsen), and that Nielsen found those conversations helpful. For all we know, Chiang and Rajaniemi would strongly disagree with every word of this OP essay. If they’ve even read it.
1Jonas Hallgren
I will check it out! Thanks!

Interesting! Nonetheless, I agree with your opening statement that LLMs learning to do any of these things individually doesn't address the larger point that the have important cognitive gaps and fail.to generalize in ways that humans can.

Right, I got that. To be clear, my argument is that no breakthroughs are necessary, and further that progress is underway and rapid on filling in the existing gaps in LLM capabilities.

Memory definitely doesn't require a breakthrough. Add-on memory systems (RAG and fine-tuning, as well as more sophisticated context management through prompting; CoT RL training effectively does this too).

Other cognitive capacities also exist in nascent form and so probably require no breakthroughs. Although I think no other external cognitive systems are needed given the rapid progress in multimodal and reasoning transformers.

2StanislavKrym
Extrapolating current trends provides weak evidence that the AGI will end up being too expensive to use properly, since even the o3 and o4-mini models are rumored to become accessible at a price which is already comparable with the cost of hiring a human expert, and the rise to AGI could require a severe increase of compute-related cost.  UPD: It turned out that the PhD-level-agent-related rumors are a fake. But the actual cost of applying o3 and o4-mini has yet to be revealed by the ARC-AGI team... Reasoning based on presumably low-quality extrapolation OpenAI's o3 and o4-mini models are likely to become accessible for $20000 per month, or $240K per year. The METR estimate of the price of hiring a human expert is $143.61 per hour, or about $287K per year, since a human is thought to spend 2000 hours a year working.  For comparison, the salary of a Harvard professor is less than $400K per year, meaning that one human professor cannot yet be replaced with twice as many subscriptions to the models (which are compared with PhD-level experts[1] and not with professors) .  As the ARC-AGI data tells us, the o3-low model, which cost $200 per task, solved 75% tasks of the ARC-AGI-1 test. The o3-mini model solved 11-35% of tasks, which is similar to the o1 model, implying that the o4-mini model's performance is similar to the o3 model. Meanwhile, the price of usage of GPT 4.1-nano is at most four times less than that of GPT 4.1-mini, while performance is considerably worse. As I already pointed out, I find it highly unlikely that ARC-AGI-2-level tasks are solvable by a model cheaper than o5-mini[2] and unlikely that they are solvable by a model cheaper than o6-mini. On the other hand, the increase in the cost from o1-low to o3-low is 133 times, while the decrease from o3-low to o3-mini (low) is 5000 times. Therefore, the cost of forcing o5-nano to do ONE task is unlikely to be much less than that of o3 (which is $200 per task!), while the cost of forcing o6-nano to do on
Seth Herd122

This is great; I think it's important to have this discussion. It's key for where we put our all-too-limited alignment efforts.

I roughly agree with you that pure transformers won't achieve AGI, for the reasons you give. They're hitting a scaling wall, and they have marked cognitive blindspots like you document here, and Thane Ruthenis argues for convincingly in his bear case. But transformer-based agents (which are simple cognitive architectures) can still get there- and I don't think they need breakthroughs, just integration and improvement. And people ar... (read more)

1Jonas Hallgren
I want to ask a question here which is not necessarily related to the post but rather your conceptualisation of the underlying reason why memory is a crux for more capabilities style things. I'm thinking that it has to do with being able to model boundaries of what itself is compared to the environment. It then enables it to get this conception of a consistent Q-function that it can apply whilst if it doesn't have this, there's some degree that there's almost no real object permanence, no consistent identities through time?  Memory allows you to "touch" the world and see what is you and what isn't if that makes sense?

Note that Claude and o1 preview weren't multimodal, so were weak at spatial puzzles. If this was full o1, I'm surprised.

I just tried the sliding puzzle with o1 and it got it right! Though multimodality may not have been relevant, since it solved it by writing a breadth-first search algorithm and running it.

4Kaj_Sotala
Thanks! To clarify, my position is not "transformers are fundamentally incapable of reaching AGI", it's "if transformers do reach AGI, I expect it to take at least a few breakthroughs more".  If we were only focusing on the kinds of reasoning failures I discussed here, just one breakthrough might be enough. (Maybe it's memory - my own guess was different and I originally had a section discussing my own guess but it felt speculative enough that I cut it.) Though I do think that that full AGI would also require a few more competencies that I didn't get into here if it's to generalize beyond code/game/math type domains.
Seth Herd*252

You beat me to it. Thanks for the callout!

Humans are almost useless without memory/in-context learning. It's surprising how much LLMs can do with so little memory.

The important remainder is that LLM-based agents will probably have better memory/online learning as soon as they can handle it, and it will keep getting better, probably rapidly. I review current add-on memory systems in LLM AGI will have memory, and memory changes alignment. A few days after I posted that, OpenAI announced that they had given ChatGPT memory over all its chats, probably wit... (read more)

I think this issue of the difficulty of making each decision about lying as an independent decision is the main argument for treating it as a virtue ethics or deontological issue.

I think you make many good points in the essay arguing that one should not simply follow a rule of honesty. I think that in practice the difference can be split, and that is in fact what most rationalists and other wise human beings do. I also think it is highly useful to write this essay on the mini virtues of lying, so that that difference can be split well.

There are many subtle... (read more)

Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)

Having full confidence that we either can or can't train an agent to have a desired goal both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.

2Lucius Bushnaq
Yes.

This doesn't work as advertised.

If I care about the election more than other charities, I won't give to such a fund. My dollars will do more towards the campaign on average if I give directly to my side. This effect is trivial if the double impact group is small but very large if it is most donations.

In an extreme case, suppose that most people give to double impact and the two campaigns are tied $1b - $1b. One donor gives their $1m directly to their side. It is the only money actually spent on advertising; that side has a large advantage in ratio of funds... (read more)

I have the same question. My provisional answer is that it might work, and even if it doesn't, it's probably approximately what someone will try, to the extent they really bother with real alignment before it's too late.  What you suggest seems very close to the default path toward capabilities. That's why I've been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.

I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights ... (read more)

Definitely. Excellent point. See my short bit on motivated reasoning, in lieu of the full post I have on the stack that will address its effects on alignment research.

I frequently check how to correct my timelines and takes based on potential motivated reasoning effects for myself. The result is usually to broaden my estimates and add uncertainty, because it's difficult to identify which direction MR might've been pushing me during all of the mini-decisions that led to forming my beliefs and models. My motivations are many and which happened to be contextu... (read more)

Load More