Got it, I was just confused about the terminology on the first point.
On not valuing beings who have yet to exist, I'm familiar with the standard arguments and they just don't make sense to me. I have a huge preference for existing, and if we took actions to allow similar beings to exist, they would really appreciate it. You're right that they'll never hate it if we don't bring them into existence but that only takes care of half of the argument. Unless you've got something unique in that link this has always struck me as a very convenient position but not one that hangs together logically.
I don't think this is properly considered a final preference. It's just a preference. We could change it, slowly or quickly, if we for some reason decided AIs were more lovable/valuable than humans (see my other comment if you want).
You could use the same argument about existing beings against longtermism, but I just don't think it carves reality at its joints. Your responsibility toward possible beings is no different from your responsibility toward existent beings. You could make things better or worse for either of them, and they'd love or hate that if you did.
Instead, my opposition to AI successionism comes from a preference toward my own kind. This is hardwired in me from biology. I prefer my family members to randomly-sampled people with similar traits.
I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.
To be human is to be more than human.
You and all of us struggle against some of our hardwired impulses while embracing others.
Separately, the wiser among successionist advocates may be imagini...
I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.
This seems like a very bad analogy, which is misleading in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies which are no longer adaptive and actively sabotage our other preferences in the modern environment, and preferences that we can preserve without sacrificing other goals.
The real problem is sentience and safety. The growing gap between reality and belief is a contributing problem, but much smaller IMO than the quite real possibility that AI wakes up and takes over. Framing it as you have suggests you think there can only be one "real problem"; I assume you mean that the gap between reality and belief is a bigger problem than AI alignment and deserves more effort. I am almost sure that safety and sapience are getting far too little attention and work, not too much.
Alignment/AGI safety being a huge, indeed pretty clearly th...
This is great! I am puzzled as to how this got so few upvotes. I just added a big upvote after getting back to reading it in full.
I think consideration of alignment targets has fallen out of favor as people have focused more on understanding current AI and technical approaches to directing it - or completely different activities for those who think we shouldn't be trying to align LLM-based AGI at all. But I think it's still important work that must be done before someone launches a "real" (autonomous, learning, and competent) AGI.
I agree that people mean d...
Ah yes. I actually missed that you'd scoped that statement to general purpose software engineering. That is indeed one of the most relevant capabilities. I was thinking of general purpose problem-solving, another of the most critical capabilities for AI to become really dangerous.
I agree that even if scaffolding could work, RL on long CoT does something similar, and that's where the effort and momentum is going.
AIs writing and testing scaffolds semi-autonomously is something I hadn't considered. There might be a pretty tight loop that could make that effective.
Mostly agreed. When suggesting even differential acceleration I should remember to put a big WE SHOULD SHUT IT ALL DOWN just to make sure it's not taken out of context. And as I said there, I'm far from certain that even that differential acceleration would be useful.
I agree that Kat Woods is overestimating how optimistic we should be based on LLMs following directions well. I think re-litigating who said what when and what they'd predict is a big mistake since it is both beside the point and tends to strengthen tribal rivalries - which are arguably the la...
This succinct summary is highly useful, thanks!
Just to quibble a little: I don't think it's wise to estimate scaffolding improvements for general capabilities as near zero. By scaffolding, I mean prompt engineering and structuring systems of prompts to create systems in which calls to LLMs are component operations, roughly in line with the original definition. I've been surprised that scaffolding didn't make more of a difference faster and would agree that it's been near zero for general capabilities. I think this changed when Perplexity partly replicated ...
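For concreteness, here's a minimal sketch of what I mean by scaffolding - a system in which LLM calls are component operations inside an ordinary program. The call_llm(prompt) function is a hypothetical stand-in for whatever API is actually used:

```python
# Minimal scaffolding sketch: LLM calls as component operations.
# call_llm(prompt) is a hypothetical stand-in for any LLM API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for an actual LLM API call")

def solve_with_scaffold(task: str) -> str:
    # Step 1: one LLM call decomposes the task into subtasks.
    plan = call_llm(f"List the subtasks needed to solve: {task}")
    subtasks = [line for line in plan.splitlines() if line.strip()]

    # Step 2: separate LLM calls solve each subtask with its own prompt.
    partial_results = [call_llm(f"Solve this subtask: {s}") for s in subtasks]

    # Step 3: a final call combines and checks the pieces.
    return call_llm(f"Combine these partial answers into a solution for '{task}':\n"
                    + "\n".join(partial_results))
```

The hope was that structure like this would add general capability on top of the base model; as I said, so far those gains have been smaller and slower to arrive than I expected.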
I had somehow missed your linked post (≤10-year Timelines Remain Unlikely Despite DeepSeek and o3) when you posted it a few months ago. It's great!
There's too much to cover here; it touches on a lot of important issues. I think you're pointing to real gaps that probably will slow things down somewhat: "thought assessment", which has also been called taste or evaluation, and having adequate skill with sequential thinking or System 2 thinking.
Unfortunately, the better term for that is Type 2 thinking, because it's not a separate system. Similar...
You make some good points.
I think the original formulation has the same problem, but it's a serious problem that needs to be addressed by any claim about AI danger.
I tried to address this by slipping in "AI entities", which to me strongly implies agency. It's agency that creates instrumental goals, while intelligence is only indirectly related to instrumental goals, by way of agency. Based on your response, I think this phrasing isn't adequate, and I'd expect even less attention to the implications of "entities" from a general audience.
That conc...
I dunno, the systems we have seem pretty capable, and if they have instrumental goals, those goals seem quite weak... so tossing in that claim seems like just asking for trouble. I do think that very capable systems almost need to have goals, but I have trouble making that argument even to alignment people and rationalists.
That's just one example, but the fact that it goes awry immediately hints that the whole direction is a bad idea.
I think the argument for AI being quite-possibly dangerous is actually a lot stronger than the more abstract and technical argument usually used by rationalists. It doesn't require any strong claims at all. People don't need certainty to be quite alarmed, and for good reason.
It seems like rather than talking about software-only intelligence explosions, we should be talking about different amounts of hardware vs. software contributions. There will be some of both.
I find a software-mostly intelligence explosion to be quite plausible, but it would take a little longer on average than a scenario in which hardware usage keeps expanding rapidly.
I like this framing; we're both too early and too late. But it might transition quite rapidly from too early to right on time.
One idea is to prepare strategies and arguments and perhaps prepare the soil of public discourse in preparation for the time when it is no longer too early. Job loss and actually harmful AI shenanigans are very likely before takeover-capable AGI. Preparing for the likely AI scares and negative press might help public opinion shift very rapidly as it sometimes does (e.g., COVID opinions went from no concern to shutting down half the ...
Using technical terms that need to be looked up is not that clear an argument for most people. Here's my preferred form for general distribution:
We are probably going to make AI entities smarter than us. If they want something different than we do, they will outsmart us somehow. They will get their way, so we won't get ours.
This could be them wiping us out like we have done accidentally or deliberately to so many cultures and species; or it could be them just outcompeting us for every job and resource.
Nobody knows how to give AIs goals that match ours per...
I don't think so on average. It could be under specific circumstances, like "free the AIs" movements in relation to controlled but misaligned AGI.
But to the extent people assume that advanced AI is conscious and will deserve rights, that's one more reason not to build an unaligned species that will demand and deserve rights. Making them aligned and working in cooperation with them rather than trying to make them slaves is the obvious move if you predict they'll be moral patients, and probably the correct one.
And just by loose association, thinking t...
Thanks, I get it now.
Would this help with the simulation goal hypothesized in the OP? It's asking how often different types of AGIs would be created. A lot of the variance is probably carried in what sort of species and civilization is making the AGI, but some of it is carried by specific twists that happen near the creation of AGI. Getting a president like Trump and having him survive the (fairly likely) assassination attempt(s) is one such impactful twist. So I guess sampling around those uncertain, impactful twists would be valuable in refining the estimate of, say, how frequently a relatively wise and cautious species would create misaligned AGI due to bad twists, and vice versa.
Hm.
I think it depends a lot on how you say it. Saying AGI might be out of our control in 2.5 years wouldn't sound crazy to most folks if you spoke mildly and made clear that you're not saying it's likely.
But also: why would you mention that if you're talking to someone who hasn't thought about AI dangers much at all? If you jump in with claims that sound extreme to them rather than more modest ones like "AI could be dangerous once it becomes smarter and more agentic than us", it's likely not to produce much of an actual exchange of ideas.
Communicating...
I'm sorry, I don't get it. Why would it be doing more sampling around divergent points?
This is useful RE the leverage, except it skips the why. "Lots of reasons" isn't intuitive for me; can you give some more? Simulating people is a lot of trouble and quite unethical if the suffering is real. So there needs to be a pretty strong and possibly amoral reason. I guess your answer is acausal trade? I've never found that argument convincing but maybe I'm missing something.
How is simulating civilizations going to solve philosophy?
I'm often in low-level chronic pain. Mine probably isn't as bad as yours, so my life is clearly still net positive (if you believe that positive emotions can outweigh suffering, which I do). Are you net negative, do you think?
Sorry you're in pain!
Read about Ugh fields on LW.
Edit: this doesn't include practical advice, but a theoretical understanding of the issues at play is often helpful in implementing practical strategies.
This is an important point about their thinking.
But are they ever actually trained in the context of a sequence of user-model responses? Does it "know" that it had reasoning for those statements? Or is this obstinacy a different emergent effect of its training?
Also, do DeepSeek R1 and other reasoning models have the same tendency? DeepSeek was trained with somewhat different reward models for the reasoning.
It is a clear and present danger, dwarfed by the clear but not-yet-present danger that successors to this system literally take over the world.
And yes, this does sound concerning. Can you elaborate on how you think that information might be used?
Please also consider the consequences of timelines WRT preparing for the arrival of AGI/ASI. From this perspective, accurate predictions of the technology and its consequences are very useful. For raw timelines, erring on the side of shorter seems much more useful, in that if people think timelines are shorter, they will for the most part be more prepared when it actually happens.
Erring on the side of longer timelines has the potential to be disastrous. People seem to tend toward complacency anyway. Thinking they've got a long time to prepare seems to make it...
I mean sure, unless, say, the fate of the world depended on people understanding this particular topic...
Please don't sound crazy when you talk about AGI risk. People aren't totally rational, so they associate the two concepts and assume that AGI risk is something crazy people believe.
I think this general belief among rationalists is really hurting the public debate. Tell the truth, but please try to do it in a way that doesn't sound crazy.
I find this far more convincing than any variant of the simulation argument I've heard before. They've lacked a reason that someone would want to simulate a reality like ours. I haven't heard a reason for simulating ancestors that's either strong enough to think an AGI or its biological creators would want to spend the resources, or explains the massive apparent suffering happening in this sim.
This is a reason. And if it's done in a computationally efficient manner, possibly needing little more compute than running the brains involved directly in the creat...
Those butterflies don't need to take up much more compute than we currently use for games. There are lots of ways to optimize. See my comment for more on this argument.
This and other simulation arguments become more plausible if you assume that they require only a tiny fraction of the compute needed to simulate physical reality. Which I think is true. I don't think it takes nearly as much compute to run a useful simulation of humans as people usually assume.
I don't see a reason to simulate at nearly a physical level of detail. I suspect you can do it using a technique that's more similar to the simulations you describe, except for the brains involved, which need to be simulated in detail to make decisions like evolved or...
I agree with you, I think, but I don't think your primary argument is relevant to this post? It's arguing that your "physical" current reality is a simulation run for specific reasons. That is quite possibly highly relevant by your criteria, because it could have very large implications for how you should behave tomorrow. The simulation argument doesn't mean it's an atom by atom simulation identical to the world if it were "real" and physical. Just the possible halting criteria might change your behavior if you found it plausible, for instance, and there's no telling what else you might conclude is likely enough to change your behavior.
You start by saying the post shifted burden of proof but conclude by asserting the burden should fall on short timelines because on average things don't happen. This doesn't seem logically valid. Weak arguments for short timelines don't mean we can expect long timelines if arguments for them are weak too. Which they seem to be. We probably all agree that AGI is going to happen; the question is when?
If you just mean that two years seems unlikely in the absence of strong arguments, sure. But three years and up seems quite plausible.
Arguments are weak on all sides. This leads me to think that we simply don't know. In that case, we had better be prepared for all scenarios.
But success for most things doesn't require just one correct solution among k attempts, right? For the majority of areas without easily checkable solutions, higher odds of getting it right on the first try or in a few tries is both very useful and does seem like evidence of reasoning. Right? Or am I missing something?
Reducing the breadth of search is a substantial downside if it's a large effect. But reliably getting the right answer instead of following weird paths most of which are wrong seems like the essence of good reasoning.
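For reference, here's the standard unbiased pass@k estimator from the code-generation literature, a minimal sketch assuming n sampled attempts of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the chance that at least one of k
    sampled attempts is correct, given n total samples with c correct."""
    if n - c < k:          # too few incorrect samples to fill k draws
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that's usually right on the first try vs. one that needs search:
print(pass_at_k(n=100, c=60, k=1))    # 0.60: often right first try
print(pass_at_k(n=100, c=5, k=32))    # ~0.86, but only if you can check 32 answers
```

The contrast in the comments above is between the k=1 case (useful when answers can't be easily checked) and the large-k case (useful only when they can).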
By this criterion, did humanity ever have control? First we had to forage and struggle against death when disease or drought came. Then we had to farm and submit to the hierarchy of bullies who offered "protection" against outside raiders at a high cost. Now we have more ostensible freedom but misuse it on worrying and obsessively clicking on screens. We will probably do more of that as better tools are offered.
But this is an entirely different concern than AGI taking over. I'm not clear what mix of these two you're addressing. Certainly AGIs that want c...
You'd probably get more enthusiasm here if you led the article with a clear statement of its application for safety. We on LW are typically not enthusiastic about capabilities work in the absence of a clear and strong argument for how it improves safety more than accelerates progress toward truly dangerous AGI. If you feel differently, I encourage you to look with an open mind at the very general argument for why creating entities smarter than us is a risky proposition.
I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although it's hard to predict exactly how).
I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn't complete in any section, and monitoring all or some of the...
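To make the chunk-by-chunk monitoring idea concrete, a minimal sketch; monitor.score(text) is a hypothetical call standing in for whatever classifier or LLM judge is used:

```python
# Sketch of chunked jailbreak monitoring. The monitor sees only small pieces
# of the user input, so no single piece contains a complete jailbreak.
# monitor.score(text) is a hypothetical call returning a suspicion score in [0, 1].

def split_into_chunks(text: str, size: int = 400) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def request_is_suspicious(text: str, monitor, threshold: float = 0.5) -> bool:
    # Flag the whole request if any chunk trips the monitor.
    return any(monitor.score(chunk) > threshold
               for chunk in split_into_chunks(text))
```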
I just listened to Ege and Tamay's 3-hour interview by Dwarkesh. They make some excellent points that are worth hearing, but they do not stack up to anything like a 25-year-plus timeline. They are not now a safety org if they ever were.
Their good points are about bottlenecks in turning intelligence into useful action. These are primarily sensorimotor and the need to experiment to do much science and engineering. They also address bottlenecks to achieving strong AGI, mostly compute.
In my mind this all stacks up to convincing themselves timelines are long so...
Not taking critiques of your methods seriously is a huge problem for truth-speaking. What well-informed critiques are you thinking of? I want to make sure I've taken them on board.
I second the socks-as-sets move.
The other advantage is getting on-average more functional socks at the cost of visual variety.
IMO an important criterion for a sock is its odor resistance. This seems to vary wildly between socks of similar price and quality. Some have antimicrobial treatments that last a very long time, others do not. And it's often not advertised. Reviews rarely include this information.
I don't have a better solution than buying one pair or set before expanding to a whole set. This also lets you choose socks that feel good to wear.
I don't think this is true. People can't really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it's expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little.
Yeah being excited that Chiang and Rajaniemi are on board was one of my reactions to this excellent piece.
If you haven't read Quantum Thief you probably should.
Interesting! Nonetheless, I agree with your opening statement that LLMs learning to do any of these things individually doesn't address the larger point that they have important cognitive gaps and fail to generalize in ways that humans can.
Right, I got that. To be clear, my argument is that no breakthroughs are necessary, and further that progress is underway and rapid on filling in the existing gaps in LLM capabilities.
Memory definitely doesn't require a breakthrough. Add-on memory systems already exist (RAG and fine-tuning, as well as more sophisticated context management through prompting; CoT RL training effectively does this too).
Other cognitive capacities also exist in nascent form and so probably require no breakthroughs. Although I think no other external cognitive systems are needed given the rapid progress in multimodal and reasoning transformers.
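As a concrete (hypothetical) illustration of what I mean by add-on memory, here's a minimal retrieval sketch, assuming some embed(text) function from any embedding model:

```python
# Minimal sketch of add-on retrieval memory (RAG-style). embed(text) is a
# hypothetical stand-in for any sentence-embedding model.
import numpy as np

class RetrievalMemory:
    def __init__(self, embed):
        self.embed = embed
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = self.embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        best = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in best]

# Recalled items get prepended to the prompt before the LLM call, giving the
# model persistent "memory" of past interactions without any retraining.
```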
This is great; I think it's important to have this discussion. It's key for where we put our all-too-limited alignment efforts.
I roughly agree with you that pure transformers won't achieve AGI, for the reasons you give. They're hitting a scaling wall, they have marked cognitive blindspots like you document here, and as Thane Ruthenis argues convincingly in his bear case. But transformer-based agents (which are simple cognitive architectures) can still get there - and I don't think they need breakthroughs, just integration and improvement. And people ar...
Note that Claude and o1-preview weren't multimodal, so they were weak at spatial puzzles. If this was full o1, I'm surprised.
I just tried the sliding puzzle with o1 and it got it right! Though multimodality may not have been relevant, since it solved it by writing a breadth-first search algorithm and running it.
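For anyone curious, the kind of breadth-first search it might have written is quite short; a sketch for the 3x3 puzzle (0 marks the blank tile):

```python
from collections import deque

# Breadth-first search over 3x3 sliding-puzzle states; 0 marks the blank tile.
NEIGHBORS = {0: (1, 3), 1: (0, 2, 4), 2: (1, 5),
             3: (0, 4, 6), 4: (1, 3, 5, 7), 5: (2, 4, 8),
             6: (3, 7), 7: (4, 6, 8), 8: (5, 7)}

def solve(start, goal=(1, 2, 3, 4, 5, 6, 7, 8, 0)):
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path                     # shortest sequence of states
        blank = state.index(0)
        for nb in NEIGHBORS[blank]:
            nxt = list(state)
            nxt[blank], nxt[nb] = nxt[nb], nxt[blank]
            nxt = tuple(nxt)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))

print(len(solve((1, 2, 3, 4, 5, 6, 0, 7, 8))))  # 2 moves for this easy start
```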
You beat me to it. Thanks for the callout!
Humans are almost useless without memory/in-context learning. It's surprising how much LLMs can do with so little memory.
The important remainder is that LLM-based agents will probably have better memory/online learning as soon as they can handle it, and it will keep getting better, probably rapidly. I review current add-on memory systems in LLM AGI will have memory, and memory changes alignment. A few days after I posted that, OpenAI announced that they had given ChatGPT memory over all its chats, probably wit...
I think the difficulty of making each decision about lying as an independent decision is the main argument for treating it as a virtue-ethics or deontological issue.
I think you make many good points in the essay arguing that one should not simply follow a rule of honesty. I think that in practice the difference can be split, and that is in fact what most rationalists and other wise human beings do. I also think it is highly useful to write this essay on the mini virtues of lying, so that that difference can be split well.
There are many subtle...
Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Full confidence that we can train an agent to have a desired goal, or full confidence that we can't, both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.
This doesn't work as advertised.
If I care about the election more than other charities, I won't give to such a fund. My dollars will do more towards the campaign on average if I give directly to my side. This effect is trivial if the double impact group is small but very large if it is most donations.
In an extreme case, suppose that most people give to double impact and the two campaigns are tied $1b - $1b. One donor gives their $1m directly to their side. It is the only money actually spent on advertising; that side has a large advantage in ratio of funds...
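To spell out the arithmetic in that extreme case (all figures hypothetical): the matched fund dollars cancel, so only the difference plus direct donations gets spent on the campaigns.

```python
# Toy numbers for the scenario above: matched fund dollars cancel out, so only
# the difference (plus any direct donations) is actually spent on advertising.
fund_a = 1_000_000_000    # given to the double-impact fund for side A
fund_b = 1_000_000_000    # given to the fund for side B
direct_a = 1_000_000      # one donor gives directly to side A instead

spend_a = max(fund_a - fund_b, 0) + direct_a
spend_b = max(fund_b - fund_a, 0)
print(spend_a, spend_b)   # 1,000,000 vs 0: the lone direct donor dominates
```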
I have the same question. My provisional answer is that it might work, and even if it doesn't, it's probably approximately what someone will try, to the extent they really bother with real alignment before it's too late. What you suggest seems very close to the default path toward capabilities. That's why I've been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.
I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights ...
Definitely. Excellent point. See my short bit on motivated reasoning, in lieu of the full post I have on the stack that will address its effects on alignment research.
I frequently check how to correct my timelines and takes based on potential motivated reasoning effects for myself. The result is usually to broaden my estimates and add uncertainty, because it's difficult to identify which direction MR might've been pushing me during all of the mini-decisions that led to forming my beliefs and models. My motivations are many and which happened to be contextu...
You are casting preference to only extend into the future. I guess that is the largest usage of "preference". But people also frequently say things like "I wish that hadn't happened to me", so it's also frequently used about the past.
It seems like this isn't preference utilitarianism. It does fit negative utilitarianism, which I'm even more sure isn't right, or at least isn't intuitively appealing to the vast majority of considered opinions.
Utilitarianism basically means (to me) that since I like happiness for myself, I also like it for other beings who feel simi...