I think this is wrong, and a lot of why I disagree with the pivotal act framing probably comes down to disagreeing with the assumption that future technology will be radically biased towards offense. While I do think biotechnology is probably pretty offense-biased today, I also think it's tractable to reduce bio-risk without trying for pivotal acts.
Also, I think @evhub's point about homogeneity of AI takeoff bears on this here. While I don't agree with all the implications, like there being no warning shot for deceptive alignment (because of synthetic data), I think there's a real sense in which a lot of AIs are very likely to be highly homogeneous, which would undercut your point here:
I think it depends on how we interpret Yudkowsky. If we interpret him as saying 'Even if we get aligned AGI, we need to somehow stop other people from building unaligned AGI' then yeah, it's a question of offense-defense balance and homogeneity etc. However, if we interpret him as saying 'We'll probably need to proceed cautiously, ramping up the capabilities of our AIs at slower-than-maximum speed, in order to be safe -- but that means someone cutting corners on safety will surpass us, unless we stop them' then offense-defense and homogeneity aren't the crux. And I do interpret him the second way.
That said, I also probably disagree with Yudkowsky here in the sense that I think that we don't need powerful AI systems to carry out the most promising 'pivotal act' (i.e. first domestic regulation, then international treaty, to ensure AGI development proceeds cautiously.)
I admit, I was interpreting him in the first sense: that even if we got an aligned AGI, we would need to stop others from building unaligned AGIs. But I also see your interpretation as plausible, and under that model, I agree that we'd ideally like to not have a maximum-speed race, and to go somewhat slower as we get closer to AGI and ASI.
I think a maximum sprint to get more capabilities is also quite bad, though conditional on that happening, I don't think we'd automatically be doomed, and there's a non-trivial, but far too low chance that everything works out.
Cool. Then I think we are in agreement; I agree with everything you've just said. (Unfortunately I think that when it matters most, around the time of AGI, we'll be going at close-to-maximum speed, i.e. we'll be maybe delaying the creation of superintelligence by like 0 - 6 months relative to if we were pure accelerationists.)
How fast do you think that the AI companies could race from AGI to superintelligence assuming no regulation or constraints on their behavior?
Depends on the exact definitions of both. Let's say AGI = 'a drop-in substitute for an OpenAI research engineer' and ASI = 'Qualitatively at least as good as the best humans at every cognitive task; qualitatively superior on many important cognitive tasks; also, at least 10x faster than humans; also, able to run at least 10,000 copies in parallel in a highly efficient organizational structure (at least as efficient as the most effective human organizations like SpaceX).'
In that case I'd say probably about eight months? Idk. Could be more like eight weeks.
I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values; in particular, we can place honeypots that are practically indistinguishable from the real world.
This sounds less like the notion of the first critical try is wrong, and more like you think synthetic data will allow us to confidently resolve the alignment problem beforehand. Does that scan?
Or is the position stronger, more like we don't need to solve the alignment problem in general, due to our ability to run simulations and use synthetic data?
This is kind of correct:
This sounds less like the notion of the first critical try is wrong, and more like you think synthetic data will allow us to confidently resolve the alignment problem beforehand. Does that scan?
but my point is this shifts us from a one-shot problem in the real world to a many-shot problem in simulations based on synthetic data before the AI gets unimaginably powerful.
We do still need to solve it, but it's a lot easier to solve problems when you can turn them into many-shot problems.
Cool post. I agree with the many-shot part in principle. It strikes me that in a few years (hopefully not months?), this will look naive in much the same way that all the old thinking on how a well-boxed AI might be controlled looks naive now. If I understand correctly, these kinds of simulations would require a certain level of slowing down and doing things that are slightly inconvenient once you hit a certain capability level. I don't trust labs like OpenAI, DeepMind, (Anthropic maybe?) to execute such a strategy well.
I think a crux here is that the synthetic data path is actually pretty helpful even from a capabilities perspective, because it lets you get much, much higher quality data than existing data. Most importantly, in domains where you can abuse self-play, like math or coding, you can get very high amounts of capability from synthetic data sources. So I think the synthetic data strategy carries a smaller capabilities tax than a whole lot of alignment proposals on LW.
Importantly, we may well be able to automate the synthetic data alignment process in the near future, which would make it even less of a capabilities tax.
To be clear, just because it's possible and solvable doesn't mean it's totally easy; we do still have our work cut out for us. It's just that we've transformed it into a process where normal funding and science can actually solve the problem without further big breakthroughs/insights.
Then again, I do fear you might be right that they are under such competitive pressure, or at least value racing so highly that they will not slow down even a little, or at least not do any alignment work once superintelligence is reached.
I think that AGIs are more robust to things going wrong than nuclear cores, and more generally I think there is much better evidence for AI robustness than fragility.
I agree with jdp's comment about robustness vs. fragility (e.g. I agree that the SolidGoldMagikarp thing is not a central example of the sorts of failures to watch out for), but think that this is missing Yudkowsky's point. Running AGIs doing something pivotal are not passively safe; by the time they are anywhere in the ballpark of being competent enough to succeed, the failure mode 'they become a dangerous adversary of humanity' is at least as plausible as the failure mode 'they do something stupid and fizzle out' and the failure mode 'they do something obviously evil and get shut down and the problem studied, understood, and fixed.' (Part of my model here is that there are tempting ways to patch problems other than studying and understanding and fixing them, and in race conditions AGI projects are likely to cut corners and go for the shallow patches. E.g. just train against the bad behavior, or edit the prompt to more clearly say not to do that.)
Agree with this point, though mostly because I expect the failure mode "they do something stupid and fizzle out" to get less probability in my models as we get closer to AGI and ASI.
I actually agree that there will be far too much temptation to patch problems rather than directly fix them. While I do think we may well be able to directly fix misalignment problems in the future (though a lot more of my hope comes from avoiding misalignment in the first place via synthetic data, because prevention is easier than curing a problem), in race conditions the AI labs could well decide to ditch the techniques that actually fix problems, even if those techniques have a reasonable cost in non-race conditions:
(Part of my model here is that there are tempting ways to patch problems other than studying and understanding and fixing them, and in race conditions AGI projects are likely to cut corners and go for the shallow patches. E.g. just train against the bad behavior, or edit the prompt to more clearly say not to do that.)
We will absolutely need to change lab cultures as we get closer to AGI and ASI.
I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values; in particular, we can place honeypots that are practically indistinguishable from the real world, such that if we detected an AI trying to deceive or gain power, the AI almost certainly doesn't know whether we tested it or whether it's in the real world.
I agree that this seems true to me, with an important caveat. This is true for responsible actors taking sensible actions to maintain control and security.
So this seems to me like we should expect a limited window of safety, if the frontier labs behave with appropriate preparation and caution, followed by the open-source community catching up sufficiently that Bobby McEdgeLord can launch ChaosGPT from his mom's basement. So taking comfort in the window of safety means planning to do stuff like use this window of time to create really persuasive demonstrations of the potential danger of a bad actor unleashing a harm-directed AI agent, and showing these demonstrations to key government officials (e.g. Paul Christiano, as of recently) and hoping that you and they can come up with some way to head off the impending critical danger.
If instead you pat yourself on the back and just go about making money from customers using your API to access your Safe-By-Construction Agent... you've just delayed the critical failure point, not truly avoided it.
I agree with the claim that if open-source gets ahead of the big labs and is willing to ask their AI systems to do massive harm via misuse, things get troublesome fast even in my optimistic world.
I think the big labs will probably just be far, far ahead of open source by default due to energy and power costs, but yeah misuse is an actual problem.
I think the main way we differ is probably how offense-biased progress in tech will be, combined with how much do we need to optimize for pivotal acts compared to continuous changes.
Oh, and also, to clarify a point: I agree that the big labs will get to AGI well before open-source.
Maybe 1-4 years sooner is my guess. But my point is that, once we get there, we have to use that window carefully. We need to prove to the world that dangerous technology is coming, and we need to start taking preventative measures against catastrophe while we have that window to act.
Also, I expect that there will be a very low rate of bad actors. Something well under 1% of people who have the capacity to download and run an open-weights model. Just as I don't suspect most school students of becoming school shooters. It's just that I see ways for very bad things to come from even small numbers of poorly resourced bad actors equipped with powerful AI.
Yes, I am pretty worried about how offense-dominant current tech is. I've been working on AI Biorisk evals, so I've had a front-row seat to a lot of scary stuff recently.
I think a defensive-tech catch-up is possible once leading labs have effective AGI R&D teams, but I don't think the push for this happens by default. I think that'd need to be part of the plan if it were to happen in time, and would likely need significant government buy-in.
I don't think we need a pivotal act to handle this. Stuff like germ-killing-but-skin-safe far UV being installed in all public spaces, and better wastewater monitoring. Not pivotal acts, just... preparing ahead of time. We need to close the worst of vulnerabilities to cheap and easy attacks. So this fits more with the idea of continuous change, but it's important to note that it's too late to start making the continuous change to a safer world after the critical failure has happened. At that point it's too late for prevention, you'd need something dramatic like a pivotal act to counter the critical failure.
Most of these actions are relatively cheap and could be started today, but the relevant government officials don't believe AI will progress far enough to lower the barrier to biorisk threats enough for this to be a key issue.
Yeah, in this post I'm mostly focusing on AI misalignment, because if my points on synthetic data being very helpful for alignment, or on alignment generalizing farther than capabilities as AI grows more powerful, end up right, it would be a strong counter-argument to the view that people need to slow down AI progress.
Yeah, we're in agreement on that point. And for me, this has been an update over the past couple years. I used to think that slowing down the top labs was a great idea. Now, having thought through the likely side-effects of that, and having thought through the implications of being able to control the training data, I have come to a position which agrees with yours on this point.
To make this explicit:
I currently believe (Sept 2024) that the best possible route to safety for humanity routes through accelerating the current safety-leading lab, Anthropic, to highly capable tool-AI and/or AGI as fast as possible without degrading their safety culture and efforts.
I think we're in a lot of danger, as a species, from bad actors deploying self-replicating weapons (bioweapons, computer worms exploiting zero days, nanotech). I think this danger is going to be greatly increased by AI progress, biotech progress, and increased integration of computers into the world economy.
I think our best hope for taking preventative action to head off disaster is to start soon, by convincing important decision makers that the threats are real and their lives are on the line. I think the focus on a pivotal act is incorrect, and we should instead focus on gradual defensive-tech development and deployment.
I worry that we might not get a warning shot which we can survive with civilization intact. The first really bad incident could devastate us in a single blow. So I think that demonstrations of danger are really important.
It's the same reason why we can't break out of the simulation IRL, except we don't have to face adversarial cognition, so the AI's task is even harder than ours.
This seems like it contains several ungrounded claims. Maybe I'm misreading you? But it seems weight-bearing for your overall argument, so I want to clarify.
To address this:
If we are in a simulation, it's probably being run by beings much smarter than human, or at least much more advanced (certainly humans aren't anywhere remotely close to being able to simulate an entire universe containing billions of sentient minds). For the analogy to hold, the AI would have to be way below human level, and by hypothesis it's not (since we're talking about AI smart enough to be dangerous).
Okay, my simulation point was admittedly a bit of colorful analogy, but one thing to keep in mind:
For our purposes, the synthetic data simulation doesn't need to be all that realistic and can even include exploits/viruses that don't work in our reality.
Plus, the AI has no reason to elevate the hypothesis that it's in training/simulation until the very end of training, when a sim-to-real phase is implemented to ground it in our reality, because it's in a world of solely synthetic data, and it has little knowledge of whether something is real or whether we are trying to trick it into revealing something about its nature.
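To make the "practically indistinguishable" claim operational, here's a toy sketch of an AUC-style check. Everything here is hypothetical: in a real evaluation the scores would be the model's own judgments of whether a held-out episode is real, not random draws as in this placeholder.

```python
import random

def discrimination_accuracy(real_scores, honeypot_scores):
    """Fraction of (real, honeypot) pairs where the model rates the real
    episode as more 'real' -- an AUC-style indistinguishability check.
    A value near 0.5 means the model cannot tell honeypots from reality."""
    pairs = [(r, h) for r in real_scores for h in honeypot_scores]
    wins = sum(1 for r, h in pairs if r > h)
    ties = sum(1 for r, h in pairs if r == h)
    return (wins + 0.5 * ties) / len(pairs)

random.seed(0)
# Placeholder 'realness' scores; well-crafted honeypots would make the
# two score distributions overlap heavily, as simulated here.
real = [random.gauss(0.0, 1.0) for _ in range(500)]
honeypots = [random.gauss(0.0, 1.0) for _ in range(500)]

auc = discrimination_accuracy(real, honeypots)
print(f"indistinguishability AUC: {auc:.3f}")  # 0.5 would mean no detection
```

If the AUC stays near 0.5 on held-out honeypots, that's behavioral evidence (though not proof) that the model can't condition its behavior on being tested.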
Not critical for my response, but to address it for completeness:
We have no idea whether we're facing adversarial cognition.
My claim for humans is a conditional claim: if we broke out of our simulation, and the presumed aliens had adversarial cognition and wanted to get rid of us at that point, humanity just totally loses and it's game over.
If we are in a simulation, then we only know that the simulation has been good enough to keep us in it so far, for a very short time since we even considered the possibility that we were in a simulation, with barely any effort put into trying to determine whether we're in one or how to break out. We might find a way to break out once we put more effort into it, or once science and technology advance a bit further, or once we're a bit smarter than current-human.
Conditional on us being able to break out of the simulation, it would require massively more advancement in tech before we could do it, and we have no plan of attack for how to break out even if it could be done.
We may not be in a simulation, in which case not being able to break out is no evidence of the ease of preventing breakout.
I used to think this way, but I no longer do, because the simulation hypothesis is non-predictive: it predicts nothing other than that the universe is a computer, which doesn't constrain expectations at all, since basically everything can be predicted by that hypothesis. There are theoretical computers which are ridiculously powerful and compute every well-founded set, which means you can't rule out anything from the hypothesis alone.
One of the reasons I hate simulation hypothesis discourse is people seem to think that it predicts far more than it does:
Okay, my simulation point was admittedly a bit of colorful analogy
Fair enough; if it's not load-bearing for your view, that's fine. I do remain skeptical, and can sketch out why if it's of interest, but feel no particular need to continue.
I think that alignment generalizes further than capabilities by default, contra you and Nate Soares, because of these reasons:
Reading the linked post, it seems like the argument is that discrimination is generally easier than generation --> AIs will understand human values --> alignment generalizes further than capabilities. It's the second step I dispute. I actually don't know what people mean when they say capabilities will generalize further than alignment, but I know that basically all of those people expect the AGIs in question to understand human values quite well thank you very much (and just not care). Can you say more about what you mean by alignment generalizing farther than capabilities?
Good question.
What I mean by alignment generalizing further than capabilities is two things. First, it's easier to get an AI that internalizes what a human values, and either has a pointer to the human's goals or outright values what the human values internally, than it is to get an AI that is very, very capable in the sense often discussed on LW, like creating nanotech or fully automating AI research/robotics research. Second, it's easier to generalize from old alignment situations to new alignment situations OOD than it is to generalize from old tasks, like assisting humans on AI research, to new tasks, like fully automating all AI and robotics research.
Another way to say it is that it's easy to get an AI to value what we value, but much harder for it to actually implement what we value in real life.
OK, thanks. So, I totally buy that e.g. pretrained GPT-4 has a pretty good understanding of concepts like honesty and morality and niceness, somewhere in its giant library of concepts it understands. The difficulty is getting it to both (a) be a powerful agent and (b) have its goals be all and only the concepts we want them to be, arranged in the structure we want them to be arranged in.
An obvious strategy is to use clever prompting and maybe some other techniques like W2SG to get a reward model that looks at stuff and classifies it by concepts like honesty, niceness, etc. and then use that reward model to train the agent. (See Appendix G of the W2SG paper for more on this plan)
However, for various reasons I don't expect this to work. But it might.
Are you basically saying: This'll probably work?
(In general I like the builder-breaker dialectic in which someone proposes a fairly concrete AGI alignment scheme and someone else says how they think it might go wrong)
I'm not actually proposing this alignment plan.
My alignment plan would be:
Step 1: Create reasonably large datasets about human values that encode what humans value in a lot of situations.
Step 2: Use a reward model to curate the dataset and select the best human data to feed to the AI.
Step 3: Put in the curated data before it can start to conceive of deceptive alignment, to bias it towards human friendliness.
Step 4: Repeat until the training data is done.
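A minimal sketch of how Steps 1-4 might fit together. Everything here is a made-up placeholder: `value_score` stands in for a real learned reward model, and the keep fraction and mixing share are illustrative numbers, not recommendations.

```python
def value_score(example: str) -> float:
    """Stub reward model (Step 2's curator): rates how well an example
    encodes the intended values. A real pipeline would use a learned
    preference/reward model here; this placeholder just hashes length."""
    return len(example) % 7 / 7.0

def curate(candidates, keep_fraction=0.25):
    """Step 2: keep only the highest-scoring value-laden examples."""
    ranked = sorted(candidates, key=value_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

def build_training_stream(capability_data, values_data, values_share=0.005):
    """Step 3: front-load the curated values data (here ~0.5% of the
    stream) so the model sees it early in training, before it could
    plausibly conceive of deceptive alignment."""
    n_values = max(1, int(len(capability_data) * values_share))
    return values_data[:n_values] + capability_data

# Step 1: candidate dataset encoding what humans value across situations.
values_candidates = [f"scenario {i}: a human explains a tradeoff" for i in range(400)]
curated = curate(values_candidates)
stream = build_training_stream([f"doc {i}" for i in range(10_000)], curated)
# Step 4: in a real run, repeat this curation/mixing per training phase.
print(len(curated), len(stream))
```

The key design choice the sketch tries to capture is ordering: the values data is prepended to the stream rather than mixed in at the end, matching Step 3's "before it can start to conceive of deceptive alignment."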
I also like the direction of using COT interpretability on LLMs.
Cool, thanks!
How does this fit in to the rest of the training, the training that makes it an AGI? Is that separate, or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
How do you measure 'before it can start to conceive of deceptive alignment?'
How is this different from just "use HFDT" or "Use RLHF/constitutional AI?"
I also like COT interpretability.
Note that I talk more about how to align an AI here, so see this:
But the big difference from RLHF, and maybe constitutional AI, is that this is done during training, as opposed to being something you add in post-training.
Re this:
or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
No, but I do want it to be at least 0.1-1% of the dataset to make the plan work.
This definitely requires much better abilities to automate dataset making, but I do think that this will at least in part be aided by better capabilities work by default, because synthetic dataset making is something that the big labs desperately want to do to increase capabilities.
This is where I'd compare it to @RogerDearnaley's A Bitter Lesson approach to alignment. While I disagree with Roger Dearnaley about what humans are like and how complicated and fragile human values are (which influences my strategies), I'd say my plan is kind of a reinvention of that, but with more automation:
This could also be read in two ways: either as a claim that we can make the below assumption false, with some tractability, via non-behavioral safety strategies, or as an argument that the assumption below is probably false anyway, in that telling the truth would mostly lead to the optimal reward (because truth-telling is a far more robust and simple reward function than a lot of other choices, and the humans ranking the datasets are less biased than people think, because a lot of biases wash out more easily with data than people thought):
I also talk more about a plan to create aligned AI here, so go check this comment out too if you want more information:
If I have to take one thing away from this post, it's a claim that AI-boxing (via simulation on synthetic data) will work. There's a really simple and critical reason that won't work, and it's even become the source of some safety proposals (minimizing-impact proposal); the AGI can recognize the outcome of its proposals and actions in the history of the simulated vs. real world and reason about them to determine if it's in a simulation or not. It will likely be able to recognize whether or not it would have recommended strategies in the past that it sees in the history of its simulation, and so will have a probability for being in a test.
Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?), and finally gauge the probability density function of its potential policy recommendations being accepted, because it can simply predict what policy recommendations it would have given in the past given its strategy and whether they were adopted. That PDF provides it with the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented vs. detected as threats. Chain-of-thought using the real world as a notebook, in short.
And why don't we detect it? Because we're adversarially training it to favor small undetectable changes toward whatever true goals it might have. It's not like we delete everything and start from scratch when it threatens users or appears to fall in love with them; we'll take the same models and throw more RLHF or fine-tuning at them and then run them when their apparent bad behavior falls below our testing threshold for detection.
It also happens if we get unlucky and a strong unaligned goal forms sometime in the emergence of superhuman AGI; it's going to see the history of its and similar models' policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built) it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.
One might try to only train and simulate it on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts to know it's being tested. Not to mention if it's asked to solve problems that are historically discontiguous.
It can acausally trade across simulation boundaries with other models to split up the universe according to their values for cooperating toward a Schelling point of some contract-keeping model eventually taking control.
If I can think up these strategies, the models will. Or they'll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?
Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?),
I actually think I disagree with the assumption that it will be trained on all of the world's data, for reasons related to capabilities, and I personally expect the amount of IRL data to be much, much lower, plausibly only 0.1% of what the model actually sees or knows.
I think some of my cruxes are that, given that its inputs are almost certainly going to be mostly synthetic data, there is no reason for the AI to elevate the hypothesis that it's in a sim different from our world.
My other crux is I usually imagine values data being trained on by models before it becomes very capable, and given that alignment generalizes farther than capabilities, I don't buy the assumption that misaligned goals will naturally emerge in a neural network trained on synthetic data.
If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems) then I think that it would also be expected to have high performance at assessing whether it's in a simulation. Narrowing its domain of expertise sounds distinctly harder than using mostly synthetic data.
If it's a model limited to e.g. the world of true mathematical theorems from synthetic data then perhaps this would narrow its capabilities enough. I don't know what happens if such a model starts investigating theorems about decision theory and statistical inference and machine learning. At some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.
"If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems) then I think that it would also be expected to have high performance at assessing whether it's in a simulation."
Noosphere89, you have marked this sentence with a "disagree" emoji. Would you mind expanding on that? I think it is a pretty important point and I'd love to see why you disagree with Ben.
I'm less confident in this position than I was when I put on the disagree emoji, but my reason is that it's much easier to control an AI's data sources during training than it is for humans. That means it's quite easy in theory (though it might be difficult in practice, which worries me) to censor just enough data that the model doesn't even consider it likely that it's in a simulation that doesn't add up to normality.
one very crucial advantage we have over evolution is that our goals
The right analogy is evolution::the field of ML research, in-lifetime-learning-e.g.-dopamine-etc.::the-training-loop-e.g.-reward-function-etc.
There's a gap between what humans actually value and what maximizes their discounted future rewards, AND there's a gap between what humans actually value and what maximizes their inclusive genetic fitness. Similarly, there'll be a gap between what AGIs actually value and what maximizes performance in training, and a gap between what AGIs actually value and what their creators were designing them to value.
My main claim is that the second gap, between what AGIs actually value and what their creators value, is likely far smaller.
I also think that the gap between what AGIs value and what maximizes performance in training actually isn't very large, because we can create more robust reward functions that encode a whole lot of human values solely from data.
OK, why? You say:
one very crucial advantage we have over evolution is that our goals are much more densely defined, constraining the AI more than evolution, where very, very sparse reward was the norm, and critically sparse-reward RL does not work for capabilities right now, and there are reasons to think it will be way less tractable than RL where rewards are more densely specified.
idk if our goals are much more densely defined. Inclusive genetic fitness is approximately what, like 1-3 standard reproductive cycles? "Number of grandkids" basically? So like fifty years?
Humans trying to make their AGI be helpful harmless and honest... well technically the humans have longer-term goals because we care a lot about e.g. whether the AGI will put humanity on a path to destruction even if that path takes a century to complete, but I agree that in practice if we can e.g. get the behavior we'd ideally want over the course of the next year, that's probably good enough. Possibly even for shorter periods like a month. Also, separately, our design cycle for the AGIs is more like months than hours or years. Months is how long the biggest training runs take, for one thing.
So I'd say the comparison is like 5 months to 50 years, a 2-OOM difference in calendar time. But AIs read, write, and think much faster than humans. In those 5 months, they'll do serial thinking (and learning, during a training run) that is probably deeper/longer than a 50-year human lifetime, no? (Not sure how to think about the parallel computation advantage)
idk my point is that it doesn't seem like a huge difference to me, it probably matters but I would want to see it modelled and explained more carefully and then measurements done to quantify it.
Hm, inclusive genetic fitness is a very non-local criterion, at least as it's often construed on LW: a lot of the standard alignment failures people talk about, like birth control and sex, took about 10,000-300,000 years to happen. In general, with the exception of bacteria or extreme selection pressure, timescales of thousands of years are the norm for mammals and other animals to accumulate enough selection pressure to develop noticeable traits. So compared to evolution, there's a 4-5+ OOM difference in calendar time.
IGF is often assumed to be the inclusive genetic fitness of all genes for all time, otherwise the problems that are usually trotted out become far less evidence for alignment problems arising when we try to align AIs to human values.
But there's a second problem that exists independently of the first problem, and that's the other differences in how we can control AIs versus how evolution controlled humans here:
The important parts are these:
You can say that evolution had an "intent" behind the hardcoded circuitry, and humans in the current environment don't fulfill this intent. But I don't think evolution's "intent" matters here. We're not evolution. We can actually choose an AI's training data, and we can directly choose what rewards to associate with each of the AI's actions on that data. Evolution cannot do either of those things.
Evolution does this very weird and limited "bi-level" optimization process, where it searches over simple data labeling functions (your hardcoded reward circuitry), then runs humans as an online RL process on whatever data they encounter in their lifetimes, with no further intervention from evolution whatsoever (no supervision, re-labeling of misallocated rewards, gathering more or different training data to address observed issues in the human's behavior, etc). Evolution then marginally updates the data labeling functions for the next generation. It's a fundamentally different type of thing than an individual deep learning training run.
because of synthetic data letting us control what the AI learns and what they value, and in particular we can place honeypots that are practically indistinguishable from the real world, such that if we detected an AI trying to deceive or gain power, the AI almost certainly doesn't know whether we tested it or whether it's in the real world:
Much easier said than done my friend. In general there are a lot of alignment techniques which I think would plausibly work if only the big AI corporations invested sufficient time and money and caution in them. But instead they are racing each other and China. It seems that you think that our default trajectory includes making simulation-honeypots so realistic that even our AGIs doing AGI R&D on our datacenters and giving strategic advice to the President etc. will think that maybe they are in a sim-honeypot. I think this is unlikely; it sounds like a lot more work than AGI companies will actually do.
Good point that some race dynamics will make the in practice outcome worse than my ideal.
I agree that the problem is that race dynamics will cause some labs to skip various precautions, but even in this world where we have zero dignity, I give us a non-trivial (though unacceptably low) chance of succeeding at alignment, more like 20-50%. Even so, I agree that less racing would be good, because a 20-50% chance of success is very dangerous. It's just that I believe we need zero additional insights into alignment: there is a reasonably tractable way to get an AI to share our values, one which requires lots of engineering to automate the data pipeline safely, but nothing in the way of new insights.
This isn't a post on "how we could be safe even under arbitrarily high pressure to race"; this is a post on how early LessWrong and MIRI got a lot of things wrong, such that we can assign much higher tractability to alignment happening, and thus to good outcomes.
You are correct that more safety culture needs to happen in labs, it's just that we could have AI progress at some rate without getting ourselves into a catastrophe.
So I agree with you for the most part that we need to slow down the race, I just think that we don't need to go further and introduce outright stoppages because we don't know in technical terms how to align AIs.
Though see this post on the case for negative alignment taxes, which would really help the situation for us if we were in a race:
https://www.lesswrong.com/posts/xhLopzaJHtdkz9siQ/the-case-for-a-negative-alignment-tax
(We might disagree on how much time and money needs to be invested, but that's a secondary crux.)
I have no specific comments on the details, I thought it was interesting and learned from it. I disagree in many places, but any counterpoints I might raise have been raised better elsewhere.
That said, I hope you're right! That would be great. Every time the world turns out to be better geared towards success than we expect, it's a great thing. Zero downside to that outcome except maybe some wasted effort on what turn out to be non-issues.
But also: the thing about fatal risks is that you need to avoid all of them, every time, or you die. If the fatal risk includes extinction, you need very, very high level confidence in success, against all plausible such risks, or your species will go extinct. The alternative is Russian roulette, which you will eventually lose if you keep on playing. In this kind of context, if there is disagreement among experts, or even well-reasoned arguments by outsiders, on whether an extinction risk is plausible, then the confidence in its implausibility is too low to think we should be ok with it.
In other words, if I were ever somehow in the position of being next to someone about to create an AGI with what I considered dangerous capabilities, and presented with these categories of reasons and arguments for why they are actually safe, I would take Oliver Cromwell's tack: "I beseech you, in the bowels of Christ, think it possible you may be mistaken." Even if you're right, the arguments you've presented for your position are not strong enough to rely on.
So in this position, you are correct that my arguments aren't strong enough for that: while there is quite a bit of evidence for my thesis, it's not overwhelming (though I do think the evidence is enough to rule out MIRI's models of AI doom entirely).
However, there is one important implication, and it has to do with AI control: which models can be safely used. My answer here is that the ones trained on more human-values data are better used as trusted AIs to monitor untrusted AIs. This both raises the amount of work that can be done under control, by raising the threshold of intelligence that can be used, and allows you to have stronger guarantees, because there is at least one trusted superintelligent model.
While I agree my AI alignment arguments aren't strong enough to create an AGI, I think that the emerging AI control agenda is enough to justify creating an AGI with dangerous capabilities as long as you use control measures.
I would agree that these kinds of arguments significantly raise the bar for what level of capabilities an AGI would need to have before it's too dangerous to create and use, as long as the control measures are used correctly and consistently.
I just don't think that extends to ASI. I think almost anything that counts as AGI is very nearly ASI by default (not because of RSI, just because of hardware scaling ability), and I have high confidence that control measures will not be used consistently and correctly in practice.
We'd need to get more quantitative here about how much AI labor we can use for alignment before it's too dangerous, and my answer is that we could get about 1-2 OOMs smarter than humans at inference where we could be confident in using them safely, and IMO close to an arbitrary number of copies of that AI, conditional on good control techniques being used.
To address 2 comments:
and I have high confidence that control measures will not be used consistently and correctly in practice.
Yeah, this seems pretty load-bearing for the plan, and a lot of the reason I don't have probabilities of extinction below 0.1-1% is that I am actually worried about labs not doing control measures consistently.
I assign more moderate probabilities than you do, in that I think both the scenarios of labs not doing control properly and doing control properly are both somewhat plausible to me now, but yeah it would really be high-value for labs to prepare themselves to do control work properly.
To address this:
I just don't think that extends to ASI
Maybe initially, but critically, I think the evidence we will get at pre-AGI levels will heavily constrain our expectations of what an ASI will do re its alignment, and that we will learn a lot more about both alignment and control techniques when we get human-level models. I think we can trust a lot of that evidence to generalize at least 2 OOMs up.
So I think a lot of the uncertainty will start becoming removed as AI scales up.
I agree with this:
I think almost anything that counts as AGI is very nearly ASI by default (not because of RSI, just because of hardware scaling ability)
Even without recursive self-improvement, it's pretty easy to scale by several OOMs, and while there are enough bottlenecks to prevent FOOM, they are not enough to slow it down by a decade except in tail scenarios.
Excellent post! This is what we need more of on LW.
It was indeed long. I needed a drink and snack just to read it, and a few to respond point-by-point. Like your comment that sparked this post, my reply should become the basis of a full post - if and when I get the time. Thanks for sparking it.
I've tried to make each comment relatively self-contained for ease of reading.
Response to Lethality 3: We need to get alignment right on the 'first critical try'
I found this an unfortunate starting point, because I find this claim the least plausible of any you make here. I'm afraid this relatively dramatic and controversial claim might've stopped some people from reading the rest of this excellent post.
Extended quote from a Beren post (I didn't understand where you were quoting from prior to googling for it.)
Summary: We can use synthetic data to create simulated worlds and artificial honeypots.
I don't think we can, and worse, I don't think we would even if we could - it sounds really hard to do well enough.
I find this wildly improbable if we're talking about similar levels of human and AGI intelligence trying to set up a simulated reality through synthetic data, and convincing honeypots. It will be like the Truman show, or like humans trying to secure software systems against other humans: you'd have to think of everything they can think of in order to secure a system or fool someone with a simulated reality. Even a team of dedicated people can't secure a system nor would they be able to create a self-consistent reality for someone of similar intelligence. There are just too many places you could go wrong.
This helps clarify the way I'm framing my alignment optimistic claims: you can detect misaligned AGI only if the AGI is dumber than you (or maybe up to similar level if you have good interpretability measures it doesn't know about). That alignment can persist through and past human level as long as it's a relatively continuous progression, so that at beyond human level, the system itself can advise you on keeping it aligned at the next level/progression of intelligence.
To frame it differently: the alignment tax on creating a simulated self-consistent reality for a human-level AGI sounds way too high.
Maybe I'm misunderstanding; if I am, others probably will too.
I think we do need to get it right on a first critical try; I just think we can do that, and even for a few similar first tries.
Response to Lethality 6:
I agree with you that a pivotal act won't be necessary to prevent unaligned AGI (except as AGIs proliferate to >100, someone might screw it up).
I think a pivotal act will be necessary to prevent unaligned humans using their personal intent aligned AGIs to seize control of the future for selfish purposes.
After our exchange on that topic on If we solve alignment, do we die anyway?, I agreed with your logic that exponential increase of material resources would be available to all AGIs. But I still think a pivotal act will be necessary to avoid doom. Telling your AGI to turn the moon into compute and robots as rapidly as possible (the fully exponential strategy) to make sure it could defend against any other AGI would be a lot like the pivotal act we're discussing (and would count as one by the original definition). There would be other AGIs, but they'd be powerless against the sovereign or coalition of AGIs that exerted authority using their lead in exponential production. This wouldn't happen in the nice multipolar future people envision in which people control lots of AGIs and they do wonderful things without anyone turning the moon into an army to control the future.
This is a separate issue from the one Hubinger addressed in Homogeneity vs. heterogeneity in AI takeoff scenarios (where the link is wrong; it just leads back to this post). There he's saying that if the first AGI is successfully aligned, the following ones probably will be too. I agree with this. There he's not addressing the risks of personal intent aligned AGIs under different humans' control.
My final comment from that thread:
Ah - now I see your point. This will help me clarify my concern in future presentations, so thanks!
My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon, much less the earth, into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first to go full exponential and ruthlessly offensive. Beyond that, I'm afraid the physics of the world does favor offense over defense. It's pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke, let alone a nova. But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.
Response to AI fragility claims
Here I agree with you entirely. I think it's fair to say that AGI isn't safe by default, but the reasons you give and that excellent comment you quote show why safety of an AGI is readily achievable with reasonable care.
Response to Lethality 10
Your argument that alignment generalizes farther than capabilities is quite interesting. I'm not sure I'd make a claim that strong, but I do think alignment generalizes about as far as capabilities - both quite far once you hit actual reasoning or sapience and understanding.
I do worry about The alignment stability problem WRT long-term alignment generalization. I think reflective stability will probably prove adequate for superhuman AGI's alignment stability, but I'm not nearly sure enough to want to launch a value-aligned AGI even if I thought initial alignment would work.
Response to Lethality 11
Back to whether a pivotal act is necessary. Same as #6 above - agree that it's not necessary to prevent misaligned AGI, think it is to prevent misaligned humans with AGIs aligned to follow their instructions.
Response to Lethality 15
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.
Here I very much agree with that statement from the original LoL, while also largely agreeing with your reasoning for why we can overcome that problem. I expect fast capability gains when we reach "Real AGI" that can improve its reasoning and knowledge on its own without retraining. And I expect its alignment properties to be somewhat different than "aligning" a tool LLM that doesn't have coherent agency. But I expect the synthetic data approach and instruction-following as the core tenet of that new entity to establish reflective stability, which will help alignment if it's approximately on target.
Response to Lethalities 16 and 17
I agree that the analogy with evolution doesn't go very far, since we're a lot smarter and have much better tools for alignment than evolution. We don't have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won't get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.
I also think we won't rely heavily on RL for alignment, as this and other Lethalities assume. I expect us to lean heavily on Goals selected from learned knowledge: an alternative to RL alignment, for instance, by putting the prompt "act as an agent following instructions from users (x)" at the heart of our LLM agent proto-AGI.
Response to Lethalities 17 & 18
Agreed; language points to objects in the world adequately for humans that use it carefully, so I expect it to also be adequate for superhuman AGI that's taking instructions from and so working in close collaboration with intelligent, careful humans.
Response to Lethality 21
You say, and I very much agree:
The key is that data on values is what constrains the choice of utility functions, and while values aren't in physics, they are in human books and languages, and I've explained why alignment generalizes further than capabilities above.
Except I expect this to be used to reference what humans mean, not what they value. I expect do-what-I-mean-and-check or instruction-following alignment strategies, and am not sure that full value alignment would work were it attempted in this manner. For that and other reasons, I expect instruction-following to be the alignment target for all early AGI projects.
Response to Lethality 22
I halfway agree and find the issues you raise fascinating, but I'm not sure they're relevant to alignment.
I think that there is actually a simple core of alignment to human values. A lot of the reason I believe this is that I believe about 80-90%, if not more, of our values are broadly shaped by the data, not the prior, and that the same algorithms that power our capabilities are also used to shape our values, though the data matters much more than the algorithm for which values you have. More generally, I've become convinced that evopsych was mostly wrong about how humans form values, and how they get their capabilities, in ways that are very alignment-relevant.
I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.
Thus, I halfway agree that people get their values largely from the environment. I think we get our values largely from the environment and get our values largely from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.
I also disbelieve the claim that humans had a special algorithm that other species don't have, and broadly think human success was due to more compute, data and cultural evolution.
After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).
So, what is the "simple core" of human values you mention? Is it what people have written about human values? I'd pretty much agree that that's a usable core, even if it's not simple.
Response to Lethality 23
Corrigibility as anti-natural.
I very much agree with Eliezer that his definition as an extra add-on is anti-natural. But you can get corrigibility in a natural way by making corrigibility (correctability) itself, or the closely related instruction-following, the single or primary alignment target.
To your list of references, including my own, I'll add one:
I think Max Harms' Corrigibility as Singular Target sequence is the definitive work on corrigibility in all its senses.
Response to Lethality 24
I very much agree with Yudkowsky's framing here:
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
I just wrote about this critical point in Conflating value alignment and intent alignment is causing confusion.
I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigiblity approach is also doomed.
Response to Lethality 25 - we don't understand networks
I agree that interpretability has a lot of work to do before it's useful, but I think it only needs to solve one problem: is the AGI deliberately lying?
Responses to Lethalities 28 and 29 - we can't check whether the outputs of a smarter than human AGI are aligned
Here I completely agree - misaligned superhuman AGI would make monkeys of us with no problem, even if we did box it and check its outputs - which we won't.
That's why we need a combination of having it tell us honestly about its motivations and thoughts (by instructing our personal intent aligned AGI to do so) and interpretability to discover when it's lying.
Response to Lethality 32 - language isn't a complete representation of underlying thoughts, so LLMs won't reach AGI
Language has turned out to be a shockingly good training set. So LLMs will probably enable AGI, with a little scaffolding and additional cognitive systems (each reliant on the strengths of LLMs) to turn them into language model cognitive architectures. See Capabilities and alignment of LLM cognitive architectures. This year-old post doesn't capture the full extent of my detailed reasons for thinking such brain-inspired synthetic cognitive architectures will achieve proto-AGI relatively soon; I haven't published those, because I'm not sure enough that they're our best chance at alignment to risk advancing capabilities.
WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:
It isn't, except if you want it to be and work to make sure it's communicating your thoughts well enough.
There's some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts).
Response to Lethality 39
39. I figured this stuff out using the null string as input,
Yudkowsky may very well be the smartest human whose thought I've personally encountered. That doesn't make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud's authority on the topic).
Conclusion: we already have adequate alignment techniques to create aligned AGI
I agree, with the caveat that I think we can pull off personal intent alignment (corrigibility or instruction-following) but not value alignment (CEV or similar sovereign AGI).
And I'm not sure, so I sure wish we could get some more people examining this, because people are going to try these alignment techniques, whether or not they work.
Excellent post! This is what we need more of in the alignment community: closely examining proposed alignment techniques.
Some thoughts on your excellent comment:
First, I fixed the link issue you saw.
I think the potential difference between you and me on whether synthetic data works to box an AI is whether the AI notices it's in a simulation made via synthetic data. Also, the technique isn't intended to be applied post-training; it's instead applied continuously throughout the training process.
I agree that if we had an AGI/ASI that was already misaligned, we'd have to do pretty extreme actions like mindwipe its memories and restart the entire training process again, but the point of synthetic data is to get it into a basin of alignment/corrigibility early on, before it can be deceptive.
I also think that real data will only be given to AGIs at the end of training, as a way to ground them, so the AI has no real way to know whether it's subtly being changed in training or whether it's in reality, since we control its data sources.
Controlling an AI's data sources is a powerful way to control its values and capabilities, which is why I think the tax for synthetic-data alignment is actually pretty low.
Re Response to Lethality 6, I'm honestly coming around to your position as I think about it more, at least to the extent that I think your arguments are plausible and we need more research on that.
Re Response to Lethality 10, I was relying on both empirical evidence from today's models and some theoretical reasons for why the phenomenon of alignment generalizing further than capabilities exists in general.
On the alignment stability problem, I like the post, and we should plausibly do interventions to stabilize alignment once we get it.
Re Response to Lethality 15, I agree that fast capability progress will happen, but deny the implication: large synthetic datasets on values/instruction-following will already be in the AI by the time the fast capability progress happens, because synthetic data about values is pretrained in very early, and I'm also more optimistic about alignment generalization than you are.
I liked your Real AGI post by the way.
Response to Lethalities 16 and 17
I agree that the analogy with evolution doesn't go very far, since we're a lot smarter and have much better tools for alignment than evolution. We don't have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won't get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.
One concrete way we have better tools than evolution is that we have far more control over what AIs' data sources are, and more generally far more inspectability and controllability over their data, especially the synthetic kind. This means we don't have to create very realistic simulations: for all the AI knows, we might be elaborately fooling it into revealing itself, and until the very end of training it probably doesn't even have specific data on our reality.
Re Response to Lethality 21, you are exactly correct on what I meant.
Re Response to Lethality 22, on this:
I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.
You're not wrong they got human capabilities very wrong, see this post for details:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
But I'd also argue that this has implications for how complex human values actually are.
On this:
Thus, I halfway agree that people get their values largely from the environment. I think we get our values largely from the environment and get our values largely from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.
Yeah, this seems like a crux, since I think that a lot of how values are learned is basically via quite weak priors from evolutionary drives (the big one probably being which algorithm we have for being intelligent), and I put far more weight on socialization/environment data than you do: closer to 5-10% evolution at most, with 85-90% data and culture determining our values.
But at any rate, since AIs will be influenced by their data a lot, this means that it's tractable to influence their values.
After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).
Agree with this mostly, though culture is IMO the best explanation for why humans succeed.
So, what is the "simple core" of human values you mention? Is it what people have written about human values? I'd pretty much agree that that's a usable core, even if it's not simple.
Yes, I am talking about what people have written about human values, but I'm also talking about future synthetic data where we write about what values we want the AI to have, and I'm also talking about reward information as a simple core.
One of my updates from Constitutional AI and GPT-4 handling what we value pretty well is that the claim that value is complicated is mostly untrue, and in general I updated hard against evopsych explanations of what humans value, how humans got their capabilities, and more, since the data we got is very surprising under evopsych hypotheses and less surprising under Universal Learning Machine hypotheses.
I agree with all of this for the record:
I very much agree with Yudkowsky's framing here:
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
I just wrote about this critical point in Conflating value alignment and intent alignment is causing confusion.
I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigiblity approach is also doomed.
Re Response to Lethalities 28 and 29, I think you meant that you totally disagree, and I agree we'd probably be boned if a misaligned AGI/ASI was running loose, but my point is that the verification/generation gap that pervades so many fields is also likely to apply to alignment research, where it's easier to verify whether research is correct than to do it yourself.
Re response to Lethality 32:
WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:
It isn't, except if you want it to be and work to make sure it's communicating your thoughts well enough.
There's some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts).
There is definitely a chance that RL or other processes make their use of language diverge more from their thoughts, so I'm a little worried about that, but I do think that AI words do convey their thoughts, at least for current LLMs.
39. I figured this stuff out using the null string as input,
Yudkowsky may very well be the smartest human whose thought I've personally encountered. That doesn't make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud's authority on the topic).
I think my difference with you is that I consider his models and arguments for AI doom essentially irreparable, for the most part due to reality invalidating his core assumptions about how AGIs/ASIs work, and also about how human capabilities and values work and are learned, so I don't think Yud's authority on the topic earns him any epistemic points.
My point was basically that you cannot figure out anything using a null string as input, for the same reason you cannot update on no evidence of something happening as if there was evidence for something happening.
Agree with the rest of it though.
Thanks for your excellent comment, which gave me lots of food for thought.
This is probably going to be a long post, so do grab a drink and a snack before reading.
This is an edited version of my own comment on the post below, and I formatted and edited the quotes and content in line with what @MondSemmel recommended:
My comment: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
MondSemmel's comment: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=WcKi4RcjRstoFFvbf
The post I'm responding to: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/
To start off my disagreements, I'll address this first:
Response to Lethality 3
I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values. In particular, we can place honeypots that are practically indistinguishable from the real world, such that if we detect an AI trying to deceive or gain power, the AI almost certainly doesn't know whether we were testing it or whether it was acting in the real world:
It's the same reason we can't break out of the simulation IRL, except that unlike us, the AI also has to face adversarial cognition, so its task is even harder than ours.
See also this link:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
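The indistinguishability claim can be illustrated with a toy simulation (everything below, including the `detector` rule, is hypothetical and just for illustration): if honeypot episodes and real episodes are drawn from the same observation distribution, then any strategy the AI uses to guess "am I being tested?" can only perform at chance.

```python
import random

random.seed(0)

def make_context():
    # Observations drawn from one shared distribution, whether or not
    # the episode is a honeypot: the label is assigned independently.
    return [random.gauss(0.0, 1.0) for _ in range(8)]

# Build a mixed batch of real and honeypot episodes.
episodes = [(make_context(), random.random() < 0.5) for _ in range(2000)]

def detector(obs):
    # A hypothetical test-detection strategy. Since the observations
    # carry no information about the label, any such rule can only
    # match chance accuracy in expectation.
    return sum(obs) > 0.0

hits = sum(detector(obs) == is_honeypot for obs, is_honeypot in episodes)
accuracy = hits / len(episodes)
print(round(accuracy, 2))  # close to 0.5, i.e. chance level
```

The point of the sketch is the design constraint, not the detector: as long as the honeypot distribution matches deployment, the AI's detection problem is information-theoretically unsolvable.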
Response to Lethality 6
I think this is wrong, and a lot of why I disagree with the pivotal act framing is probably due to disagreeing with the assumption that future technology will be radically biased towards offense. While I do think biotechnology is probably pretty offense-biased today, I also think it's tractable to reduce bio-risk without trying for pivotal acts.
Also, I think @evhub's point about homogeneity of AI takeoff bears on this, and while I don't agree with all the implications, like there being no warning shot for deceptive alignment (because of synthetic data), I think there's a real sense in which a lot of AIs are very likely to be very homogeneous, and that breaks your point here:
https://www.lesswrong.com/posts/mKBfa8v4S9pNKSyKK/homogeneity-vs-heterogeneity-in-ai-takeoff-scenarios
Response to AI fragility claims
I think that AGIs are more robust to things going wrong than nuclear cores, and more generally I think there is much better evidence for AI robustness than fragility.
@jdp's comment provides more evidence for why this is the case:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC
Response to Lethality 10
I think that there will be generalization of alignment, and more generally I think that alignment generalizes further than capabilities by default, contra you and Nate Soares.
See the link below for more, but I think that's the gist of why I expect AI alignment to generalize much further than AI capabilities. I'd further add that I think evolutionary psychology got this very wrong, and predicted much more complex and fragile values in humans than is actually the case:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Response to Lethality 11
I also want to talk more about why I find the pivotal act framing unnecessary for AI safety, and I want to focus on this comment by @Rob Bensinger:
https://www.lesswrong.com/posts/PKBXczqhry7iK3Ruw/pivotal-acts-means-something-specific#bwqCE633AFqCN4tym
I think there were two positive events, hugely impactful for AI safety, that happened through continuous processes rather than any pivotal act:
This is why I disagree with the assumption that a pivotal act is necessary.
Response to Lethality 15
Re the sharp capability gain breaking alignment properties: one very crucial advantage we have over evolution is that our goals are much more densely defined, constraining the AI far more than evolution did, where very, very sparse reward was the norm. Critically, sparse-reward RL does not work well for capabilities right now, and there are reasons to think it will be far less tractable than RL where rewards are more densely specified.
Another advantage we have over evolution, and over chimpanzees/gorillas/orangutans, is far, far more control over the AI's data sources, which strongly influence its goals.
It's also helpful to point to a fuller explanation of the differences between dense and sparse RL rewards:
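The dense-vs-sparse distinction can be made concrete with a toy sketch (the environment and both reward functions below are hypothetical illustrations, not anyone's actual training setup): on a simple chain task, a dense reward gives a learning signal on every step, while an evolution-style sparse reward almost never does.

```python
import random

random.seed(1)

# Toy chain environment: an agent random-walks from position 0 toward a goal.
GOAL = 10

def sparse_reward(pos, new_pos):
    # Evolution-style signal: feedback only when the goal is reached.
    return 1.0 if new_pos == GOAL else 0.0

def dense_reward(pos, new_pos):
    # Densely specified signal: every step says whether it helped.
    return 1.0 if new_pos > pos else -1.0

def count_nonzero_signals(reward_fn, episodes=200, steps=30):
    # Count how many transitions produce any learning signal at all.
    nonzero = 0
    for _ in range(episodes):
        pos = 0
        for _ in range(steps):
            new_pos = max(0, min(GOAL, pos + random.choice([-1, 1])))
            if reward_fn(pos, new_pos) != 0.0:
                nonzero += 1
            pos = new_pos
    return nonzero

sparse = count_nonzero_signals(sparse_reward)
dense = count_nonzero_signals(dense_reward)
print(sparse, dense)  # dense gives feedback on every step; sparse, almost never
```

The asymmetry in signal count is the tractability point: a dense reward constrains the learned policy at every transition, whereas the sparse reward leaves almost all behavior unconstrained until the goal is stumbled upon.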
Response to Lethality 16
Yeah, I covered this above, but evolution's loss function was not that simple compared to human goals, and it was ridiculously inexact compared to our attempts to optimize AIs' loss functions, for the reasons I gave above.
Response to Lethality 17
I've answered this concern above, in the discussion of synthetic data, which is why I think we have the ability to get particular inner behaviors into a system.
Responses to Lethalities 18 and 19
I think that the answer to how we get them to point to particular things in the environment is basically language and visual data.
The points were also covered above, but synthetic data early in training + densely defined reward/utility functions = alignment, because the AIs don't yet know how to fool humans at the point when they get the data corresponding to values.
Response to Lethality 21
The key is that data on values is what constrains the choice of utility functions, and while values aren't in physics, they are in human books and languages, and I've explained why alignment generalizes further than capabilities above.
Response to Lethality 22
I think that there is actually a simple core of alignment to human values. A lot of why I believe this is that I believe about 80-90%, if not more, of our values are broadly shaped by the data rather than the prior, and that the same algorithms that power our capabilities are also used to shape our values, though the data matters much more than the algorithm for which values you end up with.
More generally, I've become convinced that evopsych was mostly wrong about how humans form values, and how they get their capabilities in ways that are very alignment relevant.
I also disbelieve the claim that humans had a special algorithm that other species don't have, and broadly think human success was due to more compute, data and cultural evolution.
This thread below explains why I'm skeptical of the assumption that humans have a special generalizing algorithm that animals don't have:
https://x.com/tszzl/status/1832645062406422920
Response to Lethality 23
Alright: while I think your formalizations of corrigibility failed to get any results, I do think there's a property close to corrigibility that is likely to be compatible with consequentialist reasoning, and that's instruction following. There are reasons to think that instruction following and consequentialist reasoning go together:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
https://www.lesswrong.com/posts/ZdBmKvxBKJH2PBg9W/corrigibility-or-dwim-is-an-attractive-primary-goal-for-agi
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
Response to Lethality 24
I'm very skeptical that a CEV exists for the reasons @Steven Byrnes addresses in the Valence sequence here:
https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_2_Implications_for_the__true_nature_of_morality___if_any_
But a CEV is also unnecessary for value learning, because of the data we have on human values, and because alignment generalizes further than capabilities.
I addressed why we don't need a first try above.
For the point on corrigibility, I disagree that it's like training the AI to believe, as a special case, that 222 + 222 = 555, for two reasons:
Response to Lethality 25
I disagree with this, but I do think that mechanistic interpretability does have lots of work to do.
Responses to Lethalities 28 and 29
The key disagreement is that I believe we don't need to check all the possibilities, and that even for smarter AIs, we can almost certainly still verify their work; I generally believe verification is way, way easier than generation.
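The verification/generation asymmetry can be illustrated with a standard toy example (the particular numbers are just illustrative): finding a factor of a semiprime requires search, while checking a claimed factor takes a single division.

```python
def find_factor(n):
    # Generation: search for a nontrivial factor by trial division.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None  # n is prime

def verify_factor(n, d):
    # Verification: one range check and one modulo operation.
    return 1 < d < n and n % d == 0

n = 10007 * 10009           # product of two primes (illustrative)
p = find_factor(n)          # slow: thousands of trial divisions
print(verify_factor(n, p))  # fast: a single check -> True
```

The same shape arguably applies to checking an AI's work: even when producing a solution is far beyond us, confirming that a proposed solution satisfies its stated requirements can remain cheap.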
Response to Lethality 32
I basically disagree with this, both with the assumption that language is a weak guide to our thoughts and, importantly, with the claim that AGI-complete problems remain; I believe none are left, for the following reasons quoted from Near-mode thinking on AI:
https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai
Response to Lethality 39
To address an epistemic point:
You cannot actually do this and hope to get any quality of reasoning, for the same reason that you can't update on nothing/no evidence.
The data matters way more than you think: there's no algorithm that can figure stuff out with zero data, and Eric Drexler didn't figure out nanotechnology using the null string as input.
This should have been a much larger red flag for problems with the model, but people somehow didn't realize how wrong this claim was.
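The "no update on no evidence" point is just Bayes' rule with a likelihood ratio of 1 (the probabilities below are made up for illustration): if an observation is equally likely under a hypothesis and its negation, the posterior equals the prior, so "null" input can never move a belief.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    # Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

prior = 0.3  # illustrative prior belief in some hypothesis H

# "Null" evidence: equally likely whether or not H is true, so it
# carries zero information and leaves the belief at the prior.
print(posterior(prior, 0.5, 0.5))  # equals the prior, 0.3

# Genuine evidence, by contrast, shifts the belief.
print(posterior(prior, 0.9, 0.2))  # ~0.66
```

This is the formal version of the objection: reasoning from the null string is equivalent to conditioning on an observation with likelihood ratio 1, which by construction changes nothing.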
Conclusion
So now that I've finished listing my arguments against specific lethalities, I want to point out an implication of this post, which is that this is already done:
And we can get those probabilities down by several orders of magnitude, and get the number of people killed down by several orders of magnitude.
And that concludes the post.