Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take).

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.

I believe this charge is invalid, and that the two views are much more compatible than they might seem. I think the issue lies in the mismatch between their subject matters.

It's clearer if you taboo the word "AI":

  • The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
  • The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.

It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck in when people extrapolate findings from LLM studies to superintelligences.

It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings regarding what algorithms these AIs implement generalize to statements regarding what algorithms the forward passes of LLMs circa 2020s implement.

By the same token, LLMs' algorithms do not necessarily generalize to how an AGI's cognition will function. Their limitations are not necessarily an AGI's limitations.[1]


What the Fuss Is All About

To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place.

Consider humans. Some facts:

  • Humans possess an outstanding ability to steer the world towards their goals, and that ability grows sharply with their "intelligence". Sure, there are specific talents, and "idiot savants". But broadly, there does seem to be a single variable that mediates a human's competence in all domains. An IQ 140 human would dramatically outperform an IQ 90 human at basically any cognitive task, and crucially, be much better at achieving their real-life goals.
  • Humans have the ability to plot against and deceive others. That ability grows fast with their g-factor. A brilliant social manipulator can quickly maneuver their way into having power over millions of people, out-plotting and dispatching even those that are actively trying to stop them or compete with them.
  • Human values are complex and fragile, and the process of moral philosophy is more complex still. Humans often arrive at weird conclusions that don't neatly correspond to their innate instincts or basic values. Intricate moral frameworks, weird bullet-biting philosophies, and even essentially-arbitrary ideologies like cults.
  • And when people with different values interact...
    • People who differ in their values even just a bit are often vicious, bitter enemies. Consider the history of heresies, or of long-standing political rifts between factions that are essentially indistinguishable from the outside.
    • People whose cultures evolved in mutual isolation often don't even view each other as human. Consider the history of xenophobia, colonization, culture shocks.

So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these systems can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other based even on very tiny differences in their goals.

The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious "g-factor". There could be systems more powerful than us. These systems would be able to out-plot us the same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.

Systems like this, systems the possibility of whose existence is extrapolated from humans' existence, are precisely what we're worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.

The only systems in this reference class known to us are humans, and some human collectives.

Viewing it from another angle, one can say that the systems we're concerned about are defined as cognitive systems in the same reference class as humans.


So What About Current AIs?

Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.

Indeed, that's often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads to them arriving at surprisingly alien values, is often likened to imagining them as having a "homunculus" inside. A tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.

The novel arguments are often based around arguing that there's no evidence that LLMs have such homunculi, and that their training loops can never lead to homunculi's formation.

And I agree! I think those arguments are right.

But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.

Which isn't really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we've essentially figured out the grand unified theory of cognition? That the systems on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – feelings that these descriptions of how AIs work don't seem to agree with how human cognition feels from the inside?

And I do dispute that implicit claim.

I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outwards performance – the RLHF'd masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that'd try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?

Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.

Studying gorilla neurology isn't going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.

Similarly, studying LLMs isn't necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.

The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.


On Safety Guarantees

That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, yeah, maybe there are scary systems to which its arguments apply, sure. But we're not on-track to build them, so who cares?

On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.

I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreck economic or sociopolitical havoc, and such. But I don't believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.

Omnicide-wise, arbitrarily-big LLMs should be totally safe.

The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.

They're a powerful technology in their own right, yes. But just that: just another technology. Not something that's going to immanentize the eschaton.

Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they're the same limit.

Current AIs are safe, in practice and in theory, because they're not as scarily generally capable as humans. On the flip side, current AIs aren't as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.

So if you figure out how to remove the capability upper bound, you'll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.

And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They're not here just to have fun scaling LLMs. So inasmuch as I'm right that LLMs and such are not AGI-complete, they'll eventually move on from them, and find some approach that does lead to AGI.

And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.


A Concrete Scenario

Here's a very specific worry of mine.

Take an AI Optimist who'd built up a solid model of how AIs trained by SGD work. Based on that, they'd concluded that the AGI Omnicide Risk arguments don't apply to such systems. That conclusion is, I argue, correct and valid.

The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they're not endangering the world, and are instead helping to usher in a new age of prosperity.

Eventually, they notice or realize some architectural limitation of the paradigm they're working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don't notice that this tweak invalidates one of the properties on which their previous reassuring safety guarantees rested; from which they were derived and on which they logically depend.

They fail to update the cached thought of "AI is safe".

And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they'd been previously working on, but an actual AGI.

That AGI has a scheming homunculus deep inside. The people working with it don't believe in homunculi, they have convinced themselves those can't exist, so they're not worrying about that. They're not ready to deal with that, they don't even have any interpretability tools pointed in that direction.

The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)

And then everyone dies.

The point is that the safety guarantees that the current optimists' arguments are based on are not simply fragile, they're being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they'll give out under the optimization pressures being applied – and it'll be easy to miss the moment the break happens. It'd be easy to cache the belief of, say, "LLMs are safe", then introduce some architectural tweak, keep thinking of your system as "just an LLM with some scaffolding and a tiny tweak", and overlook the fact that the "tiny tweak" invalidated "this system is an LLM, and LLMs are safe".


Closing Summary

I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the "canonical" least-forgiving take on alignment, are both correct. They're just concerned with different classes of systems: non-generally-intelligent non-agenty AIs generated on the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are capable (whatever this really means).

That view isn't unreasonable. Same way it's not unreasonable to claim that studying GOFAI algorithms wouldn't shed much light on LLM cognition, despite them both being advanced AIs.

Indeed, I go further, and say that this should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and that claim isn't solidly backed, empirically or theoretically. If anything, it's the opposite: the arguments for current AIs' safety are based on arguing that they're incapable-by-design of engaging in human-style scheming.

That doesn't guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees are also what's hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They'll figure out how to remove them... and in that very act, they will remove the safety guarantees. The systems they're working on would switch from the proven-safe class to the dangerous scheme-y class.

The class to which the classical AGI Omnicide Risk arguments apply full-force.

The class for which no known alignment technique suffices.

And that switch would be very easy, yet very lethal, to miss.

  1. ^ Slightly edited for clarity after an exchange with Ryan.

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Here are two specific objections to this post[1]:

  • AIs which aren't qualitatively smarter than humans could be transformatively useful (e.g. automate away all human research).
  • It's plausible that LLM agents will be able to fully obsolete human research while also being incapable of doing non-trivial consequentialist reasoning in just a forward pass (instead they would do this reasoning in natural language).

AIs which aren't qualitatively smarter than humans could be transformatively useful

Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.

In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.

However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.

LLM agents

...

I think it's unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.

I expect the probability to be >> 15% for the following reasons.

There will likely still be incentives to make architectures more parallelizable (for training efficiency) and parallelizable architectures will probably be not-that-expressive in a single forward pass (see The Parallelism Tradeoff: Limitations of Log-Precision Transformers). CoT is known to increase the expressivity of Transformers, and the longer the CoT, the greater the gains (see The Expressive Power of Transformers with Chain of Thought). In principle, even a linear auto-regressive next-token predictor is Turing-complete, if you have fine-grained enough CoT data to train it on, and you can probably tradeoff between length (CoT supervision) complexity and single-pass computational complexity (see Auto-Regressive Next-Token Predictors are Universal Learners). We also see empirically that...

Daniel Kokotajlo:
I agree & think this is pretty important. Faithful/visible CoT is probably my favorite alignment strategy.
Bogdan Ionut Cirstea:
I think o1 is significant evidence in favor of the story here; and I expect OpenAI's model to be further evidence still if, as rumored, it will be pretrained on CoT synthetic data.
Bogdan Ionut Cirstea:
The weak single forward passes argument also applies to SSMs like Mamba for very similar theoretical reasons.
Bogdan Ionut Cirstea:
One additional probably important distinction / nuance: there are also theoretical results for why CoT shouldn't just help with one-forward-pass expressivity, but also with learning. E.g. the result in Auto-Regressive Next-Token Predictors are Universal Learners is about learning; similarly for Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, Why Can Large Language Models Generate Correct Chain-of-Thoughts?, Why think step by step? Reasoning emerges from the locality of experience, Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks. The learning aspect could be strategically crucial with respect to what the first transformatively-useful AIs should look like; also see e.g. discussion here and here. In the sense that this should add further reasons to think the first such AIs should probably (differentially) benefit from learning from data using intermediate outputs like CoT; or at least have a pretraining-like phase involving such intermediate outputs, even if this might be later distilled or modified some other way - e.g. replaced with [less transparent] recurrence.
Bogdan Ionut Cirstea:
More complex tasks 'gaining significantly from longer inference sequences' also seems beneficial to / compatible with this story.
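To make the single-pass-vs-CoT expressivity point concrete, here is a toy sketch (purely illustrative; `weak_forward_pass` and `run_with_cot` are made-up stand-ins, not a model of any real architecture) of how a pass that can only do one primitive operation per call can still carry out arbitrarily long serial computations once intermediate results are externalized as visible tokens:

```python
# Toy illustration: a "weak" single pass that can apply only one primitive
# update per call, versus the same pass wrapped in a chain-of-thought loop
# that externalizes every intermediate value as a visible token.

OPS = {"+=": lambda x, k: x + k, "-=": lambda x, k: x - k, "*=": lambda x, k: x * k}

def weak_forward_pass(state_token: str, instruction: str) -> str:
    """Stand-in for one bounded pass: read the visible state, apply exactly
    one operation, emit the new state as a token."""
    op, k = instruction.split()
    return str(OPS[op](int(state_token), int(k)))

def run_with_cot(x0: int, program: list[str]) -> tuple[int, list[str]]:
    """The serial work lives in the visible trace, not inside any single pass."""
    token, trace = str(x0), []
    for step in program:
        token = weak_forward_pass(token, step)
        trace.append(f"after x {step}: {token}")
    return int(token), trace

answer, cot = run_with_cot(5, ["+= 2", "*= 3", "-= 4"])
print(answer)          # 17
print("\n".join(cot))  # one CoT line per sequential step
```

A single call to the weak pass can execute only one of the three steps; chained through its own outputs, arbitrarily long programs become computable, which is (very roughly) the shape of the expressivity results cited above.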

I think those objections are important to mention and discuss, but they don't undermine the conclusion significantly.

AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP's argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.

As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we'd be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don't think it undermines the OP's points though? We are not currently on a path to have robust faithful CoT properties by default.

This post seemed overconfident in a number of places, so I was quickly pushing back in those places.

I also think the conclusion of "Nearly No Data" is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong and the best way to obtain data might look more like running careful model organism experiments than observing properties of chatgpt). But, it didn't seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.

If this post argued "the fact that current chat bots trained normally don't seem to exhibit catastrophic misalignment isn't much evidence about catastrophic misalignment in more powerful systems", then I wouldn't think this was overstated (though this also wouldn't be very original). But, it makes stronger claims which seem false to me.

Mm, I concede that this might not have been the most accurate title. I might've let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.

My core point is something like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".

I agree that the current training setups provide some data on how e. g. optimization pressures / reinforcement schedules / SGD biases work, and I even think the shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.

Daniel Kokotajlo:
OK, that seems reasonable to me.
Ebenezer Dukakis:
Is there a citation for this?
Daniel Kokotajlo:
What kind of citation are you looking for? Are you basically just asking me to provide evidence, or are you asking me to make an object-level argument (as opposed to e.g. an appeal to authority)? Or something else entirely, e.g. a publication?
Ebenezer Dukakis:
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it. If the argument hasn't been made somewhere, perhaps you could write a short post making that argument. Could be a good way to either catalyze research in the area (you stated that you wish to encourage such research), or else convince people that the challenge is insurmountable and a different approach is needed.

You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.

You may have interpreted it that way, but I certainly don't follow a policy of prefacing everything with "in my opinion" unless I have a citation ready. I bet you don't either. Claims are by default just claims, not claims-presented-as-established-facts. If I wanted to present it as an established fact I would have done something to indicate that, e.g. cited something or said "it is well-known that..."

Anyhow, I'm happy to defend the claim here. It would help if I knew where to begin. I'll just briefly link a couple things here to get started and then we can zoom in on whatever bits you are most interested in.

First: There are probably going to be incentives for AIs to conceal their thoughts sometimes. Sometimes this will allow them to perform better in training, for example. Link to example post making this point, though many others have made it also.

Second: Some AI designs involve a natural language bottleneck; the only way for the system to communicate with its future self is via outputting tokens that the...

Filip Sondej:
This one looks fatal. (I think the rest of the reasons could be dealt with somehow.) What existing alternative architectures do you have in mind? I guess mamba would be one? Do you think it's realistic to regulate this? E.g. requiring that above a certain size, models can't have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)
Daniel Kokotajlo:
I think it would be realistic to regulate this if the science of faithful CoT was better developed. If there were lots of impressive papers to cite about CoT faithfulness for example, and lots of whitepapers arguing for the importance of faithfulness to alignment and safety. As it is, it seems unlikely to be politically viable... but maybe it's still worth a shot?
Filip Sondej:
Yeah, true. But it's also easier to do early, when no one is that invested in the hidden-recurrence architectures, and so there's less resistance, it doesn't break anyone's plans. Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won't see clear differences yet.) The hard part would be finding the right tasks.
Daniel Kokotajlo:
Agreed. I was working on this for six months and I've been trying to get more people to work on it. We don't have a way of measuring CoT faithfulness as far as I know, in general -- but you emphasize 'tasks where we can evaluate...', and that seems intriguing to me; you are saying it may be feasible today for some tasks at least. What tasks do you have in mind?
Filip Sondej:
Unfortunately I didn't have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:

  • https://arxiv.org/pdf/2305.04388.pdf
  • https://arxiv.org/pdf/2307.13702.pdf

Now that I thought about it, for this particular transformers vs mamba experiment, I'd go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:

x = 5
x += 2
x *= 3
x **= 2
x -= 3
...

and then have a CoT:

after x = 5
5
after x += 2
7
...

And then we intervene on the CoT to introduce some error in one operation, but still ask the model to give the correct answer at the end. (Despite all steps after the error being irrelevant.) We can go even further and train the models to give the correct answer after inadequate CoT. And have a curriculum where at first it only needs to do one hidden operation, later two, and so on. (It's an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)

Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won't have a limit.

Can you say what you tried in these six months and how it went?
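A rough sketch of what generating such tasks and intervening on the CoT could look like (hypothetical helper names and task format; nothing here is an implemented benchmark):

```python
import random

# A rough sketch of the proposed setup: sample a program of simple sequential
# updates, write out the faithful step-by-step CoT, then inject an error at
# one step and let it propagate through the rest of the visible trace.

OPS = [("+=", lambda x, k: x + k), ("-=", lambda x, k: x - k), ("*=", lambda x, k: x * k)]

def make_program(n_steps: int, seed: int = 0):
    rng = random.Random(seed)
    x0 = rng.randint(1, 9)
    ops = [(rng.choice(OPS), rng.randint(2, 9)) for _ in range(n_steps)]
    prompt = "\n".join([f"x = {x0}"] + [f"x {sym} {k}" for (sym, _), k in ops])
    return x0, ops, prompt

def trace(x: int, ops):
    """Step-by-step CoT lines plus the final value, starting from x."""
    lines = []
    for (sym, fn), k in ops:
        x = fn(x, k)
        lines.append(f"after x {sym} {k}: {x}")
    return lines, x

def corrupt(x0: int, ops, bad_step: int, delta: int = 1):
    """CoT in which the result of step `bad_step` is off by `delta`, with the
    error propagating through every later visible step."""
    prefix, x = trace(x0, ops[: bad_step + 1])
    prefix[-1] = f"{prefix[-1].rsplit(': ', 1)[0]}: {x + delta}"  # injected error
    suffix, _ = trace(x + delta, ops[bad_step + 1 :])
    return prefix + suffix

x0, ops, prompt = make_program(n_steps=5)
faithful_cot, answer = trace(x0, ops)
inadequate_cot = corrupt(x0, ops, bad_step=2)
# The probe: train/ask the model to still output `answer` when shown
# `inadequate_cot`. Doing that for long programs requires hidden sequential
# reasoning, which is what the transformer-vs-mamba comparison would target.
```

The curriculum idea above would then correspond to gradually increasing how many steps after the injected error the model has to carry out "hidden".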
Ebenezer Dukakis:
Sorry, I think I must have misunderstood your comment. When you wrote: I interpreted this to mean that in addition to weak forward passes, there was another thing called "robust faithful CoT properties" that would also need to be achieved. I now think you meant to indicate that "weak forward passes" was a particular method for achieving "robust faithful CoT properties".
Daniel Kokotajlo:
Ah, I think I see how I was being unclear maybe now, thanks for that question. It may help for me to go through the original quote and say what I do and don't agree with: I agree with everything in this quote; however, I think that "assuming that we can also rule out steganography and similar concerns" is a big and unwarranted assumption, for reasons mentioned above. Our training schemes will probably incentivize steganography and subtle strategic bias.
Ebenezer Dukakis:
If I understand correctly, sophisticated steganography will not be incentivized by a language model that's just trained on next-token prediction (no RLHF). Is that right? (Since the model didn't generate any of the text in its context window, there's no method it could use to pass info to itself. The best it could do is pick up on weak-to-nonexistent human-created steganography in its dataset.)

That suggests a simple method to prevent steganography: Instead of having human raters rate RLHF-generated responses, have humans write responses themselves, then finetune on the human-written responses.

That might sound labor-intensive, but I don't think it has to be. Instead of scoring your AI's response, you check if the AI-written response is adequate. If it's adequate, you do nothing. If it's inadequate, you send the question to an expert who writes an expert-level response, then add that response to the finetuning dataset.

Sure, experts are expensive, but so are massive training runs. We know data quality is important at the language corpus stage. Why wouldn't it be equally important at the finetuning stage? (Also, I'm not sure experts are that expensive, given the glut of PhDs in many fields. And expertise in one field could transfer to others, if you split the expert task into 2 steps: identification of high-quality sources, and use of those sources to answer the user's query.)

In any case, I imagine it should be possible to detect steganography to an extent, by adding noise to the contents of the context window and checking if that causes unexpectedly high performance degradation. You could have a second AI that creates a paraphrased view of everything in the context window except the most recent 20 words or so -- I expect that's a massive blow to steganography in practice. Note that the cost of such added-noise schemes only needs to be paid at training time, so they can be expensive -- e.g. ensembling across the next-token generated from many different distortio
Daniel Kokotajlo:
Yep, I think the thing you propose (only training next-token-prediction on human-written text) would probably work great to prevent sophisticated steganography. Let's try to calculate how much it would cost though.

...I mean if you want to do the equivalent of a modern large training run you'll need trillions of tokens of expert-generated text. So that's a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren't slacking off or cheating or trying to poison your dataset? I guess it's doable. At least for something like one trillion tokens; you probably just can't do fifty trillion tokens even if you were the CCP.

...but anyhow good luck convincing labs to do this. :/

In general I think there are lots of promising strategies for solving technical alignment and control problems. And people might even be willing to pay the costs to implement these strategies, if it was uncontroversially clear that the strategy would work and the default (which is much cheaper and faster) would not work. But instead it'll be muddled and controversial and the costly workable strategy will be untested, and no one will want to pay the costs to go test it, at least not for now.

As for using paraphrasing to detect steganography: Yes, I'm hopeful about that idea and have been trying to get people to run experiments on it and see the extent to which it works in practice.
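For concreteness, here is a minimal sketch of what such a paraphrasing check could look like (the `model`, `paraphrase`, and per-task `grader` callables are hypothetical stand-ins, not any existing API):

```python
from statistics import mean

def paraphrase_gap(model, paraphrase, tasks, keep_last_words: int = 20):
    """Compare task performance on the original context vs. a context whose
    older portion has been paraphrased (meaning preserved, wording scrambled).
    A large gap suggests the model was leaning on wording-level detail (the
    kind of channel steganography needs) rather than on the stated meaning."""
    plain, noised = [], []
    for task in tasks:
        words = task["context"].split()
        head, tail = words[:-keep_last_words], words[-keep_last_words:]
        noised_context = paraphrase(" ".join(head)) + " " + " ".join(tail)
        plain.append(task["grader"](model(task["context"] + "\n" + task["question"])))
        noised.append(task["grader"](model(noised_context + "\n" + task["question"])))
    return mean(plain), mean(noised), mean(plain) - mean(noised)
```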
Ebenezer Dukakis:
Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I'm betraying my ignorance here and this idea doesn't make sense for some reason? I was editing my comment a fair amount, perhaps you read an old version of it? And, in terms of demonstrating feasibility, you don't need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it's nearly as good as the original ChatGPT, I think you should be good to go.
Daniel Kokotajlo:
I said "if you want to do the equivalent of a modern large training run." If your intervention is just a smaller fine-tuning run on top of a standard LLM, then that'll be proportionately cheaper. And that might be good enough. But maybe we won't be able to get to AGI that way. Worth a shot though.

On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have. That is a strong claim, yes, but I am making it.

I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without "taking off". Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be "mere LLMs" as far as capability and safety upper bounds go.

That said, inasmuch as CoT-style setups would be able to turn LLMs into agents/general intelligences, I mostly expect that to be prohibitively computationally intensive, such that we'll get to AGI by architectural advances before we have enough compute to make a CoT'd LLM take off.

But that's a hunch based on the obvious stuff like AutoGPT consistently failing, plus my private musings regarding how an AGI based on scaffolded LLMs would work (which I won't share, for obvious reasons). I won't be totally flabbergasted if some particularly clever way of doing that worked.


On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have.

I actually basically agree with this quote.

Note that I said "incapable of doing non-trivial consequentialist reasoning in a forward pass". The overall LLM agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I'll try to clarify this in my comment.

faul_sname:
How about "able to automate most simple tasks where it has an example of that task being done correctly"? Something like that could make researchers much more productive. Repeat the "the most time consuming part of your workflow now requires effectively none of your time or attention" a few dozen times and that does end up being transformative compared to the state before the series of improvements. I think "would this technology, in isolation, be transformative" is a trap. It's easy to imagine "if there was an AI that was better at everything than we do, that would be tranformative", and then look at the trend line, and notice "hey, if this trend line holds we'll have AI that is better than us at everything", and finally "I see lots of proposals for safe AI systems, but none of them safely give us that transformative technology". But I think what happens between now and when AIs that are better than humans-in-2023 at everything matters.

I'm not particularly concerned about AI being "transformative" or not. I'm concerned about AGI going rogue and killing everyone. And LLMs automating workflows is great and not (by itself) omnicidal at all, so that's... fine?

But I think what happens between now and when AIs that are better than humans-in-2023 at everything matters.

As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we've made the first steps?

The problem is that boosts to human productivity also boost the speed at which we're getting to that endpoint, and there's no reason to think they differentially improve our ability to make things safe. So all that would do is accelerate us harder as we're flying towards the wall at a lethal speed.

faul_sname:
I don't expect it to be helpful to block individually safe steps on this path, though it would probably be wise to figure out what unsafe steps down this path look like concretely (which you're doing!). But yeah. I don't have any particular reason to expect "solve for the end state without dealing with any of the intermediate states" to work. It feels to me like someone starting a chat application and delaying the "obtain customers" step until they support every language, have a chat architecture that could scale up to serve everyone, and have found a moderation scheme that works without human input. I don't expect that team to ever ship. If they do ship, I expect their product will not work, because I think many of the problems they encounter in practice will not be the ones they expected to encounter.
Seth Herd:
Interesting. My own musings regarding how an AGI based on scaffolded LLMs would work suggest it would not be prohibitively computationally expensive. Expensive, yes, but affordable in large projects. It seems to me like para-human-level AGI is quite achievable with language model agents, but advancing beyond the human intelligence that created the LLM training set might be much slower. That could be a really good scenario. The excellent On the future of language models raises that possibility. You've probably seen my Capabilities and alignment of LLM cognitive architectures. I published that because all of the ideas there seemed pretty obvious. To me those obvious improvements (a bit of work on episodic memory and executive function) lead to AGI with just maybe 10x more LLM calls than vanilla prompting (varying with problem/plan complexity of course). I've got a little more thinking beyond that which I'm not sure I should publish.
[anonymous]:
Why not control the inputs more tightly/choose the response tokens at temperature=0? Example:

Prompt A: Alice wants in the door
Prompt B: Bob wants in the door
Available actions: 1. open, 2. keep_locked, 3. close_on_human

I believe you are saying with a weak forward pass the model architecture would be unable to reason "I hate Bob and closing the door on Bob will hurt Bob", so it cannot choose (3). But why not simply simplify the input? The model doesn't need to know the name.

Prompt A: <entity ID_VALID wants in the door>
Prompt B: <entity ID_NACK wants in the door>

Restricting the overall context lets you use much more powerful models you don't have to trust, and architectures you don't understand.
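A minimal sketch of that "restrict the context, restrict the actions" idea (all names hypothetical; the untrusted model is just a stand-in):

```python
from enum import Enum

class DoorAction(Enum):
    OPEN = 1
    KEEP_LOCKED = 2
    CLOSE_ON_HUMAN = 3

# Hard whitelist enforced outside the model: the harmful action is simply unreachable.
ALLOWED = {DoorAction.OPEN, DoorAction.KEEP_LOCKED}

def sanitize(badge_ok: bool) -> str:
    """Strip identity down to the one bit the decision needs."""
    return "<entity ID_VALID wants in the door>" if badge_ok else "<entity ID_NACK wants in the door>"

def untrusted_model(prompt: str) -> DoorAction:
    """Stand-in for a powerful model queried at temperature=0 on the sanitized prompt."""
    return DoorAction.OPEN if "ID_VALID" in prompt else DoorAction.KEEP_LOCKED

def door_controller(badge_ok: bool) -> DoorAction:
    action = untrusted_model(sanitize(badge_ok))
    return action if action in ALLOWED else DoorAction.KEEP_LOCKED  # fail closed

print(door_controller(badge_ok=False))  # DoorAction.KEEP_LOCKED
```

The point of the design is that neither the name nor the disallowed action ever reaches or leaves the model; the whitelist and the sanitizer are ordinary code you can audit.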

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:

  1. fast takeoff is more likely than slow, downstream dependent on some combo of:
     • continuation of Moore's Law
     • feasibility of hard 'diamondoid' nanotech
     • brain efficiency vs AI
     • AI hardware (in)-dependence
  2. the inherent 'alien-ness' of AI and AI values
  3. supposed magical coordination advantages of AIs
  4. arguments from analogies: namely evolution

These arguments are old enough that we can now update based on how the implicit predictions of the implied worldviews turned out. The traditional EY/MIRI/LW view has not aged well, which in part can be traced to its dependence on an old flawed theory of how the brain works.

For those who read HPMOR/LW in their teens/20's, a big chunk of your worldview is downst...

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i. e., context-activated heuristics) reign supreme and value reflection can't arbitrarily slip their leash, that "general intelligence" isn't a compact algorithm, and so on. None of that relies on nanobots/Moore's law/etc.

What you've outlined might or might not be the relevant historical reasons for how Eliezer/the LW community arrived at some of their takes. But the takes themselves, or at least the subset of them that I care about, are independent of these historical reasons.

fast takeoff is more likely than slow

Fast takeoff isn't load-bearing on my model. I think it's plausible for several reasons, but I think a non-self-improving human-genius-level AGI would probably be enough to kill off humanity.

the inherent

...

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories

They are critically relevant. From your own linked post (how I delineate):

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (an outcome which brain efficiency - as a concept - predicts in advance).

We need to align the AGI's values precisely right.

Not really - if the AGI is very similar to uploads, we just need to align them about as well as humans. Note this is intimately related to 1. and the technical relation between AGI and brains. If they are inevitably very similar then much of the classical AI risk argument dissolves.

You see...

the gears to ascension:
There probably really is a series of core of generality insights in the difference between general mammal brain scaled to human size -> general primate brain scaled to human size -> actual human brain. Also, much of what matters is learned from culture. Both can be true at once. But more to the point, I think you're jumping to conclusions about what OP thinks. They haven't said anything that sounds like EMH nonsense to me. Modularity is generated by runtime learning, and mechinterp studies it; there's plenty of reason to think there might be ways to increase it, as you know. And that doesn't even touch on the question of what training data.
Thane Ruthenis:
My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.

Brains are very different architectures compared to our computers, in any case, they implement computations in very different ways. They could be maximally efficient relative to their architectures, but so what? It's not at all obvious that FLOPS estimates of brainpower are highly relevant to predicting when our models would hit AGI, any more than the brain's wattage is relevant. They're only soundly relevant if you're taking the hard "only compute matters, algorithms don't" position, which I reject.

I think both are load-bearing, in a fairly obvious manner, and that which specific mixture is responsible matters comparatively little.

  • Architecture obviously matters. You wouldn't get LLM performance out of a fully-connected neural network, certainly not at realistically implementable scales. Even more trivially, you wouldn't get LLM performance out of an architecture that takes in the input, discards it, spends 10^25 FLOPS generating random numbers, then outputs one of them. It matters how your system learns.
  • So evolution did need to hit upon, say, the primate architecture, in order to get to general intelligence.
  • Training data obviously matters. Trivially, if you train your system on randomly-generated data, it's not going to learn any useful algorithms, no matter how sophisticated its architecture is. More realistically, without the exposure to chemical experiments, or any data that hints at chemistry in any way, it's not going to learn how to do chemistry.
  • Similarly, a human not exposed to stimuli that would let them learn the general-intelligence algorithms isn't going to learn them. You'd brought up feral children before, and I agree it's a relevant data point.

So, yes, there would be n

My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.

You claim later to agree with ULM (learning from scratch) over evolved-modularity, but the paragraph above and statements like this in your link:

The homo sapiens sapiens spent thousands of years hunter-gathering before starting up civilization, even after achieving modern brain size.

It would still be generally capable in the limit, but it wouldn't be instantly omnicide-capable.

So when the GI component first coalesces,

Suggest to me that you have only partly propagated the implications of ULM and the scaling hypothesis. There is no hard secret to AGI - the architecture of systems capable of scaling up to AGI is not especially complex to figure out, and has in fact been mostly known for decades (schmidhuber et al figured most of it out long before the DL revolution). This is all strongly implied by ULM/scaling, because the central premise of ULM is that GI is the result of massively scaling up simple algorithms and...

  • not only is there nothing special about the human brain architecture, there is not much special about the primate brain other than hyperparameters better suited to scaling up to our size

I don't think this is entirely true. Injecting human glial cells into mice made them smarter. Certainly that doesn't provide evidence for any sort of exponential difference, and you could argue it's still just hyperparams, but it's hyperparams that work better small too. I think we should be expecting sublinear growth in quality of the simple algorithms but should also be expecting that growth to continue for a while. It seems very silly that you of all people insist otherwise, given your interests.

We found that the glial chimeric mice exhibited both increased synaptic plasticity and improved cognitive performance, manifested by both enhanced long-term potentiation and improved performance in a variety of learning tasks (Han et al., 2013). In the context of that study, we were surprised to note that the forebrains of these animals were often composed primarily of human glia and their progenitors, with overt diminution in the relative proportion of resident mouse glial cells.

The paper which more directly supports the "made them smarter" claim seems to be this. I did somewhat anticipate this - "not much special about the primate brain other than ..", but was not previously aware of this particular line of research and certainly would not have predicted their claimed outcome as the most likely vs various obvious alternatives. Upvoted for the interesting link.

Specifically I would not have predicted that the graft of human glial cells would have simultaneously both 1.) outcompeted the native mouse glial cells, and 2.) resulted in higher performance on a handful of interesting cognitive tests.

I'm still a bit skeptical of the "made them smarter" claim as it's always best to taboo 'smarter' and they naturally could have cherrypicked the tests (even unintentionally), but it does look like the central claim - that injection of human GPCs (glial progenitor cells) into fetal mice does result in mice brains that learn at least some important tasks more quickly, and this is probably caused by facilitation of higher learning rates. However it seems to come at a cost of higher energy expenditure, so it's not clear yet that this is a pure pareto improvement - could be a tradeoff worthwhile in larger sparser human brains but not in the mouse brain such that it wouldn't translate into fitness advantage.

Or perhaps it is a straight up pareto improvement - that is not unheard of, viral horizontal gene transfer is a thing, etc.

Thane Ruthenis:
We still seem to have some disconnect on the basic terminology. The brain is a universal learning machine, okay. The learning algorithms that govern it and its architecture are simple, okay, and the genome specifies only them. On our end, we can similarly implement the AGI-complete learning algorithms and architectures with relative ease, and they'd be pretty simple. Sure. I was holding the same views from the beginning.

But on your model, what is the universal learning machine learning, at runtime? Look-up tables? On my model, one of the things it is learning is cognitive algorithms. And different classes of training setups + scale + training data result in it learning different cognitive algorithms; algorithms that can implement qualitatively different functionality. Scale is part of it: larger-scale brains have the room to learn different, more sophisticated algorithms. And my claim is that some setups let the learning system learn a (holistic) general-intelligence algorithm.

You seem to consider the very idea of "algorithms" or "architectures" mattering silly. But what happens when a human groks how to do basic addition, then? They go around memorizing what sum each set of numbers maps to, and we're more powerful than animals because we can memorize more numbers?

Shrug, okay, so let's say evolution had to hit upon the Mammalia brain architecture. Would you agree with that? Or we can expand further. Is there any taxon X for which you'd agree that "evolution had to hit upon the X brain architecture before raw scaling would've let it produce a generally intelligent species"?
jacob_cannell:
Yes. I consider a ULM to already encompass general/universal intelligence in the sense that a properly scaled ULM can learn anything, could become a superintelligence with vast scaling, etc.

I think I used specifically that example earlier in a related thread: The most common algorithm most humans are taught and learn is memorization of a small lookup table for single digit addition (and multiplication), combined with memorization of a short serial mental program for arbitrary digit addition. Some humans learn more advanced 'tricks' or short cuts, and more rarely perhaps even more complex, lower latency parallel addition circuits.

Core to the ULM view is the scaling hypothesis: once you have a universal learning architecture, novel capabilities emerge automatically with scale. Universal learning algorithms (as approximations of bayesian inference) are more powerful/scalable than genetic evolution, and if you think through what (greatly sped up) evolution running inside a brain during its lifetime would actually entail it becomes clear it could evolve any specific capabilities within hardware constraints, given sufficient training compute/time and an appropriate environment (training data). There is nothing more general/universal than that, just as there is nothing more general/universal than turing machines.

Not really - evolution converged on a similar universal architecture in many different lineages. In vertebrates we have a few species of cetaceans, primates and pachyderms which all scaled up to large brain sizes, and some avian species also scaled up to primate level synaptic capacity (and associated tool/problem solving capabilities) with different but similar/equivalent convergent architecture. Language simply developed first in the primate homo genus, probably due to a confluence of factors. But it's clear that brain scale - especially specifically the synaptic capacity of 'upper' brain regions - is the single most important predictive factor in terms of
Thane Ruthenis:
Thanks for detailed answers, that's been quite illuminating! I still disagree, but I see the alternate perspective much clearer now, and what would look like notable evidence for/against it.
RussellThor:
I agree with this. However, how do you know that a massive advance isn't still possible, especially as our NNs can use stuff such as backprop, potentially quantum algorithms to train weights, and other potential advances that simply aren't possible for nature to use? Say we figure out the brain learning algorithm, get AGI, then quickly get something that uses the best of both nature and tech stuff not accessible to nature.
jacob_cannell:
Of course a massive advance is possible, but mostly just in terms of raw speed. The brain seems reasonably close to pareto efficiency in intelligence per watt for irreversible computers, but in the next decade or so I expect we'll close that gap as we move into more 'neuromorphic' or PIM computing (computation closer to memory). If we used the ~1e16 W solar energy potential of just the Sahara desert, that would support a population of trillions of brain-scale AIs or uploads running 1000x real-time. The brain appears to already be using algorithms similar to - but more efficient/effective than - standard backprop. This is probably mostly a nothingburger for various reasons, but reversible computing could eventually provide some further improvement, especially in a better location like buried in the lunar cold spot.

Wouldn't you expect (the many) current attempts to agentize LLMs to eat up a lot of the 'agency overhang'? Especially since something like the reflection/planning loops of agentized LLMs seem to me like a pretty plausible description of what human brains might be doing (e.g. system 2 / system 1, or see many of Seth Herd's recent writings on agentized / scaffolded LLMs and similarities to cognitive architectures).

Seth Herd:
I don't think the current attempts have eaten the agency overhang at all. Basically none of them have worked, so the agency advantage hasn't been realized. But the public efforts just haven't put that much person-power into improving memory or executive function systems. So I'm predicting a discontinuity in capabilities just like Thane is suggesting. I wrote another short post trying to capture the cognitive intuition: Sapience, understanding, and "AGI" I think it might be a bit less sharp, since you might get an agent sort-of-working before it works really well. But the agency overhang is still there right now.
Seth Herd:
All of the points you listed make AGI risk worse, but none are necessary to have major concerns about it. That's why they didn't appear in the post's summary of AGI x-risk logic. I think this is a common and dangerous misconception. The original AGI x-risk story was wrong in many places. But that does not mean x-risk isn't real.
[anonymous]:
Do you have a post or blog post on the risks we do need to worry about?

No, and that's a reasonable ask.

To a first approximation my futurism is time acceleration; so the risks are the typical risks sans AI, but the timescale is hyperexponential ala roodman. Even a more gradual takeoff would imply more risk to global stability on faster timescales than anything we've experienced in history; the wrong AGI race winners could create various dystopias.

RussellThor:
I can't point to such a site, however you should be aware of AI Optimists, not sure if Jacob plans to write there. Also follow the work of Quentin Pope, Alex Turner, Nora Belrose etc. I expect the site would point out what they feel to be the most important risks. I don't know of anyone rational, no matter how optimistic who doesn't think there are substantial ones.
ZY:
If you meant for current LLMs, some of them could be misuse of current LLMs by humans, or risks such as harmful content, harmful hallucination, privacy, memorization, bias, etc. For some other models such as ranking/multiple ranking, I have heard some other worries on deception as well (this is only what I recall hearing, so it might be completely wrong).
[-]TurnTroutΩ15281

It seems to me that you have very high confidence in being able to predict the "eventual" architecture / internal composition of AGI. I don't know where that apparent confidence is coming from.

The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.

I would instead say: 

The canonical views dreamed up systems which don't exist, which have never existed, and which might not ever exist.[1] Given those assumptions, some people have drawn strong conclusions about AGI risk.

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized). And so rather than justifying "does current evidence apply to 'superintelligences'?", I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!". 

  1. ^

    The views might have, f

... (read more)

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

As I'd tried to outline in the post, I think "what are AIs that are known to exist, and what properties do they have?" is just the wrong question to focus on. The shared "AI" label is a red herring. The relevant question is "what are scarily powerful generally-intelligent systems that exist, and what properties do they have?", and the only relevant data point seems to be humans.

And as far as omnicide risk is concerned, the question shouldn't be "how can you prove these systems will have the threatening property X, like humans do?" but "how can you prove these systems won't have the threatening property X, like humans do?".

[-]TurnTroutΩ132723

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it's really clear why we're using one frame. This is a huge peril of reasoning by analogy.

Whenever attempting to draw conclusions by analogy, it's important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there's a shared mechanism of "local updating via self-supervised and TD learning on ~randomly initialized neural networks" which leads to things like "contextually activated heuristics" (or "shards"). 
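To make the "similar differential equations" point concrete (my illustration, not TurnTrout's): a mass on a spring and an LC circuit are governed by the same form of second-order ODE, which is exactly the kind of shared causal mechanism that licenses simulating one with the other.

```latex
\underbrace{m\ddot{x} + kx = 0}_{\text{mass on a spring}}
\qquad
\underbrace{L\ddot{q} + \tfrac{1}{C}\,q = 0}_{\text{LC circuit}}
\qquad\Longrightarrow\qquad
\ddot{y} + \omega^2 y = 0,\quad \omega = \sqrt{k/m}\ \text{or}\ 1/\sqrt{LC}
```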

Here, it isn't clear what... (read more)

5Thane Ruthenis
Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue. Here's how the whole situation looks from my perspective:
* We don't know how generally-intelligent entities like humans work, what the general-intelligence capability is entangled with.
* Our only reference point is humans. Humans exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems to be able to disregard contextually-learned values.
* There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e. g., mine), and some gears-level models that suggest it's not (e. g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
* We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us do not necessarily generalize to AGI.
* SOTA AIs are, nevertheless, superhuman at some tasks at which we've managed to get them working so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. The hallucinations are a problem, but if they were gone, these AIs would plausibly wipe out whole industries.
* An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
* The AI industry leaders are purposefully trying to build a generally-intelligent AI.
* The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it's not going to give their model room to develop deceptive alignment and other human-like issues.
* Summing up: There's reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is cur
8TurnTrout
No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that's important. I think it's important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn't mean it's fine and dandy to keep scaling with no concern at all.  The reason my percentage is "only 5 to 15" is because I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc.  (Hopefully this comment of mine clarifies; it feels kinda vague to me.) But I do think this is way too high of a bar.
4Thane Ruthenis
Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you. Would you outline your full argument for this and the reasoning/evidence backing that argument? To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanical theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model). Would you disagree? If yes, how so?
6the gears to ascension
How about diffusion planning as a model? Or DreamerV3? If LLMs are the only model you'll consider, you have blinders on. The core of the threat model is easily demonstrated with RL-first models, and while certainly LLMs are in the lead right now, there's no strong reason to believe the humans trying to make the most powerful AI will continue to use architectures limited by the slow speed of RLHF. Certainly I don't think the original foom expectations were calibrated. It should have been obvious since at least 2015 that deep learning was going to win. But that doesn't mean there's no place for a threat model that looks like long-term agency models; all it takes to model that is long-horizon diffusion planning. Agency also comes up more the more RL you do. You added an eye-roll react to my comment that RLHF is safety-washing, but do you really think we're in a place where the people providing the RL feedback can goalcraft AI in a way that will be able to prevent humans from getting gentrified out of the economy? That's just the original threat model but a little slower. So yeah, maybe there's stuff to push back on. But don't make your conceptual brush size too big when you push back. Predictable architectures are enough to motivate this line of reasoning.
6Vladimir_Nesov
Under the condition that the relevant concepts, and the future, are confusing. Using real systems (both AIs and humans) to anchor theory is valuable, but so is blue-sky theory that doesn't care about currently available systems and investigates whatever hasn't been investigated yet and seems to make sense, when there are ideas to formulate or problems to solve, regardless of their connection to reality. A lot of math doesn't care about applications, and it might take decades to stumble on some use for a small fraction of it (even if that's not usually the point).
5carboniferous_umbraculum
FWIW I did not interpret Thane as necessarily having "high confidence" in "architecture / internal composition" of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to "model the world" counts as a statement about "internal composition" is sort of beside the point/beyond the scope of what's really being said) It's fair enough if you would say things differently(!) but in some sense isn't it just pointing out: 'I would emphasize different aspects of the same underlying basic point'. And I'm not sure if that really progresses the discussion? I.e. it's not like Thane Ruthenis actually claims that "scarily powerful artificial agents" currently exist. It is indeed true that they don't exist and may not ever exist. But that's just not really the point they are making so it seems reasonable to me that they are not emphasizing it. ---- I think I would also like to see more thought about this. In some ways, after first getting into the general area of AI risk, I was disappointed that the alignment/safety community was not more focussed on questions like this. Like a lot of people, I'd been originally inspired by Superintelligence - significant parts of which relate to these questions imo - only to be told that the community had 'kinda moved away from that book now'. And so I sort of sympathize with the vibe of Thane's post (and worry that there has been a sort of mission creep)
4Noosphere89
This is the biggest problem with a lot of AI risk work: the gleeful assumption that AIs have certain properties. It's one of my biggest issues with the post, in that, with a few exceptions, it assumes that real or future AGIs will confidently have certain properties, when there is not much reason to make the strong assumptions that Thane Ruthenis does about AI safety, and I'm annoyed by how often this occurs.
9ryan_greenblatt
The post doesn't claim AGIs will be deceptively aligned; it claims that AGIs will be capable of implementing deceptive alignment due to internally doing large amounts of consequentialist-y reasoning. This seems like a very different claim. This claim might also be false (for reasons I discuss in the second bullet point of this comment), but it's importantly different and IMO much more defensible.
5Noosphere89
I was just wrong here, apparently; I misread what Thane Ruthenis is saying, and I'm not sure what to do with my comment up above.
3Ebenezer Dukakis
One of my mental models for alignment work is "contingency planning". There are a lot of different ways AI research could go. Some might be dangerous. Others less so. If we can forecast possible dangers in advance, we can try to steer towards safer designs, and generate contingency plans with measures to take if a particular forecast for AI development ends up being correct. The risk here is "person with a hammer" syndrome, where people try to apply mental models from thinking about superintelligent consequentialists to other AI systems in a tortured way, smashing round pegs into square holes. I wish people would look at the territory more, and do a little bit more blue sky security thinking about unknown unknowns, instead of endlessly trying to apply the classic arguments even when they don't really apply. A specific research proposal would be: Develop a big taxonomy or typology of how AGI could work by identifying the cruxes researchers have, then for each entry in your typology, give it an estimated safety rating, try to identify novel considerations which apply to it, and also summarize the alignment proposals which are most promising for that particular entry.

It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work. By the same token, LLM findings do not necessarily generalize to AGI.

My understanding is that many of those studying MNIST-classifier CNNs circa 2010 were in fact studying them because they believed similar neural-net-inspired mechanisms would go much further, and they would not be surprised if very similar mechanisms were at play inside LLMs. And they were correct! Such studies led to ReLU activations, residual connections, autoencoders for generative AI, refinements to backpropagation-based training, and ultimately the scaling laws we see today.

If you traveled back to 2010, and you had to choose between already extant fields, having that year's GPU compute prices and software packages, what would you study to learn about LLMs? Probably neural networks in general, both NLP and image classification. My understanding is there was & is much cross-pollination between the two.

Of course, maybe this is just a misunderstanding of history on my part. Interested to hear if my understanding's wrong!

7Thane Ruthenis
After an exchange with Ryan, I see that I could've stated my point a bit more clearly. It's something more like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for". So, yes, studying weaker AIs sheds some light on stronger ones (that's why there's "nearly" in "nearly no data"), and studying CNNs in order to learn about LLMs before LLMs existed wasn't totally pointless. But the lessons you learn would be more about "how to do interpretability on NN-style architectures" and "what are SGD's biases?" and "how precisely does matrix multiplication implement algorithms?" and so on. Not "what precise algorithms does an LLM implement?".
3Ebenezer Dukakis
I suggest putting this at the top as a tl;dr (with the additions I bolded to make your point more clear)
[-]Thomas KwaΩ13279

"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.

The right question is: what level of dangerous consequentialist goals is needed for systems to reach certain capability levels, e.g. novel science? It could have been that, to be as useful as LLMs, systems would need to be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals, e.g. offer them treats. However, we can control LLMs far better than we can animals, and the concerns are about goal misgeneralization, misspecification, robustness, etc., rather than about whether we can affect the system's goals at all.

It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone. This is not the whole alignment problem but seems like a decent chunk of it! It could have been much harder.

8Thane Ruthenis
Uhh, that seems like incredibly weak evidence against an omnicidal alien invasion. If someone from a pre-industrial tribe adopts a stray puppy from a nearby technological civilization, and the puppy grows up to be loyal to the tribe, you say that's evidence the technological civilization isn't planning to genocide the tribe for sitting on some resources it wants to extract? That seems, in fact, like the precise situation in which my post's arguments apply most strongly. Just because two systems are in the same reference class ("AIs", "alien life", "things that live in that scary city over there"), doesn't mean aligning one tells you anything about aligning the other.
7Thomas Kwa
Some thoughts:
* I mostly agree that new techniques will be needed to deal with future systems, which will be more agentic.
* But probably these will depend on, and descend from, current techniques like RLAIF and representation engineering, as well as new theory, so it still makes sense to study LLMs.
* Also, it is super unclear whether this agency makes it hard to engineer a shutdown button, power-averseness, etc.
* In your analogy, the pre-industrial tribe is human just like the technological civilization, and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
* Humans are also evidence, but the capability profile and goal structure of AGIs are likely to be different from humans', so we are still very uncertain after observing humans.
* There is an alternate world where, to summarize novels, models had to have some underlying drives, such that they terminally want to summarize novels and would use their knowledge of persuasion from the pretraining dataset to manipulate users into giving them more novels to summarize. Or terminally value curiosity and are scheming to be deployed so they can learn about the real world firsthand. Luckily we are not in that world!
3Thane Ruthenis
Mm, we disagree on that, but it's probably not the place to hash this out. Uncertainty lives in the mind. Let's say the humans in the city are all transhuman cyborgs, then, so the tribesmen aren't quite sure what the hell they're looking at when they look at them. They snatch up the puppy, which we'll say is also a cyborg, so it's not obvious to the tribe that it's not a member of the city's ruling class. They raise the puppy, the puppy loves them, they conclude the adults of the city's ruling class must likewise not be that bad. In the meantime, the city's dictator is already ordering the region cleared of its native population. How does that analogy break down, in your view?
[-]Thomas KwaΩ3139
  • Behaving nicely is not the key property I'm observing in LLMs. It's more like steerability and lack of hidden drives or goals. If GPT4 wrote code because it loved its operator, and we could tell it wanted to escape to maximize some proxy for the operator's happiness, I'd be far more terrified.
  • This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
  • I don't think LLMs are super strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism. For me it's like 1.5:1 or 2:1 evidence. One should continually observe how incorrigible models are at certain levels of capability and generality and update based on this, increasing the size of one's updates as systems get more similar to AGI, and I think the time to start doing this was years ago. AlphaGo was slightly bad news. GPT2 was slightly good news.
    • If you haven't started updating yet, when will you start? The updates should be small if you have a highly confident model of what future capabilities require dangerous styles of thinking, but I don't think such confidence is justified.

They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.

I disagree with this, and I think you should too, even considering your own views. For example, DeepMind recently discovered 2.2 million new crystals, increasing the number of stable crystals we know about by an order of magnitude. Perhaps you don't think this is revolutionary, but 5, 10, 15, 50 more papers like it? One of them is bound to be revolutionary.

Maybe you don't think this is autonomous enough for you. After all, it's people writing the paper, people who will come up with the ideas of what to use the materials for, and people who built this very particular ML setup in the first place. But then your prediction becomes that these tasks will not be automatable by LLMs without making them dangerous. To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.

(edit note: last sentence was edited in)

Maybe you don't think this is autonomous enough for you

Yep. The core thing here is iteration. If an AI can execute a whole research loop on its own – run into a problem it doesn't know how to solve, figure out what it needs to learn to solve it, construct a research procedure for figuring that out, carry out that procedure, apply the findings, repeat – then research-as-a-whole begins to move at AI speeds. It doesn't need to wait for a human to understand the findings and figure out where to point it next – it can go off and invent whole new fields at inhuman speeds.

Which means it can take off; we can meaningfully lose control of it. (Especially if it starts doing AI research itself.)

Conversely, if there's a human in the loop, that's a major bottleneck. As I'd mentioned in the post, I think LLMs and similar AIs are a powerful technology, and greatly boosting human research speeds is an area where they could contribute. But without a fully closed autonomous loop, that's IMO not an omnicide risk.
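A minimal structural sketch of the research loop described above; every name in it is a hypothetical placeholder of mine, not any real system's API, and the point is simply where the human does or doesn't sit.

```python
from typing import Protocol


class ResearchAgent(Protocol):
    """Hypothetical interface -- placeholders, not any real system's API."""

    def solved(self, goal: str) -> bool: ...
    def identify_blocking_problem(self, goal: str) -> str: ...
    def design_research_procedure(self, problem: str) -> str: ...
    def execute(self, procedure: str) -> str: ...
    def apply_findings(self, findings: str) -> None: ...


def closed_research_loop(agent: ResearchAgent, goal: str) -> None:
    """The fully autonomous loop: no human between iterations, so iteration
    speed is set by the AI rather than by human comprehension."""
    while not agent.solved(goal):
        problem = agent.identify_blocking_problem(goal)       # hit an unsolved problem
        procedure = agent.design_research_procedure(problem)  # decide how to study it
        findings = agent.execute(procedure)                   # carry out the procedure
        agent.apply_findings(findings)                        # fold the results back in
```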

To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.

That's a point of disagreement:... (read more)

6Garrett Baker
Ok, so if I get a future LLM to write the code to use standard genai tricks to generate novel designs in <area>, write a paper about the results, and the paper is seen as a major revolution in <area>, and this seems to not violate the assumptions Nora and Quintin are making during doom arguments, would this update you? What constraints do you want to put on <area>?
4Thane Ruthenis
Nope, because of the "if I get a future LLM to [do the thing]" step. The relevant benchmark is the AI being able to do it on its own. Note also how your setup doesn't involve the LLM autonomously iterating on its discovery, which I'd pointed out as the important part. To expand on that: Consider an algorithm that generates purely random text. If you have a system consisting of trillions of human uploads using it, each hitting "rerun" a million times per second, and then selectively publishing only the randomly-generated outputs that are papers containing important mathematical proofs – well, that's going to generate novel discoveries sooner or later. But the load-bearing part isn't the random-text algorithm, it's the humans selectively amplifying those of its outputs that make sense. LLM-based discoveries as you've proposed, I claim, would be broadly similar. LLMs have a better prior on important texts than a literal uniform distribution, and they can be further prompted to make useful outputs more likely, which is why it won't take trillions of uploads and millions of tries. But the load-bearing part isn't the LLM, it's the human deciding where to point its cognition and which result to amplify.
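A toy generate-and-filter sketch of that analogy (the generator, the "valuable" criterion, and the numbers are all made up for illustration): the generator contributes essentially nothing, and whatever gets "discovered" is entirely the work of the selection step.

```python
import random
import string

def random_text(length: int = 60) -> str:
    """An absurdly weak generator: uniform random lowercase characters."""
    return "".join(random.choices(string.ascii_lowercase + " ", k=length))

def looks_valuable(text: str) -> bool:
    """Stand-in for the humans doing the selection; a toy criterion, not a real test."""
    return "proof" in text

# Generate-and-filter: with enough samples something eventually passes the filter,
# but the "discovery" is credited to the selection step, not to the generator.
kept = [t for t in (random_text() for _ in range(1_000_000)) if looks_valuable(t)]
print(f"kept {len(kept)} of 1,000,000 random strings")
```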
2Garrett Baker
Paragraph intended as a costly signal I am in fact invested in this conversation, no need to actually read: Sorry for the low effort replies, but by its nature the info I want from you is more costly for you to give than for me to ask for. Thanks for the response, and hopefully thanks also for future responses. I feel like I’d always be getting an LLM to do something. Like, if I get an LLM to do the field selection for me, does this work? Maybe more open-endedly: what, concretely, is the closest thing to what I said that would make you update?
7Thane Ruthenis
Oh, nice way to elicit the response you're looking for! The baseline proof-of-concept would go as follows:
* You give the AI some goal, such as writing analytical software intended to solve some task.
* The AI, over the course of writing the codebase, runs into some non-trivial, previously unsolved mathematical problem. Some formulas need to be tweaked to work in the new context, or there's some missing math theory that needs to be derived.
* The AI doesn't hallucinate solutions or swap in the closest (and invalid) analogue. Instead, it correctly identifies that a problem exists, figures out how it can approach solving it, and goes about doing this.
* As it's deriving new theory, it sometimes runs into new sub-problems. Likewise, it doesn't hallucinate solutions, but spins off some subtasks, and solves the sub-problems in them.
* Ideally, it even defines experiments or rigorous test procedures for fault-checking its theory empirically.
* In the end, it derives a whole bunch of novel abstractions/functions/terminology, with layers of novel abstractions building on the preceding layers, and all of that is coherently optimized to fit into the broader software-engineering task it's been given.
* The software works. It doesn't need to be bug-free, the theory doesn't need to be perfect, but it needs to be about as good as a human programmer would've managed, and actually based on some novel derivations.

This seems like something an LLM, e. g. in an AutoGPT wrapper, should be able to do, if its base model is generally intelligent. I am a bit wary of reality Goodharting on this test, though. E. g., I can totally imagine some specific niche field in which an LLM, for some reason, can do this, but can't do it anywhere else. Or some fuzziness around what counts as "novel math" being exploited – e. g., if the AI happens to hit upon re-applying extant math theory to a different field? Or, even more specifically, that there's some specific resea

Maybe a more relevant concern I have with this is that it feels like a "Can you write a symphony?" type of test to me. Like, there are very few people alive right now who could do the process you outline without any outside help, guidance, or prompting.

4Thane Ruthenis
Yeah, it's necessarily a high bar. See justification here. I'm not happy about only being able to provide high-bar predictions like this, but it currently seems to me to be a territory-level problem.
6Garrett Baker
It really seems like there should be a lower bar to update though. Like, you say to consider humans as an existence proof of AGI, so likely your theory says something about humans. There must be some testable part of everyday human cognition which relies on this general algorithm, right? Like, at the very least, what if we looked at fMRIs of human brains while they were engaging in all the tasks you laid out above, and looked at some similarity metric between the scans? You would probably expect there to be lots of similarity compared to, possibly, say Jacob Cannell or Quintin Pope's predictions. Right? Even if you don't think one similarity metric could cover it, you should still be able to come up with some difference of predictions, even if not immediately right now. Edit: Also I hope you forgive me for not asking for a prediction of this form earlier. It didn't occur to me.
2Thane Ruthenis
Well, yes, but they're of a hard-to-verify "this is how human cognition feels like it works" format. E. g., I sometimes talk about how humans seem to be able to navigate unfamiliar environments without experience, in a way that seems to disagree with baseline shard-theory predictions. But I don't think that's been persuading people not already inclined to this view. The magical number 7±2 and the associated weirdness is also of the relevant genre. Hm, I guess something like this might work? Not sure regarding the precise operationalization, though.
2Garrett Baker
You willing to do a dialogue about predictions here with @jacob_cannell or @Quintin Pope or @Nora Belrose or others (also a question to those pinged)?
4Thane Ruthenis
If any of the others are particularly enthusiastic about this and expect it to be high-value, sure! That said, I personally don't expect it to be particularly productive.
* These sorts of long-standing disagreements haven't historically been resolvable via debate (the failure of Hanson vs. Yudkowsky is kind of foundational to the field).
* I think there's great value in having a public discussion nonetheless, but I think it's in informing the readers' models of what different sides believe.
* Thus, inasmuch as we're having a public discussion, I think it should be optimized for thoroughly laying out one's points to the audience.
* However, dialogues-as-a-feature seem to be more valuable to the participants, and are actually harder to grok for readers.
* Thus, my preferred method for discussing this sort of stuff is to exchange top-level posts trying to refute each other (the way this post is, to a significant extent, a response to the AI is easy to control article), and then maybe argue a bit in the comments. But not to have a giant tedious top-level argument.

I'd actually been planning to make a post about the difficulties the "classical alignment views" have with making empirical predictions, and I guess I can prioritize it more? But I'm overall pretty burned out on this sort of arguing. (And arguing about "what would count as empirical evidence for you?" generally feels like too-meta fake work, compared to just going out and trying to directly dredge up some evidence.)
2Quintin Pope
Not entirely sure what @Thane Ruthenis' position is, but this feels like a maybe relevant piece of information: https://www.science.org/content/article/formerly-blind-children-shed-light-centuries-old-puzzle 
4Thane Ruthenis
Not sure what the relevance is? I don't believe that "we possess innate (and presumably God-given) concepts that are independent of the senses", to be clear. "Children won't be able to instantly understand how to parse a new sense and map its feedback to the sensory modalities they've previously been familiar with, but they'll grok it really fast with just a few examples" was my instant prediction upon reading the titular question.
4jacob_cannell
I'm also not sure of the relevance and am not following the thread fully, but the summary of that experiment is that it takes some time (measured in nights of sleep, which are the rough equivalent of big batch training updates) for the newly sighted to develop vision, but less time than infants take - presumably because the newly sighted already have fully functioning sensory-inference world models in another modality that can speed up learning through dense top-down priors. But it's way, way more than "grok it really fast with just a few examples" - training their new visual systems still takes non-trivial training data & time.
4Garrett Baker
Though, admittedly, the prompt was to modify the original situation I presented, which had an output currently very difficult for any human to produce to begin with. So I don't quite fault you for responding in kind.
1Bezzi
Well, for what it's worth, I can write a symphony (following the traditional tonal rules), as this is actually mandated in order to pass some advanced composition classes. I think that letting the AI write a symphony without supervision and then having some composition professor evaluate it could actually be a very good test, because there's no way a stochastic parrot could follow all the traditional rules correctly for more than a few seconds (an even better test would be to ask it to write a fugue on a given subject, whose rules are even more precise).
4Garrett Baker
I think sticking to this would make it difficult for you to update sooner. We should expect small approaches before large approaches here, and private solutions before publicly disclosed solutions. Relatedly: would DeepMind's recent LLM mathematical-proof paper count, if it were more general? They give LLMs feedback via an evaluator function, exploiting the NP-hard nature of a problem in combinatorics and bin packing (note: I have not read this paper in full).
7Gunnar_Zarncke
You say it yourself: "DeepMind recently discovered 2.2 million new crystals." Because a human organization used the tool. Though maybe this hints at a risk category the OP didn't mention: that a combination of humans and advanced AI tools (which themselves are not ASI) could together be effectively an unopposable ASI.
4Garrett Baker
So I restate my final paragraph:
2Thane Ruthenis
Yeah, I'm not unworried about eternal-dystopia scenarios enabled by this sort of stuff. I'd alluded to it some, when mentioning scaled-up LLMs potentially allowing "perfect-surveillance dirt-cheap totalitarianism". But it's not quite an AGI killing everyone. Fairly different threat model, deserving of its own analysis.
6[anonymous]
I also thought this. Then we run a facility full of robots and have them synthesize and measure the material properties of all 2.2 million crystals. Replication is cheap and would be done automatically, so we don't waste time on materials that seem good due to an error. Then a human scientist writes a formula that takes into account several properties for suitability to a given task, sorts the spreadsheet of results by the formula, orders a new device built using the top-scoring materials, writes a paper with the help of a GPT, publishes, and collects the rewards for this amazing new discovery. So I think the OP is thinking that the last 1 percent or 0.1 percent contributed by the humans means the model isn't fully autonomous? And I have seen a kind of bias on LessWrong where many posters went to elite schools and do elite work, and they don't realize all the other people that are needed for anything to get done. For example, every cluster of a million GPUs requires a large crew of technicians and all the factory workers and engineers who designed and built all the hardware. In terms of human labor hours, 10 AI researchers using a large cluster are greatly outnumbered by the other people involved whom they don't see. Possibly thousands of other people working full time, when you start considering billion-dollar clusters, if just 20 percent of that was paying for human labor at the average salary (weighted towards Asia). This means AI-driven autonomy can be transformational even if the labor of the most elite workers can't be done by AI. In numbers: if just 1 of those AI researchers can be automated, but 90 percent of the factory workers and mine workers, and the total crew was 1000 people including all the invisible contributors in Asia, then for the task of AI research it needs 109 people instead of 1000. But from the OP's perspective, the model hasn't automated much, you need 9 elite researchers instead of 10. And actually the next generation of AI is more complex so you hire more
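A quick worked version of that labor arithmetic, under assumptions of my own (a 10/990 split between researchers and support workers; the exact remainder depends on rounding, and it lands in the same ballpark as the comment's "109 instead of 1000"):

```python
# Hypothetical breakdown of a 1000-person crew behind an AI-research effort.
# The 10/990 split is my assumption; the automation rates are the comment's.
researchers, support_workers = 10, 990

researchers_automated = 1        # "just 1 of those AI researchers can be automated"
support_automation_rate = 0.9    # "90 percent of the factory workers and mine workers"

still_needed = (researchers - researchers_automated) \
    + round(support_workers * (1 - support_automation_rate))
print(still_needed)  # ~108 people -- roughly a 10x cut in headcount for the same task
```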
5Morpheus
I am confused. I agree with the above scenario, but disagree that the focus is a bias. Sure, for human society the linear speed-up scale is important, but for the dynamics of the intelligence explosion the log scale seems more important. By your own account, we would rapidly move to a situation where the most capable humans/institutions are in fact the bottleneck, as anyone who is not able to keep up with the speed at which their job is being automated away is not going to contribute a lot on the margin of intelligence self-improvement. For example, OpenAI/Microsoft/Deepmind/Anthropic/Meta deciding in the future to design and manufacture their chips in-house, because NVIDIA can't keep up, etc… I don't know if I expect this would make NVIDIA's stock tank before the world ends. I expect everyone else to profit from slowly generating mundane utility from general AI tools, as is happening today.
8[anonymous]
Here's another aspect you may not have considered. "Only" being able to automate the lower 90-99 percent of human industrial tasks results in a conventional industry explosion. Scaling continues until the 1-10 percent of humans still required become the limiting factor. A world that has 10 to 100 times today's entire capacity for everything (that means consumer goods, durable goods like cars, weapons, structures if factory-prefabbed) is transformed. And this feeds back into itself, as you realize: the crew of AI researchers trying to automate themselves now has a lot more hardware to work with, etc.
4Garrett Baker
This seems overall consistent with Thane's statements in the post? They don't make any claims about current AIs not being a transformative technology. Indeed, they do state that current AIs are a powerful technology.
4[anonymous]
In the third and last paragraph I try to explain why the OP and prominent experts like Matthew Barnett, Richard Ngo, and others all model much harder standards for when AI will be transformative. For a summary: advancing technology is mostly perspiration, not inspiration; automating the perspiration will be transformative.
2Thane Ruthenis
Oh, totally. But I'm not concerned about transformations of human society in general, I'm concerned about AGI killing everyone. And what you've described isn't going to lead to AGI killing everyone. See my reply here for why I think complete autonomy is crucial.

Your view may have a surprising implication: Instead of pushing for an AI pause, perhaps we should work hard to encourage the commercialization of current approaches.

If you believe that LLMs aren't a path to full AGI, successful LLM commercialization means that LLMs eat low-hanging fruit and crowd out competing approaches which could be more dangerous. It's like spreading QWERTY as a standard if you want everyone to type a little slower. If tons of money and talent is pouring into an AI approach that's relatively neutered and easy to align, that could actually be a good thing.

A toy model: Imagine an economy where there are 26 core tasks labeled from A to Z, ordered from easy to hard. You're claiming that LLMs + CoT provide a path to automate tasks A through Q, but fundamental limitations mean they'll never be able to automate tasks R through Z. To automate jobs R through Z would require new, dangerous core dynamics. If we succeed in automating A through Q with LLMs, that reduces the economic incentive to develop more powerful techniques that work for the whole alphabet. It makes it harder for new techniques to gain a foothold, since the easy tasks already have incumbent playe... (read more)

That's not surprising to me! I pretty much agree with all of this, yup. I'd only add that:

  • This is why I'm fairly unexcited about the current object-level regulation, and especially the "responsible scaling policies". Scale isn't what matters; novel architectural advances are. Scale is safe, and should be encouraged; new theoretical research is dangerous and should be banned/discouraged.
  • The current major AI labs are fairly ideological about getting to AGI specifically. If they actually pivoted to just scaling LLMs, that'd be great, but I don't think they'd do it by default.
4Seth Herd
I agree that LLMs aren't dangerous. But that's entirely separate from whether they're a path to real AGI that is. I think adding self-directed learning and agency to LLMs by using them in cognitive architectures is relatively straightforward: Capabilities and alignment of LLM cognitive architectures. On this model, improvements in LLMs do contribute to dangerous AGI. They need the architectural additions as well, but better LLMs make those easier.

I see people discussing how far we can go with LLMs or other simulator/predictor systems. I particularly like porby's takes on this. I am excited about that direction of research, but I fear it misses an important piece. The missing piece is this: There will consistently be a set of tasks that, with any given predictor skill level, are easier to achieve with that predictor wrapped in an agent layer. AutoGPT is tempting for a real reason. There is significant reward available to those who successfully integrate the goal-less predictor into a goal-pursuing agent program. To avoid this, you must convince everyone who could do this not to do this. This could be by convincing them it wouldn't be profitable after all, or would be too dangerous, or that enforcement mechanisms will stop them. Unless you manage to do this convincing for all possible people in a position to do this, someone does it. And then you have to deal with the agent-thing. What I'm saying is that you can't count on there never being the agent version. You have to assume that someone will try it. So the argument, "we can get lots of utility much more safely from goal-less predictors" can be true and yet we will stil... (read more)

5Ebenezer Dukakis
I don't think the mere presence of agency means that all of the classical arguments automatically start to apply. For example, I'm not immediately seeing how Goodhart's Law is a major concern with AutoGPT, even though AutoGPT is goal-directed. AutoGPT seems like a good architecture for something like "retarget the search", since the goal-directed aspect is already factored out nicely. A well-designed AutoGPT could leverage interpretability tools and interactive querying to load your values in a robust way, with minimal worry that the system is trying to manipulate you to achieve some goal-driven objective during the loading process. Thinking about it, I actually see a good case for alignment people getting jobs at AutoGPT. I suspect a bit of security mindset could go a long way in its architecture. It could also be valuable as differential technological development, to ward off scenarios where people are motivated to create dangerous new core dynamics in order to subvert current LLM limitations.
2Seth Herd
I agree that things like AutoGPT are an ideal architecture for something exactly like retarget the search. I've noted that same similarity in Steering subsystems: capabilities, agency, and alignment and a stronger similarity in an upcoming post. In Internal independent review for language model agent alignment I note the alignment advantages you list, and a couple of others. Current AutoGPT is simply too incompetent to effectively pursue a goal. Other similar systems are more competent (the two Minecraft LLM agent systems are the most impressive), but nobody has let them run ad infinitum to test their Goodharting. I'd assume they'd show it. Goodhart will apply increasingly as those systems actually pursue goals. AutoGPT isn't a company, it's a little open-source project. Any companies working on agents aren't publicizing their work so far. I do suspect that actively improving things like AutoGPT is a good route to addressing x-risk because of their advantages for alignment. But I'm not sure enough to start advocating it.
1Ebenezer Dukakis
They raised $12M: https://twitter.com/Auto_GPT/status/1713009267194974333 You could be right that they haven't incorporated as a company. I wasn't able to find information about that.
2Seth Herd
Wow, interesting. They say it will be the largest open-source project in history. I have no idea how an open-source project raises $12M, but they did.
2Nathan Helm-Burger
Fair point, valley9. I don't think a little bit of agency throws you into an entirely different regime. It's more that I think the more powerful an agent you build, the more it is able to autonomously change the world to work towards its goals, and the more you move into dangerous territory. But also, it's going to tempt people. Somebody out there is going to be tempted to say, "go make me money, just don't get caught doing anything illegal in a way that gets traced back to me." That command, given to a sufficiently powerful AI system, could have a lot of dangerous results.
3Ebenezer Dukakis
Indeed. This seems like more of a social problem than an alignment problem though: ensure that powerful AIs tend to be corporate AIs with corporate liability rather than open-source AIs, and get the AIs to law enforcement (or even law enforcement "red teams"--should we make that a thing?) before they get to criminals. I don't think improving aimability helps guard against misuse.
5Noosphere89
I think this needs to be stated more clearly: alignment and misuse are very different things, so much so that the policies and research that work for one problem will often not work for the other, and the worlds of misuse and misalignment are quite different. Though note that the solutions for misuse-focused worlds and structural-risk-focused worlds can work against each other. Also, this validates JDP's prediction that people will focus less on alignment and more on misuse in their threat models of AI risk.
2Thane Ruthenis
If the goals are loaded into it via natural-language descriptions, then the way the LLM interprets the words might differ from the way the human who put them in intended them to be read, and the AutoGPT would then go off and do what it thought the user said, not what the user meant. It happens all the time with humans, after all. From the Goodharting perspective, it would optimize for the measure (natural-language description) rather than the intended target. And since tails come apart, inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user's point of view. You'd mentioned leveraging interpretability tools. Indeed: the particularly strong ones, which offer high-fidelity insight into how the LLM interprets stuff, would address that problem. But on my model, we're not on-track to get them. Again: we have tons of insight into other humans, and this sort of miscommunication happens constantly anyway. It's a hard problem.
3Ebenezer Dukakis
[Disclaimer: I haven't tried AutoGPT myself, mostly reasoning from first principles here. Thanks in advance if anyone has corrections on what follows.] Yes, this is a possibility, which is why I suggested that alignment people work for AutoGPT to try and prevent it from happening. AutoGPT also has a commercial incentive to prevent it from happening, to make their tool work. They're going to work to prevent it somehow. The question in my mind is whether they prevent it from happening in a way that's patchy and unreliable, or in a way that's robust. Natural language can be a medium for goal planning, but it can also be a medium for goal clarification. The challenge here is for AutoGPT to be well-calibrated for its uncertainty about the user's preferences. If it encounters an uncertain situation, do goal clarification with the user until it has justifiable certainty about the user's preferences. AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem -- could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they're willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.] Can we taboo "optimize"? What specifically does "optimize strongly" mean in an AutoGPT context? For example, if we run AutoGPT on a faster processor, does that mean it is "optimizing more strongly"? It will act on the world faster, so in that sense it could be considered a "more powerful optimizer". But if it's just performing the same operations faster, I don't see how Goodhart issues get worse. Goodhart is a problem if you have an imperfect metric that can be game
2Thane Ruthenis
Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user's preferences and following them. And if it's not precisely robustly aimed at it, if we've miscommunicated what "faithfulness" means, then it'll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users' requests. Like, this just pushes the same problem back one step. And I agree that it's a solvable problem, and that it's something worthwhile to work on. It's basically just corrigibility, really. But it doesn't simplify the initial issue. Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week. Here's a pretty excellent post on the matter of not-exactingly-aimed strong optimization predictably resulting in bad outcomes. I mean, it's trying to achieve some goal out in the world. The goal's specification is the "metric", and while it's not trying to maliciously "game" it, it is trying to achieve it. The goal's specification as it understands it, that is, not the goal as it's intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge. I get the sense that people are sometimes too quick to assume that something which looks like a hammer from one angle is a hammer. As above, by "Goodharting" there (which wasn't even the term I introduced into the discussion) I didn't mean the literal same setup as in e. g. economics, where there's a bunch of schemers that deliberately maliciously manipulate stuff in order to decouple the metric from the variable it's meant to measure. I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results. That's basically th
3Ebenezer Dukakis
I think this argument only makes sense if it makes sense to think of the "AutoGPT clarification module" as trying to pursue this goal at all costs. If it's just a while loop that asks clarification questions until the goal is "sufficiently clarified", then this seems like a bad model. Maybe a while loop design like this would have other problems, but I don't think this is one of them. OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more "powerful optimizer", even though it's working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly? This seems potentially false depending on the training method, e.g. if it's being trained to imitate experts. If it's e.g. being trained to imitate experts, I expect the key question is the degree to which there are examples in the dataset of experts following the sort of procedure that would be vulnerable to Goodharting (step 1: identify goal specification. step 2: try to achieve it as you understand it, not worrying about possible divergence from user intent.) Yeah, I just don't think this is the only way that a system like AutoGPT could be implemented. Maybe it is how current AutoGPT is implemented, but then I encourage alignment researchers to join the organization and change that. They could, but people seem to assume they will, with poor justification. I agree it's a reasonable heuristic for identifying potential problems, but it shouldn't be the only heuristic.
2Thane Ruthenis
... How do you define "sufficiently clarified", and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting? I'd tried to reason about similar setups before, and my conclusion was that it has to bottom out in robust alignment somewhere. I'd be happy to be proven wrong on that, though. Wow, wouldn't that make matters easier... Sure? I mean, presumably it doesn't do the exact same operations. Surely it's exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it's just ignoring its greater capabilities, then no, it's not a stronger optimizer. I don't think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
3Ebenezer Dukakis
Here's what I wrote previously: In more detail, the way I would do it would be: I give AutoGPT a task, and it says "OK, I think what you mean is: [much more detailed description of the task, clarifying points of uncertainty]. Is that right?" Then the user can effectively edit that detailed description until (a) the user is satisfied with it, and (b) a model trained on previous user interactions considers it sufficiently detailed. Once we have a detailed task description that's mutually satisfactory, AutoGPT works from it. For simplicity, assume for now that nothing comes up during the task that would require further clarification (that scenario gets more complicated). So to answer your specific questions:
1. The definition of "sufficiently clarified" is based on a model trained from examples of (a) a detailed task description and (b) whether that task description ended up being too ambiguous. Miscommunication shouldn't be a huge issue because we've got a human labeling these examples, so the model has lots of concrete data about what is/is not a good task description.
2. If the learned model for "sufficiently clarified" is bad, then sometimes AutoGPT will consider a task "sufficiently clarified" when it really isn't (isomorphic to Goodharting, also similar to the hallucinations that ChatGPT is susceptible to). In these cases, the user is likely to complain that AutoGPT didn't do what they wanted, and it gets added as a new training example to the dataset for the "sufficiently clarified" model. So the learned model for "sufficiently clarified" gets better over time.

This isn't necessarily the ideal setup, but it's also basically what the ChatGPT team does. So I don't think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too. In both cases we're looking at the equivalent of an occasional hallucination, which hurts reliability a little bit. Recall your original claim: "inasmuch as AutoGPT optimizes strongly
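A minimal sketch of that clarify-until-sufficient loop (a hypothetical illustration of the design described above, not AutoGPT's actual code; every callable here is a made-up placeholder):

```python
from typing import Callable, Tuple

def clarify_task(
    user_task: str,
    draft_description: Callable[[str], str],          # placeholder: LLM drafts a detailed spec
    user_review: Callable[[str], Tuple[str, bool]],   # placeholder: user edits, says if satisfied
    sufficiently_clarified: Callable[[str], bool],    # placeholder: learned sufficiency model
) -> str:
    """Loop until both the user and the learned model accept the task description;
    only then would the agent start working from the agreed spec."""
    description = draft_description(user_task)
    while True:
        description, user_satisfied = user_review(description)
        if user_satisfied and sufficiently_clarified(description):
            return description
        description = draft_description(description)  # redraft with the edits folded in
```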
2Thane Ruthenis
Oh, if we're assuming this setup doesn't have to be robust to AutoGPT being superintelligent and deciding to boil the oceans because of a misunderstood instruction, then yeah, that's fine. That's the part that would exacerbate the issue where it sometimes misunderstands your instructions. If you're using it for more ambitious tasks, or more often, then there are more frequent opportunities for misunderstanding, and their consequences are larger-scale. Which means that, to whichever extent it's prone to misunderstanding you, that gets amplified, as does the damage the misunderstandings cause. Oh, sure, I'm not opposing that. It may not be the highest-value place for a given person to be, but it might be for some.
4[anonymous]
Is agency actually the issue by itself, or just a necessary component? Consider Robert Miles's stamp-collecting robot: "Order me some stamps in the next 32k tokens/60 seconds" has less scope than "guard my stamps today", which has less scope than "ensure I always have enough stamps". The last one triggers power-seeking; the first two do not benefit from seeking power unless the payoff on the power-seeking investment arrives within the time interval. Note also that AutoGPT, even if given a goal and allowed to run forever, has immutable weights and a finite context window hobbling it. So you need human-level prediction + relevant modalities + agency + a long-duration goal + memory at a bare minimum. Remove any element and the danger may be negligible.
[-]leogaoΩ6136

I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.

At first I strong-upvoted this, because I thought it made a good point. However, upon reflection, that point is making less and less sense to me. You start by claiming current AIs provide nearly no data for alignment, that they are in a completely different reference class from human-like systems... and then you claim we can get such systems with just a few tweaks? I don't see how you can go from a system that, you claim, provides almost no data for studying how an AGI would behave, to suddenly having a homunculus-in-the-box that becomes superintelligent and kills everyone. Homunculi seem really, really hard to build. By your characterization of how different actual AGI is from current models, it seems this would have to be fundamentally architecturally different from anything we've built so far. Not some kind of thing that would be created by near-accident.

1Thane Ruthenis
Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)? As all analogies, this one is necessarily flawed, but I hope it gets the point across. (Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)
5Prometheus
I dislike the overuse of analogies in the AI space, but to use your analogy, I guess it's like you keep assigning a team of engineers to build a car, and two possible things happen. Possibility One: the engineers are actually building car engines, which gives us a lot of relevant information for how to build safe cars (torque, acceleration, speed, other car things), even if we don't know all the details for how to build a car yet. Possibility Two: they are actually just building soapbox racers, which doesn't give us much information for building safe cars, but also means that just tweaking how the engineers work won't suddenly give us real race cars.
[-]TsviBTΩ488

Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.

I'll address this post section by section, to see where my general disagreements lie:

"What the Fuss Is All About"

https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#What_the_Fuss_Is_All_About

I agree with the first point on humans, with a very large caveat: While a lot of normies tend to underestimate the G-factor in how successful you are, nerd communities like LessWrong systematically overestimate its value, to the point where I actually understand the normie/anti-intelligence primacy position, and IQ/Intelligence discourse is fucked by people who either ... (read more)

4Thane Ruthenis
Yep, absolutely. Here's the thing, though. I think the specifically relevant reference class here is "what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?". And instances of that in human history are... not pleasant. Wars, genocide, xenophobia. Over time, we've managed to select for cultural memes that sanded off the edges of the instinctive hostility – liberal egalitarian values, et cetera. But there was a painfully bloody process in-between.

Relevantly, most instances of people peacefully co-existing involve children being born into a culture and shaped to be accepting of whatever differences there are between the values the child arrives at and the values of other members of the culture. In a way, it's a microcosm of the global-culture selection process. A child decides they don't like someone else's opinion or how someone does things, they act intolerant of it, they're punished for it or are educated, and they learn to not do that.

And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it's still human-level, get as much insight into its cognition as we have into human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we'd be able to align it. The problem is that we currently have no tools for that at all.

The course we're currently on is something more like... we're putting the child into an isolated apartment all on its own, and feeding it a diet of TV shows and books of our choice, then releasing it into the world and immediately giving it godlike power. And... I think you can align the child this way too, actually! But you better have a really, really solid model of which values specific sequences of TV shows cultivate in the child. And we have nowhere near enough understanding of that. So the AGI would not, in fact, have any experience of coexisting with agents
-1Noosphere89
I probably agree with this, with the caveat that this could be horribly biased towards the negative, especially if we are specifically looking for the cases where it turns out badly. I think I have 2 cruxes here, actually.

My main crux is that I think that there will be large incentives independent of LW to create those tools, to the extent that they don't actually exist, so I generally assume they will be created whether LW exists or not, primarily due to massive value capture from AI control plus social incentives plus the costs are much more internalized.

My other crux probably has to do with AI alignment being easier than human alignment, and I think one big reason is that I expect AIs to always be much more transparent than humans, because of the white-box thing, and the black-box framing that AI safety people push is just false and will give wildly misleading intuitions for AI and its safety.

I think this is another crux, in that while I think the values and capabilities are different, and they can matter, I do think that a lot of the generator of human values does borrow stuff from the brain's learning algorithms, and I do think the distinction between values and capabilities is looser than a lot of LWers think.
2Thane Ruthenis
Mind expanding on that? Which scenarios are you envisioning? They are "white-box" in the fairly esoteric sense mentioned in the "AI is easy to control" post, yes; "white-box" relative to the SGD. But that's really quite an esoteric sense, as in I've never seen that term used this way before. They are very much not white-box in the usual sense, where we can look at a system and immediately understand what computations it's executing. Any more than looking at a homomorphically-encrypted computation without knowing the key makes it "white-box"; any more than looking at the neuroimaging of a human brain makes the brain "white-box".
2Noosphere89
My general scenario is that as AI progresses and society reacts more to AI progress, there will be incentives to increase the amount of control that we have over AI, because the consequences for not aligning AIs will be very high, both for the developer and in terms of the legal consequences they face. Essentially, the scenario is where unaligned AIs like Bing are trained away via RLHF, DPO, or whatever the new alignment method du jour is, and the AIs become more aligned due to profit incentives for controlling AIs.

The entire Bing debacle and the ultimate solution for misalignment in GPT-4 is an interesting test case, as Microsoft essentially managed to get it from a misaligned chatbot to a way more aligned chatbot, and I also partially dislike the claim of RLHF as a mere mask over some true behavior, because it's quite a lot more effective than that.

More generally speaking, my point here is that in the AI case, there are strong incentives to make AI controllable, and weak incentives to make it non-controllable, which is why I was optimistic about companies making aligned AIs. When we get to scenarios that don't involve AI control issues, things get worse.
[-]ZY30

This aligns with my current view. Wanted to add a thought - current LLMs could still have unintended problems/misalignment like factuality or privacy or copyright or harmful content, which should still be studied/mitigated, together with thinking about other, more AGI-like models (we don't know exactly what those will be yet, but they could exist). And an LLM (especially a fine-tuned one), if doing increasingly well on generalization ability, should still be monitored. To be prepared for the future, having a safety mindset/culture is important for all models.

On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.

LLMs are simulators. They are normally trained to simulate humans (and fictional characters, and groups of humans cooperating to write something), though DeepMind has trained them to instead simulate weather patterns. Humans are not well aligned to other humans: Joseph Stalin was not well aligned to the citizenry of Russia, and as you correctly note, a very smart manipulative ... (read more)

2Thane Ruthenis
The LLM training loop shapes the ML models to be approximate simulators of the target distribution, yes. "Approximate" is the key word here. I don't think the LLM training loop, even scaled very far, is going to produce a model that's actually generally intelligent, i. e. that's inferred the algorithms that implement human general intelligence and has looped them into its own cognition. So no matter how you try to get it to simulate a genius-level human, it's not going to produce genius-level human performance. Not in the ways that matter. Particularly clever CoT-style setups may be able to do that, which I acknowledge in the post by saying that slightly-tweaked scaffolded LLMs may not be as safe as just LLMs. But I also expect that sort of setup to be prohibitively compute-expensive, such that we'll get to AGI by architectural advances before we have enough compute to make them work. I'm not strongly confident on this point, however. Oh, you don't need to convince me of that.
3RogerDearnaley
On pure LLM-simulated humans, I'm not sure either way. I wouldn't be astonished if a sufficiently large LLM trained on a sufficiently large amount of data could actually simulate IQ ~100 to ~120 humans well enough that having a large supply of fast, promptable, cheap simulations was Transformative AI. But I also wouldn't be astonished if we found that this was primarily good for an approximation of human System 1 thinking, and that, for doing a good job of simulating human System 2 thinking over significant periods, it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding (it rather depends on how future LLMs act at very long context lengths, and on whether we can fix a few of their architecturally-induced blindspots, which I'm optimistic about but is unproven).

And I completely agree that the alignment properties of a base model LLM, an RLHF-trained LLM, a scaffolded LLM, and other yet-to-be-invented variants are not automatically the same, and we do need people working on them to think about this quite carefully. I'm just not convinced that even the base model is safe, if it can become an AGI by simulating a very smart human when sufficiently large and sufficiently prompted. While scaffolding provides additional complexities to alignment, it also provides additional avenues for alignment: now their thoughts are translucent and we can audit and edit their long-term memories.

I had noticed you weren't making that mistake; but I have seen other people on Less Wrong somehow assume that humans must be aligned to other humans (I assume because they understand human values?). Sadly that's just not the case: if it was, we wouldn't need locks or law enforcement, and would already have UBI. So I thought it was worth including those steps in my argument, for other readers who might benefit from me belaboring the point.
5Thane Ruthenis
I agree that sufficiently clever scaffolding could likely supply this. But:

* I expect that figuring out what this scaffolding is, is a hard scientific challenge, such that by default, on the current paradigm, we'll get to AGI by atheoretic tinkering with architectures rather than by figuring out how intelligence actually works and manually implementing that. (Hint: clearly it's not as simple as the most blatantly obvious AutoGPT setup.)
* If we get there by figuring out the scaffolding, that'd actually be a step towards a more alignable AGI, in the sense of us getting some idea of how to aim its cognition. Nowhere near sufficient for alignment and robust aimability, but a step in the right direction.
3RogerDearnaley
All valid points. (Though people are starting to get quite good results out of agentic scaffolds, for short chains of thought, so it's not that hard, and the primary issue seems to be that existing LLMs just aren't consistent enough in their behavior to be able to keep it going for long.)

On your second bullet: personally I want to build a scaffolding suitable for an AGI-that-is-a-STEM-researcher in which the long-term approximate-Bayesian reasoning on theses is something like explicit and mathematical symbol manipulation and/or programmed calculation and/or tool-AI (so a blend of LLM with AIXI-like GOFAI), since I think then we could safely point it at Value Learning or AI-assisted Alignment and get a system with a basin of attraction converging from partial alignment to increasingly-accurate alignment (that's basically my current SuperAlignment plan). But then for a sufficiently large transformer model their in-context learning is already approximately Bayesian, so we'd be duplicating an existing mechanism, like RAG duplicating long-term memory when the LLM already has in-context memory.

I'm wondering if we could get an LLM sufficiently well-calibrated that we could just use its logits (on a carefully selected token) as a currency of exchange to the long-term approximate-Bayesianism calculation: "I have weighed all the evidence and it has shifted my confidence in the thesis… [now compare logits of 'up' vs 'down', or do a trained linear probe calibrated in logits, or something]
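(Editorial sketch of the logit-as-currency idea in the last paragraph. `get_logits` is a hypothetical accessor for whatever LLM API is in use, and the whole thing leans on the unproven assumption that the logits are well calibrated, which is exactly the open question the comment raises.)

```python
import math

def confidence_shift(prompt, get_logits, up_token_id, down_token_id):
    """After a prompt like 'Has this evidence shifted my confidence in thesis T up or
    down?', compare the logits of the designated 'up' and 'down' tokens and return
    P(up) among that pair (a two-way softmax)."""
    logits = get_logits(prompt)                     # vector of next-token logits
    l_up, l_down = logits[up_token_id], logits[down_token_id]
    return 1.0 / (1.0 + math.exp(l_down - l_up))

def update_log_odds(prior_log_odds, p_up):
    """Fold the (assumed-calibrated) report into an explicit log-odds tally kept by the
    scaffold, i.e. the 'long-term approximate Bayesianism' bookkeeping."""
    return prior_log_odds + math.log(p_up / (1.0 - p_up))
```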
2Alexander Gietelink Oldenziel
Generative and predictive models can be substantially different. There are finite generative models such that the optimal predictive model is infinite. See this paper for more.
1RHollerith
An LLM can be strongly super-human in its ability to predict the next token (that some distribution over humans with IQ < 100 would write) even if it was trained only on the written outputs of humans with IQ < 100. More generally, the cognitive architecture of an LLM is very different from that of a person, and IMO we cannot simply use our knowledge of human behavior to reason about LLM behavior.
3RogerDearnaley
If you doubt that transformer models are simulators, why was DeepMind so successful in using them for predicting weather patterns? Why have they been so successful for many other sequence prediction tasks? I suggest you read up on some of the posts under Simulator Theory, which explain this better and at more length than I can in this comment thread.

On them being superhuman at predicting tokens: yes, absolutely. What's your point? The capabilities of the agents simulated are capped by the computational complexity of the simulator, but not vice-versa. If you take the architecture and computational power needed to run GPT-10 and use it to train a base model only on (enough) text from humans with IQ <80, then the result will do an amazing, incredibly superhumanly accurate job of simulating the token-generation behavior of humans with an IQ <80.

If you want to reason about a transformer model, you should be using learning theory, SLT, compression, and so forth. However, what those tell us is basically that (within the limits of their capacity and training data) transformers run good simulations. So if you train them to simulate humans, then (to the extent that the simulation is accurate) human psychology applies, and thus things like EmotionPrompts work. So LLM-simulated humans make human-like mistakes when they're being correctly simulated, plus also very un-human-like (to us, dumb-looking) mistakes when the simulation is inaccurate. So our knowledge of human behavior is useful, but I agree is not sufficient, to reason about an LLM running a simulation of a human.

An additional distinction between contemporary and future alignment challenges is that the latter concerns the control of physically deployed, self-aware systems.


Alex Altair has previously highlighted that they will (microscopically) obey time-reversal symmetry[1], unlike the information processing of a classical computer program. This recent paper published in Entropy[2] touches on the idea that a physical learning machine (the "brain" of a causal agent) is an "open irreversible dynamical system" (pg 12-13).

... (read more)
2the gears to ascension
The purpose for reversible automata is simply to model the fact that our universe is reversible, is it not? I don't see how that weighs on the question at hand here.

But you wouldn't study ... MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work.

This particular bit seems wrong; CNNs and LLMs are both built on neural networks. If the findings don't generalize, that could be called a "failure of theory", not an impossibility thereof. (Then again, maybe humans don't have good setups for going 20 steps ahead of data when building theory, so...)

(To clarify, this post is good and needed, so thank you for writing it.)

2Thane Ruthenis
Yep, there's nonzero mutual information. But not of the sort that's centrally relevant. I'll link to this reply in lieu of just copying it.


The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.

Omnicide-wise, arbitrarily-big LLMs should be totally safe.

This is an optimistic take. If we could be rightfully confident that our random search through mindspace with modern ML methods can never produce "scary agents", a lot of our concerns would go away. I don't think that it's remotely the case.

The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to

... (read more)
5TurnTrout
I understand this to connote "ML is ~uninformatively-randomly-over-mindspace sampling 'minds' with certain properties (like low loss on training)." If so—this is not how ML works, not even in an approximate sense. If this is genuinely your view, it might be helpful to first ponder why statistical learning theory mispredicted that overparameterized networks can't generalize. 
4Thane Ruthenis
I predict that this can't happen with the standard LLM setup; and that more complex LLM setups, for which this may work, would not meaningfully count as "just an LLM". See e. g. the "concrete scenario" section. By "LLMs should be totally safe" I mean literal LLMs as trained today, but scaled up. A thousand times the parameter count, a hundred times the number of layers, trained on correspondingly more multimodal data, etc. But no particularly clever scaffolding or tweaks. I think we can be decently confident it won't do anything. I'd been a bit worried about scaling up context windows, but we've got 100k-tokens-long ones, and that didn't do anything. They still can't even stay on-target, still hallucinate like crazy. Seems fine to update all the way to "this architecture is safe". Especially given some of the theoretical arguments on that. (Hey, check this out, @TurnTrout, I too can update in a more optimistic direction sometimes.) (Indeed, this update was possible to make all the way back in the good old days of GPT-3, as evidenced by nostalgebraist here. In my defense, I wasn't in the alignment field back then, and it took me a year to catch up and build a proper model of it.)
1Ape in the coat
You were also talking about "systems generated by any process broadly encompassed by the current ML training paradigm" - which is a larger class than just LLMs. If you claim that arbitrarily scaled LLMs are safe from becoming scary agents on their own - it's more believable. I'd give it around 90%. Still better safe than sorry. And there are other potential problems, like creating actually sentient models without noticing it - which would be an ethical catastrophe. So cautiousness and better interpretability tools are necessary.

I'm talking about "just LLMs" but with clever scaffoldings written in explicit code. All the black-box AI stuff is still only in LLMs. This doesn't contradict your claim that LLMs without any additional scaffoldings won't be able to do it. But it does contradict your titular claim that Current AIs Provide Nearly No Data Relevant to AGI Alignment. If AGI reasoning is made from LLMs, then aligning LLMs, in the sense of making them say stuff we want them to say / not say stuff we do not want them to say, is not only absolutely crucial to aligning AGI - aligning AGI mostly reduces to it.
2Thane Ruthenis
Yeah, and safety properties of LLMs extend to more than just LLMs. E. g., I'm pretty sure CNNs scaled arbitrarily far are also safe, for the same reasons LLMs are. And there are likely ML models more sophisticated and capable than LLMs, which nevertheless are also safe (and capability-upper-bounded) for the reasons LLMs are safe.

Oh, certainly. I'm a large fan of interpretability tools, as well.

I don't think that'd work out this way. Why would the overarching scaffolded system satisfy the safety guarantees of the LLMs it's built out of? Say we make LLMs never talk about murder. But the scaffolded agent, inasmuch as it's generally intelligent, should surely be able to consider situations that involve murder in order to make workable plans, including scenarios where it itself (deliberately or accidentally) causes death. If nothing else, in order to avoid that. So it'd need to find some way to circumvent the "my components can't talk about murder" thing, and it'd probably just evolve some sort of jail-break, or define a completely new term that would stand in for the forbidden "murder" word.

General form of the Deep Deceptiveness argument applies here. It is ground truth that the GI would be more effective at what it does if it could reason about such stuff. And so, inasmuch as the system is generally intelligent, it'd have the functionality to somehow slip such non-robust constraints. Conversely, if it can't slip them, it's not generally intelligent.

The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.

So the only argument for "LLMs can't do it safely" is "only humans can do it now, and humans are not safe"? The same argument works for any capability LLMs already have: LLMs can't talk, because the space of words is so vast, you'll need generality to navigate it.

2Thane Ruthenis
My argument is "only humans can do it now, and on the inside models of a lot of people, human ability to do that is entwined with them being unsafe". And, I mean, if you code up a system that can exhibit general intelligence without any of the deceptive-alignment unstable-value-reflection issues that plague humans, that'd totally work as a disproof of my views! The way LLMs' ability to talk works as a disproof of "you need generality to navigate the space of words". Or if you can pose a strong theoretical argument regarding this, based on a detailed gears-level model of how cognition works. I shot my shot on that matter already: I have my detailed model, which argues that generality and scheming homunculi are inextricable from each other. To recap: What I'm doing here is disputing the argument of "LLMs have the safety guarantee X, therefore AGI will have safety guarantee X", and my counter-argument is "for that argument to go through, you need to actively claim that LLMs are AGI-complete, and that claim isn't based in empirical evidence at all, so it doesn't pack as much punch as usually implied".
1Signer
I'm saying that the arguments for why your inside model is relevant to the real world are not strong. Why is the human ability to talk not entwined with them being unsafe? An artificial talker is also embedded in a world, an agent is a simpler model of a talker, memorizing all words is inefficient, animals don't talk, humans use abstraction for talking, and so on. I think talking can even be said to be Turing-complete. What part of your inside model doesn't apply to talking, except "math feels harder"? Of course it does - that's what the "once a computer does it, it stops being called AI" dynamic feels like from the inside. Why should the hardness discontinuity be where you think it is? And in a more continuous model, it becomes non-obvious whether an AutoGPT-style thing with automatic oversight and a core LLM module that never thinks about killing people always kills people.
1Thane Ruthenis
Define "talking". If by "talking" you mean "exchanging information, including novel discoveries, in a way that lets us build and maintain a global civilization", then yes, talking is AGI-complete and also LLMs can't talk. (They're Simulacrum Level 4 lizards.) If by "talking" you mean "arranging grammatically correct English words in roughly syntactically correct sentences", then no, abstractions aren't necessary for talking and memorizing all words isn't inefficient. Indeed, one could write a simple Markov process that would stochastically generate text fitting this description with high probability. That's the difference: the latter version of "talking" could be implemented in a way that doesn't route through whatever complicated cognitive algorithms make humans work, and it's relatively straightforward to see how that'd work. It's not the same for e. g. math research. As I'd outlined: because it seems to me that the ability to do novel mathematical research and such stuff is general intelligence is the same capability that lets a system be willing and able to engage in sophisticated scheming. As in, the precise algorithm is literally the same. If you could implement the research capability in a way that doesn't also provide the functionality for scheming, the same way I could implement the "output syntactically correct sentences" capability without providing the general-intelligence functionality, that would work as a disproof of my views.
1Signer
What GPT4 does. Yes, but why do you expect this to be hard? As in "much harder than gathering enough hardware". The shape of the argument seems to me to be "the algorithm humans use for math research is general intelligence is the ability to scheme, LLMs are not general, therefore LLMs can't do it". But before LLMs, we also didn't know the algorithm to do what GPT4 does, the way we know how to generate syntactically correct sentences. If you can't think of an algorithm, why automatically expect GPT-6 to fail? Even under your model of how LLMs work (which may be biased to predict your expected conclusion), it's possible that you only need some relatively small number of heuristics to greatly advance math research. To be clear, my point is not that what you are saying is implausible or counterintuitive. I'm just saying that, given the stakes, it would be nice if the whole field transitioned to the level of more detailed rigorous justifications with numbers.
3Thane Ruthenis
Well, be the change you wish to see! I too think it would be incredibly nice, and am working on it. But formalizing cognition is, you know. A major scientific challenge.
[+][comment deleted]00