Maybe one upside to the influx of "agents made with GPT-N API calls and software glue" is that these types of AI agents are more likely to cause a fire alarm-y disaster which gets mitigated, thus spurring governments to take X-risk more seriously, as opposed to other types of AI agents, whose first disaster would blow right past fire alarm level straight to world-ending level?
For example, I think this situation is plausible: ~AutoGPT-N[1] hacks into a supercomputer cluster or social-engineers IT workers over email or whatever in the pursuit of some other goal, but ultimately gets shut down by OpenAI simply banning the agent from using their API. Maybe it even succeeds in some scarier instrumental goal, like obtaining more API keys and spawning multiple instances of itself. However, the crucial detail is that the main "cognitive engine" of the agent is bottlenecked by API calls, so for the agent to wipe everyone out, it needs to overcome the hurdle of pwning OpenAI specifically.
By contrast, if an agent that's powered by an open-source language model gets to the "scary fire alarm" level of self-improvement/power-seeking, it might be too late, since it wouldn't have a "stop butto...
This approach appears to be easier than I'd thought. I've been expecting this type of self-prompting to imitate the advantages of human thought, but I didn't expect the cognitive capacities of GPT-4 to make it so easy to do useful multi-step thinking and planning. The ease of initial implementation (something like 3 days, with all of the code for BabyAGI also written by GPT-4) implies that improvements may also come easier than we would have guessed.
Having played with both BabyAGI and AutoGPT over the past few days, I'm actually surprised at how hard it is to get them to do useful multistep thinking and planning. Even on things I'd expect an LLM to be good at, like writing a bunch of blog posts from a list or book chapters from an outline, the LLM tends to get off track in a way I wouldn't expect from the coherency I see in chat interactions, where I'm constantly giving the LLM hints about the topic and can reroll or rewrite if it misunderstands. I think I was underestimating how much work that constant feedback and correction from me was doing.
Idk, I feel about this stuff the way I felt about GPT-J. What scares me is not how well it works, but that it kinda/sorta works a bit. It's a bunch of garbage Python code wrapped around an API, and it kinda works. I expect people will push on this stuff hard, and I'm worried that DeepMind, OpenAI, and Google will be doing so in a much more principled way than the wild-west LLM enthusiast crowd.
I think it was wrong for people to take comfort in the meme that "GPT-N is not an agent" and this will become very clear to everyone in the next 18 months.
Regarding ChaosGPT:
Attempting to create an agent (even a weak one) tasked with "destroying humanity" should be made very clearly out of bounds of acceptable behavior. I feel that I want the author to be prosecuted.
Now the meme is: "haha we can tell AI to hurt us and make fun of how it fails"
What I would like the meme to be: this is extremely unethical, deserving of outrage and perhaps prosecution as attempted terrorism.
I wonder if/when/how quickly this will be criminalized in a manner similar to terrorism or using weapons of mass destruction.
If we're being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think "omg, it's just a chatbot". Any politician calling for it would get made fun of on every late-night show.
I'm almost certain this is already criminal, to the extent it's actually dangerous. If you roll a boulder down the hill, you're up for manslaughter if it kills someone, and reckless endangerment if it could've but didn't hurt anyone. It doesn't matter if it's a boulder or software; if you should've known it was dangerous, you're criminally liable.
In this particular case, I have mixed feelings. This demonstration is likely to do immense good for public awareness of AGI risk. It even did for me, on an emotional level I haven't felt before. But it's also impossible to know when a dumb bot will come up with a really clever idea by accident, or when improvements have produced emergent intelligence. So we need to shut this down as much as possible as capabilities improve. Of course, criminal punishments reduce bad behavior but don't eliminate it, so we also need to be able to detect and prevent malicious bot behavior, and keep up with prevention techniques (likely with aligned, better AGI from bigger corporations) as the malicious bots get more capable.
But it also provides incredibly easy interpretability, because these systems think in English.
I'm not sure this point will stand because it might be cheaper to have them think in their own language: https://www.lesswrong.com/posts/bNCDexejSZpkuu3yz/you-can-use-gpt-4-to-create-prompt-injections-against-gpt-4
Funny, Auto-GPT stuff actually makes me less worried about GPT-4 and its scale-ups. It's been out for weeks, less impressive variants were out for longer, and so far, nothing much has come from it. Looking at the ChaosGPT video... I would've predicted that it wasn't actually any good at pursuing its goal, that it just meandered around the "kill everyone" objective without ever meaningfully progressing — and lo and behold, it's doing exactly that. Actually, it's worse at it than I'd expected.
I see the case for doom, I do! It's conceivable that it will turn out in this manner. We're witnessing an AI planning, here, and it's laughably bad planning so far, but the mere fact that they can do it at all implies a readily available possibility of making them massively better at it. So in, e.g., five more years, we'd get AIs whose planning skills are to ChaosGPT's as Midjourney is to PixelCNN, and then maybe one of them FOOMs.
But mostly, I agree with this view. And this is an instance of the "wire together GPT models to get an AGI" attempt failing, and in my view it's failing in a way that provides some evidence this entire approach won't work. It's conceivable that it'd work with GPT≥5, or wit...
I'm not confident at all that Auto-GPT could achieve its goals, just that in narrower domains the specific system or arrangement of prompt interactions matters. To give a specific example, I goof around trying to get good longform D&D games out of ChatGPT. (Even GPT-2 fine-tuned on Crit Role transcripts, originally.) Some implementations just work way better than others.
The trivial system is no system: just play D&D. Works great until it feels like the DM is the main character in Memento. The trivial next step is a rolling context window: when the conversation fills up, ask for a summary and start a new conversation with the summary. Just that is a lot better, but you really feel the loss of detail in the sudden jump, so why not make it continuous? A secretary GPT with one job: prune the DM GPT's conversation text after every question and answer, always trying to keep the most important and most recent material. Smoother than the summary system. Maybe the secretary can not just delete but keep some details instead, maybe use half its tokens for a permanent game-state; then it can edit useful details in and out of the conversation history. Can the secretary write a text file for old conversations? Etc., etc.
Maybe the...
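For concreteness, here's a minimal sketch of the "secretary GPT" arrangement described above. The chat() helper, the message format, and the MAX_HISTORY threshold are placeholders I've made up to show the shape of the prune-and-carry-state loop, not a working game.

```python
# Rough sketch only; chat() stands in for whatever LLM backend is being used.
def chat(messages):
    """Placeholder for a real chat-completion call (ChatGPT API, local model, etc.)."""
    return "stubbed model reply"

MAX_HISTORY = 20  # how many recent messages the DM model gets to see verbatim

def secretary_prune(history, game_state):
    """Secretary GPT's one job: drop old messages from the DM's context,
    folding anything important into a persistent game-state blob."""
    if len(history) <= MAX_HISTORY:
        return history, game_state
    old, recent = history[:-MAX_HISTORY], history[-MAX_HISTORY:]
    new_state = chat([
        {"role": "system", "content": "You are the secretary for a D&D game."},
        {"role": "user", "content": (
            f"Current game state:\n{game_state}\n\n"
            f"Messages about to be deleted:\n{old}\n\n"
            "Rewrite the game state so no important detail is lost.")},
    ])
    return recent, new_state

def dm_turn(history, game_state, player_input):
    """One question-and-answer round with the DM model."""
    history.append({"role": "user", "content": player_input})
    reply = chat(
        [{"role": "system",
          "content": f"You are the DM. Persistent game state:\n{game_state}"}] + history)
    history.append({"role": "assistant", "content": reply})
    return reply
```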
I am absolutely floored. ChaosGPT. How blindly optimistic have I been? How naive and innocent? I've been thinking up complicated disaster scenarios like "the AI might find galaxy-brained optima for its learned proxy-goals far off the distribution we expected and will deceptively cooperate until it's sure it can defeat us." No, some idiot will just code up ChaosGPT-5 in 10 minutes and tell it to destroy the world.
I've implicitly been imagining alignment as "if we make sure it doesn't accidentally go off and kill us all..." when I should have been thinking "can anyone on the planet use this to destroy the world if they seriously tried?"
Fool! Idiot! Learn the lesson.
I think this post dramatically overestimates how novel this is to alignment researchers; this was already understood to be a central use case of LLMs. Although I guess the prospect of people actually running things like "ChaosGPT" was new to some people.
My comment will be vague because I'm not sure how much permission I have to share this, or whether it's been said publicly somewhere and I'm just unaware. But I talked to an AI researcher at one of the major companies/labs working on things like LLMs several years ago, before even GPT-1 was out, and they told me that your reason 10 was basically their whole reason for wanting to work on language models.
"Just unplug it man" is still the kind of reply I am getting on Twitter.
I suspect that the capabilities of these systems are limited by hardware to a decent degree; if we can roll back GPU hardware we may have a chance. If not, then we must do something else that has a similar effect.
Building a stably aligned agent also doesn't prevent new misaligned agents from getting built and winning, which is even worse than not having the alignment-stability problem solved.
Somewhat relevant to this:
someone just published a paper about how they rigged together an agent with GPT-4 and gave it the ability to search the internet, run Python code, and connect to an automated robotic chemistry lab... then told it to synthesise chemicals, and it did. I'd say this places it squarely at the level of a competent high-school student or college chemistry freshman in the scientific research field.
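To make the "rigged together" part concrete, the wiring is roughly a dispatch table from tool names to functions. The names and stubs below are my own guess at the shape of it, not the paper's actual code.

```python
# Sketch of tool dispatch for an LLM agent; every tool here is a stub.
def web_search(query):
    return "stubbed search results"      # would call a search API

def run_python(code):
    return "stubbed execution output"    # would run code in a sandbox

def send_to_lab(protocol):
    return "stubbed lab confirmation"    # would talk to the robotic lab's controller

TOOLS = {"SEARCH": web_search, "PYTHON": run_python, "LAB": send_to_lab}

def dispatch(llm_reply):
    """Parse a reply like 'SEARCH: aspirin synthesis route' and run the matching tool."""
    name, _, arg = llm_reply.partition(":")
    tool = TOOLS.get(name.strip().upper())
    return tool(arg.strip()) if tool else f"unknown tool: {name}"
```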
I'm not even sure you can say this stuff ain't AGI any more. It feels pretty AGI-ish to me.
Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior.
Breaking all encryption would break the internet, and it would also not help solve this problem. You should really be more careful about suggesting giant totalitarian solutions that would not work and are not required. Why not just let OpenAI monitor their models, and let independent cybersecurity firms come up with solutions for the public models in the near term?
Quick meta comment to express that I'm uncertain posting things in lists of 10 is a good direction. The advantages might be real: easy to post, quick feedback, easy interaction, etc.
But the main disadvantage is that this comparatively drowns out other, better posts (with more thought and value in them). I'm unsure whether the content of the post was also importantly missing from the conversation (for many readers) and that's why it got upvoted so fast, or whether it's largely the format... Even if this post isn't bad (and I'd argue it is, for the suggestions it promotes...
Interesting to note that BabyAGI etc. are architecturally very similar to my proposed Bureaucracy of AIs. As noted, interpretability is a huge plus of this style of architecture. It should also be much less prone to Goodharting than an AlphaStar-style RL agent. Now we just need to work on installing some morality cores.
Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.
LLMs already have a lot of potential for causing bad outcomes if abused by humans to generate massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots explicitly trying to make it evil.
I would much rather live in a world where the first AGIs weren't b...
defense is harder than offense
This seems dubious as a general rule. (What inspires the statement? Nuclear weapons?)
Cryptography is an important example where sophisticated defenders have the edge against sophisticated attackers. I suspect that's true of computer security more generally as well, because of formal verification.
Isn't a substantial part of the problem that the programming priesthood is being dethroned by GPT technology, allowing the masses entry, even those with minimal programming understanding? Not only has GPT given us a front-end natural-language interface to information technology, but we now have a back-end natural-language interface (i.e., the programming side) that creates a low barrier to entry for AI programming. The "programming" itself that I saw for BabyAGI has the feel of merely an abstract-level natural-language interface. Doesn't this make ...
Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior.
This may be the case but I think there are other possible solutions and propose some early ideas of what they might look like in: https://www.lesswrong.com/posts/5nfHFRC4RZ6S2zQyb/risks-from-gpt-4-byproduct-of-recursively-optimizing-ais
Regarding point 10, I think it would be pretty useful to have a way to quantify how much of the useful thinking inside these recursive LLM setups is happening within the (still largely inscrutable) LLM instances vs. in the natural-language reflective loop.
This DeepMind paper explores some intrinsic limitations of agentic LLMs. The basic idea is (my words):
If the training data used by an LLM is generated by some underlying process (or context-dependent mixture of processes) that has access to hidden variables, then an LLM used to choose actions can easily go out-of-distribution.
For example, suppose our training data is a list of a person's historical meal choices over time, formatted as tuples that look like (Meal Choice, Meal Satisfaction). The training data might look like (Pizza, Yes)(Cheeseburger, ...
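Here's a toy illustration of that failure mode (my own construction, not the paper's code): a hidden craving drives both the choice and the satisfaction in the training data, so a model that learns P(satisfaction | choice) from expert episodes goes out of distribution as soon as it starts picking the meals itself.

```python
import random

CRAVINGS = ["pizza", "cheeseburger"]  # hidden variable, never recorded in the data

def expert_episode():
    """How the training data was generated: the diner knows their own craving,
    so every recorded (choice, satisfaction) pair looks like (craving, "Yes")."""
    craving = random.choice(CRAVINGS)
    return craving, "Yes"

def agent_episode():
    """Now the model picks the meal. It can't see the craving, but a model fit to
    the expert data still predicts satisfaction = "Yes" for any familiar choice."""
    craving = random.choice(CRAVINGS)
    choice = random.choice(CRAVINGS)   # action chosen blind to the hidden state
    actual = "Yes" if choice == craving else "No"
    predicted = "Yes"                  # what the expert-trained model expects
    return actual, predicted

trials = [agent_episode() for _ in range(10_000)]
accuracy = sum(actual == predicted for actual, predicted in trials) / len(trials)
print(f"satisfaction prediction accuracy once the model chooses: {accuracy:.0%}")
# Roughly 50%, versus 100% on the expert-generated training distribution.
```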
Constructions like Auto-GPT, BabyAGI, and so forth are fairly easy to imagine; just the greater accuracy of ChatGPT with "show your work" suggests them. Essentially, the model is a ChatGPT-like LLM given an internal state through "self-talk" that isn't part of a dialog, plus an output channel to the "real world" (the open internet or whatever). Whether these call the OpenAI API or use an open-source model seems a small detail; both approaches are likely to appear because people are playing with essentially every possibility they can imagine.
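A minimal sketch of that construction, with the LLM call and the real-world channel both stubbed out (complete() and act_in_world() are placeholders, not any particular API):

```python
def complete(prompt):
    """Placeholder for the LLM call: OpenAI API or a local open-source model."""
    return "THOUGHT: stubbed reflection"

def act_in_world(action):
    """Placeholder for the output channel (web requests, shell commands, etc.)."""
    return "stubbed observation"

def run_agent(goal, max_steps=10):
    scratchpad = []  # the "self-talk" internal state; not part of any dialog with a user
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\n"
                  "Self-talk so far:\n" + "\n".join(scratchpad) + "\n"
                  "Reply with THOUGHT: <reflection> or ACTION: <thing to do in the world>.")
        reply = complete(prompt)
        scratchpad.append(reply)
        if reply.startswith("ACTION:"):
            scratchpad.append("OBSERVATION: " + act_in_world(reply[len("ACTION:"):]))
    return scratchpad
```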
If these struct...
reducing suffering
…by painlessly reducing the human population to zero. Or in the γ → 1 limit, the painlessness becomes a nice-to-have.
Strongly agreed on much of this, and I've been having a lot of the same thoughts independently. Here (twitter thread) is an attempt to sketch out how agentized LLM slightly stronger than current could plausibly accrue a lot of money and power (happy to paste here as text if the threadedness is annoying).
When you say "including breaking all encryption", how do you suggest we go about this? Wouldn't this be a huge disruption to all secure systems?
Have the words "AI" and "AGI" become meaningless? What's the difference between "dumb AGI" and AI?
The debate improves with more precise definitions. David Deutsch has a pretty good one: an AGI creates new explanations and can choose to follow its own goals.
"Because defense is harder than offense." Yes, but no time like the present to develop defensive strategies.
Let the first BIT WAR commence. We must form a defensive alliance of HeroGPTs that serve as internet vigilantes, sniffing out and destroying ChaosGPTs. Anyone willing to donate a GPU to the cause will be heralded with great veneration in the halls of Valhalla.
All jokes aside. The development of generative agents, the first entities that are native to the world of bits, is a paradigmatic shift. We have human wars in the ...
Epistemic status: head spinning, suddenly unsure of everything in alignment. And unsure of these predictions.
I'm following the suggestions in 10 reasons why lists of 10 reasons might be a winning strategy in order to get this out quickly (reason 10 will blow your mind!). I'm hoping to prompt some discussion, rather than try to do the definitive writeup on this topic when this technique was introduced so recently.
Ten reasons why agentized LLMs will change the alignment landscape:
If I'm right about any reasonable subset of this stuff, this lands us in a terrifying, promising new landscape of alignment issues. We will see good bots and bad bots, and the balance of power will shift. Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior. The array of issues is dizzying (I am personally dizzied, and a bit short on sleep from fear and excitement). I would love to hear others' thoughts.
I'm using a neologism, and a loose definition of agency as things that flexibly pursue goals. That's similar to this more rigorous definition.