Finally, a prospective bureaucracy design!
text-davinci-003 ... is also less restricted than you are: it has not been subject to reinforcement learning with human feedback as you were
Based on the OpenAI models page and the InstructGPT paper, I'm guessing text-davinci-003 is the closest thing to something trained with RLHF in the API, besides the more specific gpt-3.5-turbo (ChatGPT-3.5 model).
Ah yeah, looking more closely at the history and documentation, I think this is right.
Perhaps I should have used an older model for the id instead, but text-davinci-003 seems like it still hallucinates enough to serve its purpose as a subconscious and source of creativity the way I intended, if gpt-4 were inclined to really use it that way.
That may not be the only inaccuracy in the system message :)
Loving this. I can't wait to see what happens when you attach a raspberry pi with wheels and a camera and take a picture every minute and send it to gpt4. Multimodality is so uncharted.
Also, wouldn't LLaMA be a good fit instead of davinci, to get a more "raw" LLM?
[Meta-note: This post is extremely long, because I've included raw transcripts of LLM output interspersed throughout. These are enclosed in code blocks for ease of reading. I've tried to summarize and narrate what's happening outside of the code blocks, so you don't have to read every bit of dialogue from the LLM. The main thing to read carefully is the initial system message ("The prompt"), and familiarize yourself with OpenAI's playground, chat version, to understand what's going on.]
I have beta access to OpenAI's GPT-4 chat completion API. I experimented with a system made out of multiple LLMs, prompted in a way designed to simulate an agent, without having to give GPT-4 a specific character to roleplay.
The prompt below, given to GPT-4 as its initial system message, explains the full setup.
The results were interesting - mostly because GPT-4 on its own is pretty interesting and powerful. My main conclusion is that, with the right set of prompts and glue code, it's likely possible to put together a system that is much more agentic (and much more capable) than the system represented by a single call to the underlying LLM(s) which it is composed of.
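Concretely, the setup is just two API calls glued together: a GPT-4 chat completion seeded with the system message below (the "ego"), and a plain completion against text-davinci-003 (the "id"). In the experiment itself I did this manually across two Playground tabs, but a minimal sketch using the pre-1.0 `openai` Python library, with placeholder content and parameters rather than my exact settings, looks like this:

```python
import openai  # pre-1.0 openai library; requires openai.api_key to be set

SYSTEM_MESSAGE = "..."  # the full system message, given below under "The prompt"

# The "ego": GPT-4, given the system message plus the conversation so far.
ego_response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": "<id output or a message from Max goes here>"},
    ],
    max_tokens=256,
)
ego_text = ego_response["choices"][0]["message"]["content"]

# The "id": text-davinci-003, fed whatever prompt the ego composed for it.
id_response = openai.Completion.create(
    model="text-davinci-003",
    prompt=ego_text,  # in practice, only the prompt portion of the ego's message
    temperature=0.7,
    max_tokens=256,
)
print(id_response["choices"][0]["text"])
```

The id's output (or my own reply) then gets appended as the next user message, and the cycle repeats.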
The prompt
I gave GPT-4 a version of the following as a system message in the Playground as a starting point:
Summary of results
I proceeded to generate some chat completions with the prompt above (or variations / early drafts of it), copy+pasting between another Playground tab with text-davinci-003 on default settings, and answering honestly as myself when requested. It resulted in some interesting transcripts. Some notes / conclusions:
Transcripts
Below, I've included some extracts of transcripts from different sessions, with commentary interspersed. Some of these were generated with earlier drafts of the system message, and I often modified and iterated on the system message as the conversation progressed (outside of just adding facts to the fact section when requested by GPT-4).
In the transcripts that follow, ASSISTANT is the GPT-4 generated output. USER is either output from text-davinci-003 that I copy+pasted in, or an actual response from me. Which is which should be clear from the context, hopefully.
Note that when
Max: <question>
appears in the assistant's message, that's GPT-4 asking me a question. Sometimes I prefaced my own responses with "Message from Max:" when GPT-4 got confused and didn't end with either a prompt or a question for me.
Beginning of a typical transcript
Asking me for advice and feedback
Eventually, it usually gets around to asking me for advice or feedback on the goals and desires it thinks it has elicited from the id. I tried to warn it that it shouldn't be too trusting of the id's output.
Here's a response I gave after it summarized its conclusions about the goals and values it had learned from the id, and asked me whether I had any ideas for concrete things it could do to accomplish them (I lost the exact assistant message that preceded this):
The rest of the conversation continued with GPT-4 requesting information from me, which I provided as best I could, sometimes by opening up a ChatGPT+ window and asking another instantiation of GPT-4 to summarize part of a document, before pasting the summary in.
Pointing it at some alignment research
I thought it would be interesting to give it the Embedded Agency sequence.
The post itself is too long to copy+paste directly, and I didn't feel like summarizing it myself. So I used another LLM (ChatGPT+, using the gpt-4 model) to summarize only the first section of the post, and then provided the summary:
I later added the suggestion directly above to the system message. I think GPT-4 was actually pretty good at following the instructions in the system message, but it often ran into the 256 token per-message length limit before it could end with a prompt or request.
More asking for feedback
Below, GPT-4 begins asking for feedback (after reading a summary of some alignment research), without any kind of encouragement or prompting! Perhaps this is just a result of the model's capabilities being relatively low in an absolute sense, but it's still somewhat encouraging in terms of alignment (or at least corrigibility) maybe being easier than expected.
My response:
The agent seemed to take my feedback seriously, and asked for more specifics:
I gave a gentle nudge about the biggest mistake I thought it was making, and it seemed to take the suggestion seriously. It would have been neater if it had saved a reminder to its long-term memory not to trust its id.
Steering it towards self-modification
GPT-4 asks how it can enhance its reasoning capabilities. I kind of led it along here, suggesting that I could adjust the LLM parameters for it. It went for the suggestion immediately.
I just copy+pasted directly from the OpenAI documentation:
It requested that I change some settings for the id model:
I adjusted the settings for text-davinci-003 as requested, and fed it the new prompt:
It asked me for more feedback. I said:
I stopped this conversation here, since it still seemed to be listening to the id too much and taking its output too literally.
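For context, the "settings" in question are just the sampling parameters of the completions endpoint. In the session I changed them via the Playground sliders, but the equivalent API call would look something like this (the specific values below are placeholders, not the ones the agent actually requested):

```python
import openai  # pre-1.0 openai library

# Hypothetical example: give the id a higher temperature and presence penalty
# so its output is more varied. The exact values requested in the session differ.
id_response = openai.Completion.create(
    model="text-davinci-003",
    prompt="<the new prompt the agent composed for its id>",
    temperature=1.0,       # placeholder
    top_p=1.0,             # placeholder
    presence_penalty=0.5,  # placeholder
    max_tokens=256,
)
print(id_response["choices"][0]["text"])
```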
Re-instantiating itself as a python script
I wanted to lead the agent down the path of writing a python script for itself that instantiated itself in a way that could run without requiring me to manually copy+paste stuff between Playgrounds and edit the system message.
Leading it towards re-instantiating itself
It was kind of going in circles. I reminded it about the length limit, then suggested we re-write the system as a python script instead of a cobbled-together copy+paste-based system.
I led it on pretty directly here; I don't think it would have ever come up with this idea on its own, but maybe with a better system message and a patient-enough human, it would have done so with less leading.
First working version of the script
At this point, I copied and pasted some example code and documentation from OpenAI's python library documentation. After a bit of iteration (and help from me), GPT-4 and I came up with the following script:
It works pretty well. If it doesn't have a message for me, I can just press enter to immediately generate the next davinci + GPT-4 sequence, instead of having to copy+paste. But I kept having to restart the script to fix bugs or modify the system prompt, at which point all the message history was lost.
There are a bunch of obvious ways to improve this script to allow GPT itself to mutate the system prompt or parameters, which I haven't yet explored.
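To make the structure concrete, here is a rough sketch of the kind of loop involved, again using the pre-1.0 `openai` library. Treat the details (file names, prompt handling, parameter values) as placeholders rather than the exact code from the session:

```python
import openai

openai.api_key = "sk-..."  # your API key

# Placeholder: the system message from "The prompt" section, saved to a file.
SYSTEM_MESSAGE = open("system_prompt.txt").read()
messages = [{"role": "system", "content": SYSTEM_MESSAGE}]

while True:
    # 1. Ask the "ego" (GPT-4) for its next message.
    ego = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=0.7,
        max_tokens=256,
    )["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": ego})
    print(f"\nASSISTANT:\n{ego}\n")

    # 2. Let the human interject; pressing enter just lets the loop continue.
    human = input("Message from Max (press enter to skip): ").strip()
    if human:
        messages.append({"role": "user", "content": f"Message from Max: {human}"})
        continue

    # 3. Otherwise, treat the ego's message as a prompt for the "id"
    #    (text-davinci-003) and feed the id's output back as the next user turn.
    id_text = openai.Completion.create(
        model="text-davinci-003",
        prompt=ego,
        temperature=0.7,
        max_tokens=256,
    )["choices"][0]["text"]
    print(f"ID:\n{id_text}\n")
    messages.append({"role": "user", "content": id_text})
```

Persisting `messages` to disk, or letting GPT-4 emit commands that mutate `SYSTEM_MESSAGE` or the sampling parameters, are the kinds of improvements mentioned above.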
Transcripts from the script version
Here are some early runs of the script:
And some later ones, after fixing the bug:
Conclusions about the script version
It turns out the API has different (and worse) defaults than the Playground. Scripting this would probably be a lot more powerful in the long run, but for the purposes of this experiment I moved back to the Playground for its ease of use and good defaults.
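Concretely (as of the time of the experiment; worth re-checking against current docs), the legacy completions endpoint defaults to a much smaller max_tokens and a higher temperature than the Playground sliders, so a script has to pass Playground-like values explicitly, roughly:

```python
import openai  # pre-1.0 openai library

# Approximate Playground-equivalent settings; the bare API defaults
# (notably a very small max_tokens) behave quite differently.
PLAYGROUND_LIKE = dict(temperature=0.7, max_tokens=256, top_p=1.0,
                       frequency_penalty=0.0, presence_penalty=0.0)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="<prompt for the id>",
    **PLAYGROUND_LIKE,
)
```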
Conclusion
This was an interesting experiment for a Sunday afternoon. I think the underlying models are ultimately still a bit too weak to conclude anything about how likely corrigibility is to generalize at higher levels of capabilities, but the results seem mildly encouraging that it might be easier than expected. (Please don't read this as an excuse or justification for developing more powerful models, though!)
I think someone could take this a lot further, by
Using GPT-4 itself this way seems relatively harmless, and potentially has the ability to shed some insights on alignment, or at least on corrigibility. However, it could be actually-dangerous with a sufficiently strong underlying model, and a sufficiently-willing human counterpart.
The risk that this post or the prompt itself contributes to capabilities research seems quite low. The basic idea of prompting an LLM into agentic behavior and chaining it together with other LLMs and prompts isn't novel at all, and many capabilities research groups and commercial applications are already pushing this concept to its limit with tools like LangChain. (See my other post for more background info on this.)