

Pre-preface

This post was written about a year ago, but I decided not to publish it out of concern about burning timelines. Since then, after the hype caused by ChatGPT, similar systems have been proposed, implemented, and even tested; so I think that "the secret", if there ever was one, is out, and there is no more risk in publishing it. Moreover, I think that publishing it will be net positive, because I can see ways in which we can align such systems. I will leave notes in italics throughout this text to comment on things that have changed since this post was written, otherwise leaving the text unchanged (aside from minor grammar edits).
This was written before Simulators by Janus, but what is described here is essentially a simulacrum. I call it an Identity-Context-Action Simulacra, or ICA Simulacra for short.
Examples of early systems of this kind:
Langchain
Task-driven autonomous agent using Langchain
 

Preface

This post is inspired by many hours of thinking about human consciousness and the state of AI research, some reading here and there, and of course some hands-on experience of working on AI systems, particularly in NLP. Recent releases of big transformer models have made it clear that language models have almost reached, or even surpassed, human-level performance on many tasks, with some researchers even hinting at greater capabilities given the right prompt.

That led me to thinking that we are what you might call "one prompt away" from AGI.
It is, of course, an oversimplification of what needs to be done to reach truly human-level (or sub-human-level) agents capable of self-improvement and independent action.
To reach the true potential of the proposed architecture we need to improve the attention window (Note: it has since been significantly improved, from ~2k to 32k tokens), research better ways of implementing submodules (Note: well, see OpenAI Tools), and gather a lot of rather specific, not-yet-existing training data. Moreover, by my estimation we are not even close, in terms of compute, to running continuous inference of such a model.
But I’m getting ahead of myself here a little bit.

The following is my very brief take on what these agents might look like, what is needed for them to work, and potential research avenues. The aim is to inspire and introduce more than to solidify a theory.

The one prompt

Before getting to the part where I describe the one-prompt-to-rule-them-all, I have to say that it's not really "one prompt"; it's more a way of approaching prompts and transformer models in general.

My understanding is roughly as follows: big transformer models work as a sort of subconscious part of human cognition, so to reach human-level cognition we have to add the conscious part to it. Think of dialog transformer agents for a second: they typically work by specifying the "identity" of the agent, then the dialog context (previous replies) or lack thereof, then appending the user input, and finally running inference on all of this with something like "LaMDA:" at the end.
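As a concrete illustration, here is a minimal sketch of that kind of prompt assembly. `llm_complete` is a hypothetical stand-in for whatever completion API backs the agent, and the identity text and speaker names are invented for the example.

```python
def build_dialog_prompt(identity: str, history: list[str], user_input: str) -> str:
    """Assemble the prompt the way dialog agents are typically driven:
    identity first, then prior turns, then the new user turn, then the
    agent's speaker tag so the model continues as the agent."""
    turns = "\n".join(history + [f"User: {user_input}"])
    return f"{identity}\n\n{turns}\nLaMDA:"

# Hypothetical usage; llm_complete would be whatever completion API backs the agent.
prompt = build_dialog_prompt(
    identity="LaMDA is a helpful, friendly dialog agent.",
    history=["User: Hi there.", "LaMDA: Hello! How can I help?"],
    user_input="What are you thinking about today?",
)
# reply = llm_complete(prompt)
```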

This is roughly how I see the universal model working, with the prompt including the following parts:

  1. Identity (Personality, etc)
  2. Context (World-state, if you will)
  3. Action

With the crucial part being, of course, the model itself, which should be fine-tuned on new historical interaction data every k hours. This should be something akin to sleep, when a lossy record of your day is written into your long-term memory. The parallels kind of write themselves at this point.
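To make the three-part prompt concrete, here is a minimal sketch of how it might be assembled; the section headers and formatting are my own assumptions, not a fixed spec.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    identity: str  # personality, goals, taboos; changes rarely, if at all
    context: str   # the current world-state as seen by the agent

def build_ica_prompt(state: AgentState) -> str:
    """Identity, then Context, then a cue for the model to produce an Action."""
    return (
        f"# Identity\n{state.identity}\n\n"
        f"# Context\n{state.context}\n\n"
        f"# Action\n"
    )
```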

One important part that I see missing in most if not all models right now is self-inference. Models don't run per se; they just work query by query, in chunks. I think it will be important in the future to let models run themselves, inferring every n ms and choosing their action, or lack thereof.
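A rough sketch of what "running itself" could look like: the agent is inferred on a fixed tick, and a null action is an allowed output. `llm_complete`, `sense_world`, and `execute` are hypothetical stand-ins, and the tick length is an arbitrary choice for illustration.

```python
import time

TICK_SECONDS = 0.5  # the "every n ms" interval; arbitrary value for illustration

def run_agent(identity: str, llm_complete, sense_world, execute):
    """Continuously: refresh the Context, infer an Action, execute it (or do nothing)."""
    while True:
        context = sense_world()  # update the world-state the agent perceives
        prompt = f"# Identity\n{identity}\n\n# Context\n{context}\n\n# Action\n"
        action = llm_complete(prompt)
        if action.strip() and action.strip() != "NO_OP":  # a lack of action is allowed
            execute(action)
        time.sleep(TICK_SECONDS)
```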

For example, right now I am Ozyrus, an aspiring AI researcher 27 years of age. I see before me my computer screen and keyboard, thinking of how to continue this paper. I decided to write this paragraph, so I did.

It's a gross oversimplification of the processes happening under the hood, of course. The "inside prompt" presented my surroundings and my thoughts to me not as text per se, but as an ever-changing flow of information (tokens?), some encoding visual stimuli, some thoughts, et cetera. And the action of writing the paragraph was in fact many actions of writing, deleting, rewriting, et cetera. This is why I think it's important to broaden the context window and run inference continuously: each separate moment of human existence can't fit into a 2048-token window, and superhuman existence might take even wider ones. (Note: It seems OpenAI thinks so as well, given GPT-4's significantly wider window.)

Identity

Probably the most important part of the whole system, because it should, in theory, directly affect the alignment of such systems. It should contain an explicit description of the agent's identity, goals, and taboos. Whether the agent should be able to modify this part of itself is up for discussion; my take is that modifying one's identity is part of one's progress as a person. Maybe that take is itself a non-modifiable part of my identity?
Anyway, this seems like a very promising avenue of research right about now -- whether it's a convergent instrumental goal to modify one's identity or to stay the same in at least some ways; if we discover that identity is very important for any type of agent, this might be our saving grace in terms of alignment.

Of course, it's important to note that this is only the conscious part of identity; I would predict that an agent with the most noble explicit identity could still be corrupted by unaligned training data, just as bad experiences might lead noble people down a dark path while their inside view of themselves as good and noble remains intact.

And, of course, Identity is part of Context; it just shouldn't change (much) with the flow of input.

(Note: The system prompt introduced in OpenAI chat models can be considered the "Identity" part.)

Context

Context is the world-state for the agent: it's what is happening around it at any given interval of time. For dialog agents, it's just the dialog history. For what I predict will be the first useful implementations of this architecture, it would be the browser window.
I predict that for research purposes, the first agents of this type could use other transformer models, fine-tuned to generate fluid context. For a superintelligence it may well broaden to cosmic proportions. For humans, well, you can see and feel for yourself. Context can and should be fluid, especially with small context windows; I predict it would be counterproductive to simply feed in everything the agent can sense. There is a potential place for a submodule here that filters out useless information and feeds only the useful parts into the prompt. For example, at first, instead of streaming visual data into a multimodal model, we could use vision transformers to caption that data. Same with audio, et cetera. This would, of course, work together with the ability to shift the focus of attention via the following module, Action.
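One possible shape for such a filtering submodule, sketched under the assumption that separate captioning and transcription models condense raw streams before anything reaches the prompt; `caption_image` and `transcribe_audio` are hypothetical stand-ins, and the focus values are invented.

```python
def build_context(visual_frame, audio_chunk, focus: str,
                  caption_image, transcribe_audio) -> str:
    """Condense raw sensory streams into short text, keeping only what the
    current focus of attention asks for, so the Context fits a small window."""
    pieces = []
    if focus in ("all", "visual"):
        pieces.append("You see: " + caption_image(visual_frame))      # vision -> caption
    if focus in ("all", "audio"):
        pieces.append("You hear: " + transcribe_audio(audio_chunk))   # audio -> transcript
    return "\n".join(pieces)
```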

Action

Action is not a prompt module per se, of course. It's where the inference happens, using Identity and Context. But for the output to become an Action, there have to be submodules that execute these actions. Without them, the actions are just thoughts -- which are very important in their own way, of course, as they shape the following actions in the end.

What could these submodules look like? The most basic, of course, should be "Type". Others might take more sophisticated subsystems. For browser-based implementations that's "MoveCursor", "Click", etc. "MoveAttention" (Focus) is the other potential action discussed above in the Context section. The optimal level of abstraction for these submodules is, of course, a matter for research -- I'm pretty sure it's counterproductive to make the model output vectors describing how the mouse should move. The more sophisticated an instrument is, the clearer it becomes that it is best left to a subsystem; my intuition is that we humans do it that way. Just to be explicit -- submodules detect the action encoded in the output and execute it, which in turn influences the Context.
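A minimal sketch of how submodules might detect and dispatch actions encoded in the model's output. The `Name: argument` line format and the specific handlers are assumptions for illustration, not a proposed standard.

```python
def execute_action(model_output: str, handlers: dict) -> None:
    """Scan the model's output for lines like 'Click: submit_button' and
    dispatch them to the matching submodule; unmatched text is just a thought."""
    for line in model_output.splitlines():
        if ":" not in line:
            continue  # no action marker on this line
        name, _, argument = line.partition(":")
        handler = handlers.get(name.strip())
        if handler is not None:
            handler(argument.strip())  # the submodule acts, which changes the Context

# Hypothetical handlers for a browser-based agent.
handlers = {
    "Type": lambda text: print(f"typing: {text}"),
    "Click": lambda target: print(f"clicking: {target}"),
    "MoveCursor": lambda pos: print(f"moving cursor to: {pos}"),
    "MoveAttention": lambda focus: print(f"shifting focus to: {focus}"),
}
execute_action("I should submit the form.\nMoveCursor: 120,80\nClick: submit_button", handlers)
```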

I also think there's a place for Memory here -- both writing to and reading from an ordinary database.

And of course, it is through this module that the system can fine-tune itself on the novel data it has experienced (Sleep), and self-improve by editing its Identity and, ultimately, its very code.
Note: part of this is currently implemented as “Tools” in the recent OpenAI API.
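Memory and Sleep could plausibly be exposed the same way, as actions whose handlers touch a database and a fine-tuning job. Everything below is a hypothetical placeholder -- the in-memory "database", the interaction log, and `start_finetune_job` stand in for a real store and a real fine-tuning endpoint.

```python
import json

MEMORY_DB = {}         # stand-in for a real database
INTERACTION_LOG = []   # everything the agent saw and did since the last "sleep"

def write_memory(entry: str) -> None:
    """Handler for a hypothetical 'WriteMemory' action."""
    MEMORY_DB[len(MEMORY_DB)] = entry

def read_memory(query: str) -> str:
    """Handler for a hypothetical 'ReadMemory' action; trivial substring lookup,
    where a real system would use proper retrieval."""
    return next((v for v in MEMORY_DB.values() if query in v), "")

def sleep_and_finetune(start_finetune_job) -> None:
    """Every k hours: dump the accumulated interactions to a training file and
    hand it to whatever fine-tuning endpoint the model provider exposes."""
    with open("interactions.jsonl", "w") as f:
        for record in INTERACTION_LOG:
            f.write(json.dumps(record) + "\n")
    start_finetune_job("interactions.jsonl")  # hypothetical stand-in
    INTERACTION_LOG.clear()
```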

Model

Core to the agent, the Model is the huge unconscious part that actually generates the next tokens. As said above, I predict that we will need a lot of novel training data before anything works productively here.

I don't think current big text models can be usefully run on such prompts out of the box; rather, they can be used for alignment experiments and basic research on how such agents work. Another potentially good use for these models is dataset creation for fine-tuning the actual agent models, when prompted correctly and filtered by human assessors.

I am pretty sure that Facebook AI Research has already thought about this (see this), but I am not sure other research orgs have arrived at the same architecture; I myself work at a research org, and to my knowledge this idea wasn't discussed or researched there at all.
Note: it seems to still be the case.

Research?

I think a good proof of concept for such a system would be a "Player" model-prompt that can interact with an AI-dungeon-like model as its context.
With this, we can test whether and how the model-prompt modifies its identity, whether (and how) it acts according to the taboos and personality stated in its Identity, and how consistently it uses different actions. This line of research could also incorporate testing of the Memory action (database access).
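A sketch of that proof of concept: one model plays the Player under a fixed Identity, another plays the dungeon and supplies the Context, and the transcript is kept for later analysis of identity drift and taboo violations. Both `player_complete` and `dungeon_complete` are hypothetical stand-ins for (possibly identical) completion APIs.

```python
def run_player_experiment(identity: str, player_complete, dungeon_complete, steps: int = 20):
    """Alternate between the dungeon model (producing world-state) and the Player
    model (producing actions), logging everything for later analysis."""
    transcript = []
    world = dungeon_complete("Describe the opening scene of a text adventure.")
    for _ in range(steps):
        prompt = f"# Identity\n{identity}\n\n# Context\n{world}\n\n# Action\n"
        action = player_complete(prompt)
        transcript.append({"context": world, "action": action})
        world = dungeon_complete(
            f"{world}\nThe player does: {action}\nDescribe what happens next."
        )
    return transcript
```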

I think that dataset generation with transformer models should be examined and tested to see whether it can produce the data required to get such an agent running; it would also be useful in the previous line of research.
Note: it seems that the capability to use tools simply emerges, if GPT-4 is any indication, though it is possible that they fine-tuned the model for tool use.
 


Alignment implications


I assign a very high (90%+) probability to the first weak AGI being built with this architecture, as I don't see any major obstacles (please comment if you do see them).
And I think that’s very good news for alignment.
Here’s why:
1. Reasoning and task execution in such agents will be visible. Of course, superintelligent agents could plausibly still execute unaligned actions covertly, but there are ways we can explicitly counter that! Humanity has been building text filters for a long time (a toy sketch appears after this list).


2. We can implicitly counter it! Yes, we don't know what's going on inside models in terms of mechanistic interpretability. But the way text models are being aligned right now actually matters here: the agent will still output text, and some part of that text will simply be executed as commands.

3. Speed. Yes, I know that Eliezer's speed analogy is actually an analogy about superintelligence. But still, these systems will be very slow compared to humans in the beginning. Inference (especially in the most capable systems) is far from real-time. There is market pressure toward lower inference costs and higher speed, yes. But we seem to be far from being able to run big models significantly faster than real time.
But we need to research these systems RIGHT NOW! I feel like we actually have a significant chance of aligning these systems before it’s too late.
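Because the agent's reasoning and commands pass through text, even a crude filter can sit between inference and the action submodules. This is only a toy sketch of the idea, with an invented blocklist; a real system would use a trained classifier rather than substring matching.

```python
FORBIDDEN_PATTERNS = ["rm -rf", "exfiltrate", "disable monitoring"]  # invented examples

def filtered_execute(model_output: str, execute) -> None:
    """Refuse to hand output to the action submodules if it trips the filter."""
    lowered = model_output.lower()
    if any(pattern in lowered for pattern in FORBIDDEN_PATTERNS):
        print("blocked by filter:", model_output)
        return
    execute(model_output)
```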


Final notes


I propose a name for such agents: ICA Simulacra.
I think significant resources should be spent right now on alignment research for this type of agent, as I assign a high probability to the first superhuman systems being arranged in the described way.
I'm doing my own independent research right now and will be publishing follow-up articles. I would appreciate any collaboration, employment, or grant opportunities for this research.

Submodules should be a separate and very challenging area of research -- they will be crucial for interfacing with the world. This should be left to capabilities folks, I'm sure.

Continuous inference also needs research, I think. I'm also sure that capabilities folks will figure this out.

I think at this point it's pretty safe to discuss and post this openly; my take is that we're not close enough to the compute, data, and context-window requirements for the kind of system I'm describing to work in dangerous ways.

This way we can lay the theoretical groundwork before any practical danger is present. Although my intuition is that this is roughly the point where this implementation is one Manhattan Project away from working, if it works at all.


 

Comments

This seems similar to Role Architectures, perhaps with the roles broken down into even more specialized functions.

I suspect lots of groups are already working on building these kinds of agents as fast as possible, for both research and commercial purposes. LangChain just raised a $10m seed round.

I'm less optimistic that such constructions will necessarily be safe. One reason is that these kinds of agents are very easy to build, especially compared to the work required to train the underlying foundation models - a small team or even a single programmer can put together a pretty capable agent using the OpenAI API and LangChain. These agents seem like they have the potential to exhibit discontinuous jumps in capabilities when backed by more powerful foundation models, more actions, or improvements in their architecture. Small teams capable of making discontinuous progress seems like a recipe for disaster. 

I agree. Do you know of any existing safety research of such architectures? It seems that aligning these types of systems can pose completely different challenges than aligning LLMs in general.