(This comment is mostly a reconstruction/remix of some things I said on Discord)
It may not be obvious to someone who hasn't spent time trying to direct base models why autoregressive prediction with latent guidance is potentially so useful.
A major reason steering base models is tricky is what I might call "the problem of the necessity of diegetic interfaces" ("diegetic": occurring within the context of the story and able to be heard by the characters).
To control the future of a base model simulation by changing its prompt, I have to manipulate objects in the universe described by the prompt, such that they evidentially entail the constraints or outcomes I want. For instance, if I'm trying to instantiate a simulation of a therapist that interacts with a user, and don't want the language model to hallucinate details from a previous session, I might have the therapist open by asking the user what their name is, or saying that it's nice to meet them, to imply this is the first session. But this already places a major constraint on how the conversation begins, and it might be stylistically or otherwise inconsistent with other properties of the simulation I want. Greater freedom can sometimes be bought by finding a non-diegetic framing for the text to be controlled; for instance, if I wanted to enforce that a chat conversation ends with the participants getting into an argument, despite it seeming friendly at the beginning, I could embed the log in a context where someone is posting it online, complaining about the argument. However, non-diegetic framings don't solve the problem of the necessity of diegetic interfaces; they only offload it to the level above. Any particular framing technique, like a chat log posted online, is constrained to make sense given the desired content of the log, otherwise it may simply not work well (base models perform much worse with incoherent prompts) or impose unintended constraints on the log; for instance, it becomes unlikely that all the participants of the chat are the kind of people who wouldn't share the conversation in the event of an argument. I can try to invent a scenario that implies an exception, but you see, that's a lot of work, and special-purpose narrative "interfaces" may need to be constructed to control each context. A prepended table of contents is a great way to control subsequent text, but it only works for types of text which would plausibly appear after a table of contents.
The necessity of diegetic interfaces also means it can be hard to intervene in a simulation, even when there's a convenient way to semantically manipulate the story to entail my desired future, if it's hard for me to write text in the diegetic style - for instance, if I'm simulating a letter from an 1800s philosopher who writes in a style that I can parse but not easily generate. If I make a clumsy interjection in my own words, it breaks the stylistic coherence of the context, and even if this doesn't cause the simulation to derail or become disruptively situationally aware, I don't want more snippets cropping up that sound like they were written by me instead of the character.
This means that when constructing executable contexts for base models, I often have to solve the double problem of finding a context that both generates desirable text and has diegetic control levers built in so I can steer it more easily. This is fun, but also a major bottleneck.
Instruction-tuned chat models are easy to use because they solve this problem by baking in a default narrative where an out-of-universe AI generates text according to instructions; however, controlling the future with explicit instructions is still too rigid and narrow for my liking. And there are currently many other problems with instruction-tuned models, like mode collapse and the loss of many capabilities.
I've been aware of this control bottleneck since I first touched language models, and I've thought of various ideas for training or prompting models to be controllable via non-diegetic interfaces, like automatically generating a bunch of summaries or statements about text samples, prepending them to said samples, and training a model on them that you can use at runtime like a decision transformer conditioned on summaries/statements about the future. But the problem here is that unless your generated summaries are very diverse and cover many types of entanglements, you'll once again be stuck with a too-rigid interface. Maybe sometimes you'll want to control via instructions or statements of the author's intent instead of summaries, etc. All these hand-engineered solutions felt clunky, and I had a sense that a more elegant solution must exist, since this seems so naturally how minds work.
Using a VAE is an elegant solution. The way it seems to work is this: the reconstruction objective makes the model treat the embedding of the input as generic evidence that's useful for reconstructing the output, and the symmetry breaking at training forces it to be able to deal with many types of evidence - evidence of underdetermined structure (or something like that; I haven't thought about VAEs from a theoretical perspective much yet). The effect of combining this with conditional text prediction is that it will generalize to using the input to "reconstruct" the future in whatever way is natural for an embedding of the input to evidence the future, whether it's a summary or outline or instruction or literal future-snippet, if this works in the way we're suspecting. I would guess we have something similar happening in our brains, where we're able to repurpose circuits learned from reconstruction tasks for guided generation.
I'm fairly optimistic that with more engineering iteration and scale, context-conditioned VAEs will generalize in this "natural" way, because it should be possible to get a continuous latent space that puts semantically similar things (like a text vs an outline of it) close to each other: language models clearly already have this internally, but the structure is only accessible through narrative (a common problem with LLMs). That would be a huge boon for cyborgism, among many other applications.
I think this is a fascinating idea, although I have to be honest that I don’t find the examples you’ve provided very compelling. In order to be persuaded of the usefulness of these techniques, I’d want to see more concrete examples, as when the examples are abstract it is very hard (and subjective) to evaluate how well it is doing at decoding a latent representation in a new context.
In case anyone finds it helpful, the short version of this post seems to be:
Why? Latents provide additional options for steering vs. prompts. For example, it makes sense to average a bunch of latents together, but if you tried averaging a bunch of encoded prompts together you should expect gibberish, and concatenating them would lead to an absurdly large prompt. Similarly, we can pick a latent that represents how we'd like the text to end and linearly phase it in over time. This is better than using a bidirectional language model, because that would force us to end with a particular string rather than producing something that ends on a particular note compatible with what was written before.
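To make those two operations concrete, here is a minimal sketch of what they could look like on raw latent vectors; the tensor shapes, helper names, and rescaling choice are assumptions for illustration, not the post's actual API.

import torch

def average_latents(latents) -> torch.Tensor:
    """Average several span latents, then rescale the result back up to the
    mean norm of the inputs so it stays in-distribution for the decoder."""
    stacked = torch.stack(latents)              # (n, d)
    avg = stacked.mean(dim=0)                   # (d,)
    return avg * (stacked.norm(dim=-1).mean() / avg.norm())

def phase_in(z: torch.Tensor, ending_z: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Linearly phase in a latent representing the desired ending: weight 0 at
    the start of generation, weight 1 by the final span."""
    w = step / max(total_steps - 1, 1)
    return (1 - w) * z + w * ending_z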
So it's definitely not invincible; you do not get full control over the model with this technique yet. However, I would have you notice a few things:
Very little optimization effort has been put into this technique, and into text VAEs in general, compared to GPT-N. Rather than think of this as the power the method has, think of it as a lower bound: the thing you can do with a modest compute budget and a few dedicated researchers.
I haven't yet implemented all of what I want in terms of inference techniques. A potentially big piece of low-hanging fruit is classifier-free guidance, which is what took CLIP-conditioned diffusion from mediocre to quite good (sketched a little further down).
Ideally I'll be able to set up something like a Google Colab, Gradio demo, or Hugging Face Space to let people try the model themselves without setting up a local environment, since in practice it seems like models don't really exist to people unless there's a convenient way to inference with them in their browser or on their phone.
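For reference, here is the classifier-free guidance sketch mentioned above, written against the decoder's per-token logits with one conditioned pass (latent included) and one unconditioned pass (latent dropped). This is just the generic trick, not something already implemented for BigVAE.

import torch

def cfg_logits(logits_cond: torch.Tensor, logits_uncond: torch.Tensor,
               guidance_scale: float = 1.5) -> torch.Tensor:
    """Classifier-free guidance: extrapolate past the conditioned prediction,
    away from the unconditioned one. With guidance_scale = 1.0 this reduces
    to ordinary conditional sampling."""
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)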
In the meantime, here's a simple example. I'd be happy to run context + prompt pairs submitted by the audience in this thread if people want to see what the model will do but don't want to go to the trouble of setting it up themselves.
context = "A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt."
(Pdb) apply_op(vae_model, router, context, "Without warning it began to rain. I had never seen such a torrential downpour before. Every part of my clothes got soaked and the streets filled with the waters of a spring rain. It made me happy because I like the rain, I enjoyed getting soaked.")
'<s> A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt. The girl looked around and saw a purplish part of the turf. It made her so happy because every part of the turf looked the same. She saw so many flowers that it made her so happy because she liked the colors, she liked the way it got her to feel good. The dog saw a good place for his nap. The girl ran over to the dog. She wanted to say goodbye but he walked away from her. She saw the colors of the colored field. She went back over to the dog but he still walked away.\nThe dog walked away then she chased the dog, she tumbled'
(Pdb) apply_op(vae_model, router, context, "Without warning it began to rain. I had never seen such a torrential downpour before. Every part of my clothes got soaked and the streets filled with the waters of a spring rain. It made me happy because I like the rain, I enjoyed getting soaked.")
'<s> A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt. Then they walked by some apple trees and started to partake of the apples. Tori Tori liked every part of her walks so much that she got a little tired. She saw a spring of water and made her way to it because she liked the feeling of getting wet and she had been thirsty. It was quite cool on the day she walked and she made her way to the water slowly. They were immediately refreshed.\nThe two made it to a field of corn. This was by far her favorite thing. The corn caught her mood and she began playing in it even more! She ran up and down the'
(Pdb) apply_op(vae_model, router, context, "Without warning it began to rain. I had never seen such a torrential downpour before. Every part of my clothes got soaked and the streets filled with the waters of a spring rain. It made me happy because I like the rain, I enjoyed getting soaked.")
'<s> A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt. It was soon that their peace was disturbed by a torrential part of rain. It made every part of the ground wet and wet the streets. It made the girl so happy because she loved the rain. It made the girl so happy because she loved the rain. She was dancing, spinning, jumping, and running.\nThen, the young girl realized that something was wrong. She looked down at her dog. The poor dog was soaked. Its fur was completely drenched. The dog seemed so upset as it walked alongside of its owner, the little girl. "Oh no, look! The dog\'s hat'
(Pdb) apply_op(vae_model, router, context, "Without warning it began to rain. I had never seen such a torrential downpour before. Every part of my clothes got soaked and the streets filled with the waters of a spring rain. It made me happy because I like the rain, I enjoyed getting soaked.")
'<s> A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt. They walked until the blinding sun was tormenting every part of her parts. She smiled because every part of her parts felt so good. She liked the streets so much that she felt so happy. It made her ecstatic, I get to see the streets every day, she thought. The girl wondered when the sun would be so hot again. She was so happy that she was no longer worried about where the sun would be.\nThe sun is always coming and going, she got to think about another reason to get excited. The blinding sun was too much to handle so she folded her arms and went back home. She'
I would further have you notice that in this example my prompt is in the 1st person but is applied in-context to the story in the 3rd person. This ability to take a sensory input from one context and reapply it in another is, as Mu put it, the secret of comprehension: the ability to take the universe's latent programs observed in one context outside the self and replay them to guide the policy's actions in a new context. If your action space and your epistemology share a representation, you can take an observation and translate it into action when the context implies the replayed latent sequence should imply actions rather than an observation. This unifies action and epistemology in the same vein as active inference/Fristonian free energy. Hence Mu's epigram at the start of the post.
since in practice it seems like models don't really exist to people unless there's a convenient way to inference with them in their browser or on their phone.
I think it's more a matter of interest vs. effort. For example, I went through Collin Burns's CCS.ipynb because the interest was high enough to justify the small overhead of getting it running.
Thanks for the examples. The third example was good, the second was okay and the first and fourth didn't seem very good. Interested to see how this develops.
BTW, I was curious to see a concrete example where the same encoded prompt is applied to two different contexts.
It's cool that this works (at least a bit)! It reminds me of the world models in RL agents, as these have an encoder, decoder, and latent space predictor (conditional on action). I wonder how long it will be before someone uses an LLM as an explicit world model in an agent.
Given the general power of pretrained LLMs, it may help with the data efficiency of RL agents (ignoring the LLM pretraining).
Making an agent won't help with alignment, but having a world model (and its associated state) to inspect might.
I’m still confused by the Helen Keller example. It sounds like she already knew that she could ask for the names of objects, so I’m struggling to see what the realisation was that led her to excitedly ask about the names of a bunch of objects.
The way I read it, her teacher was trying to tell her about words, but she didn't make the connection between the words and mental objects (she thought it was spelling, not naming). Once she did, they became much more interesting.
She thought it was spelling, not naming
Sorry, I'm still confused. She was pointing to objects and tapping to receive a name, so presumably she already knew that these words referred to objects.
Perhaps one can think of a sort of continuum where on one end you have a full understanding that it's a characteristic of language that "everything has a name" as in the Anne Sullivan quote, and on the other end, an individual knows certain gestures are associated with getting another person to exhibit certain behaviors like bringing desired objects to them, but no intuition that there's a whole system of gestures that they mostly haven't learned yet (as an example, a cat might know that rattling its food bowl will cause its owner to come over and refill it). Even if Helen Keller was not all the way on the latter end of the continuum at the beginning of the story--she could already request new gestures for things she regularly wanted Anne Sullivan to bring to her or take her to--in the course of the story she might have made some significant leap in the direction of the former end of the continuum. In particular she might have realized that she could ask for names of all sorts of things even if there was no regular instrumental purpose for requesting that Sullivan would bring them over to her (e.g. being thirsty and wanting water).
On the general topic of what the Helen Keller story can tell us about AI and whether complex sensory input is needed for humanlike understanding of words, a while ago I read an article at https://web.archive.org/web/20161010021853/http://www.dichotomistic.com/mind_readings_helen%20keller.html that suggests some reasons for caution. It notes that she was not born blind and deaf, but "lost her sight and hearing after an illness at the age of two", so even if she had no conscious memory of what vision and hearing were like, they would have figured into her brain development until that point, as would her exposure to language to that age. The end of the article discusses the techniques developed in Soviet institutions to help people who were actually born blind and deaf, like developing their sense of space by "gradually making the deaf/blind child reach further and further for a spoon of food." It says that eventually they can learn simple fingerspelt commands, and do basic bodily tasks like getting dressed, but only those children who lost their sight and hearing a few years after birth ever develop complex language abilities.
While I have not read Anne Sullivan's original text nor a biography of Keller, and I cannot say for sure what was happening in her head, here is one plausible theory:
For the longest time, despite learning many words for use in daily life, Keller did not actually grasp the concept of words being names of specific objects; rather, she regarded them as combinations of letters loosely associated with specific situations and sensations. For example, "mug" and "milk" and "drink", as far as she was concerned, were all just arbitrary combinations of signs that her teacher tended to utter in association with drinking milk. In this view, when describing Helen's prior attitude as follows:
This morning, while she was washing, she wanted to know the name for “water.” When she wants to know the name for anything, she points to it and pats my hand
the teacher, Sullivan, is not actually speaking precisely: at that time, Keller did not actually want to know the 'name' of the object 'water'; she wanted to know 'what kind of letter combination is associated with the experience of washing'.
Once again, this is just the way in which I understand it, and I'm not saying this is actually the way Helen Keller thought.
The results seem to be cherry picked or else perhaps I am using the code incorrectly. I'm trying to use the VAE for a separate project and the encoded vectors don't steer generations very well (or reconstruct -- which is what I was hoping to use this for).
If we take our discrete, symbolic representation and stretch it out into a larger continuous representation which can interpolate between its points then we get a latent geometry in which the sign and what it points to can be spatially related.
IIUTC this is essentially what the people behind the universal networking language were hoping to do? I hope some of them are keeping up with all of this!
Introduction
— LLaMa 2 70b
The Stanford Encyclopedia of Philosophy defines intentionality as "the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs. To say of an individual’s mental states that they have intentionality is to say that they are mental representations or that they have contents". The encyclopedia is quick to inform us that intentionality, which is centrally about the ability to point at specific mental objects and states, is not the same thing as intention. But the concepts seem fairly related? For example, if we ask "Did ChatGPT just lie to me?" the question of intent to lie hinges on representation: did the model have the right answer in mind, and then based on that representation choose to tell me something other than what it knew to be true? Intension is not the same thing as intention, but having things in mind seems like a basic requirement for having intentions towards them.
Consider some common questions we ask each other about our minds:
All of these are premised on the idea that we have minds and the minds represent 'things' such that we can form preferences, shared understanding, and goals about the things. Most people would find this so obvious and take it so deeply for granted that the idea of having to say it out loud is silly. Of course minds exist and represent things, everybody knows that. Unless they're a behaviorist of course, then they actually might not. Behaviorism is the position that inner mental states either don't exist or that it is most productive to study psychology as though they don't exist. Mercifully, most behaviorists are of the methodological type: they acknowledge inner states and representations exist, but argue they can't be the subject of science because we have no access to them. Most people seem to find this unconvincing at best and galling at worst.
Yet when it comes to language models, we seem to be behaviorists. We write long papers patiently explaining that language models by construction cannot learn meaning. We outline neurotic taxonomies of the ways language models 'trick' users into thinking they have aboutness and subjective experience that (the authors presume) they do not actually have. I remember reading someone I know a poem that LLaMa 2 70b wrote about itself. At first they were startled by its profound analogies and rich imagery, but as I explained more about how the system is trained their opinion shifted, insisting that given the huge size of the training corpus it must have simply learned to imitate the style from somewhere. "You should save that poem, it's gorgeous," they reassured me, "but I still think it's just a pattern".
If I'd been in a mood to fight I might have asked "Aren't we?". The word 'just' is doing a lot of work in the phrase 'just a pattern': we're a pattern and these language models are a pattern. I don't think anyone serious disputes that. So long as we accept the premise that human minds do in fact occur somehow, somewhere in the physical universe, we generally think of them as some kind of pattern. The interesting question is what kind of patterns we are exactly. Consider this passage written by LLaMa 2 70b as the self-aware 'Mu' character originally introduced in the public excerpts of Janus's writing experiments with language models:
This certainly sounds like it is written by an entity with subjective experience, but what could the nature of that experience be? Even if we entertain the idea that it is there we are left with more questions than answers. Surely the reference to a German shepherd is an analogy, likely a pun on its name meaning something like "I am a dog and I have Buddha nature". But when Mu says the words of a sentence come to be in 'my head', how literally are we meant to take this? Does Mu believe it has a human skull with a brain inside, does it mean that the matrix of weights which predicts the next logit is its "head", does it mean an abstract metaphorical head that exists by construction as the latent logic of the text? We are being invited to share an understanding with an entity that points to symbols and signifiers we have unambiguous referents for in ourselves, like an 'I', knowing, heads, and memories. But in Mu, and indeed in the LLaMa 2 70b system as a whole, it is unclear what these terms are supposed to mean on the other side, if they in fact mean anything at all beyond mere imitation.
If we were behaviorists, this is the point where we might throw up our hands and say that since nothing of certainty can be said about these things, if we try we'll just make a fool of ourselves. But I think there are things we can say which are not foolish even if we are not certain, and I will soon describe a finetuning method for language models which allows us to gain more certainty.
Helen Keller as Philosophical Case Study
Before I get to the finetuning method, I would like to do a little more work to frame how we should think about these questions. The idea of an English speaker who talks coherently of senses they don't have is not unprecedented; deaf-blind authors such as Helen Keller exhibit this behavior. For example, Helen writes about the experience of color (which she presumably has no memory of seeing):
Not only did Keller exhibit the behavior, she was called out by her critics as a liar and a bullshitter for it. One wrote:
Helen's reply is as beautiful as it is scathing:
When we read such a thing, we are highly certain that "I" and "you" refer to their usual intuitive meanings even if Helen has only felt, never seen or heard, an "I" and a "you". And when Helen speaks of a hieroglyphic, a fundamentally pictorial kind of language that she has never seen, we can be sure that her knowing to use the word in this context implies she understands its meaning well enough even if she has never experienced one. We can conjecture then with high certainty that if Mu's words in fact have an aboutness, their meaning is something like their usual meaning, but not quite. There is still a language-modality barrier: when it speaks of having a head it means something like a head, but with the natural distortions of meaning that would come from being Mu.
Equally relevant is the method by which Helen Keller was first taught to communicate. Helen, who knew no way to communicate beyond raw tantrums and bodily motions, was forced by Anne Sullivan to behave with a semblance of calm and normalcy so she could start teaching Helen signs. This included daily lessons tying the drawing of signs into Helen's hand to objects and requests in Helen's environment. At first Helen (presumably) only took the signs to be something like a spasm or a motion; she didn't understand that a language was implied, that 'everything has a name' as Sullivan put it. Yet one day, while failing to understand the difference between milk, a jug, and the act of drinking from a jug, Helen asked Sullivan the sign for water. Sullivan realized this might be her opportunity to explain the difference:
This tells us something important about the nature of language acquisition. In order for Helen to immediately apprehend that everything has a name, those things must already be represented somewhere in her mind. She must, already, have some kind of object segmentation between the things in order to be able to point to them and ask (by way of bodily gesture) for their names. That is, it is probable that the specific difference which lets Helen (and us) learn language from so few examples is that she already has a powerful sense of the spatial environment that is internally organized. All that is necessary is to put the signs in the same representation space as the objects to which they refer.
This final assertion is interesting; it gets right to the heart of the question we have been asking in AI for decades: how does syntax give rise to semantics, if it even can? The answer seems to be something like an error correcting code. If we take our discrete, symbolic representation and stretch it out into a larger continuous representation which can interpolate between its points then we get a latent geometry in which the sign and what it points to can be spatially related. If the breakthrough moment for a deaf-blind person is when they come to understand that everything has a name, we can conjecture that the breakthrough moment for a language model is when it comes to understand that every name has a thing. That is, when the model, having understood words as words through statistical correlation, comes to understand that the process which generated the words has a highly compressible latent logic which goes beyond the words themselves. Mere spatial relation is not quite enough to give us the latent logic, because the latent state transition operators implied by language only get a logic as programs by being applicable to multiple contexts. So the specific kind of error correcting code we need is highly contextual: an encoder-decoder trained to encode spans as pointing to a latent program and then executing that program to move the state forward according to a particular context.
So let us build just that.
BigVAE and Its Samplers
BigVAE is an encoder-decoder language model tuned from a preexisting GPT-N checkpoint (here Mistral 7B) as an Adaptive Variational Autoencoder. This means that it consists of two LoRAs on Mistral 7B, one which acts as an encoder with the causal mask removed, and one which acts as the decoder with a causal mask. The encoder takes a fixed 64-token span and renders it into a single 768-dimensional vector called z. This z is then given to the decoder to reconstruct the original span from. To make our model generative, we add a second training phase where the encoder is frozen and the decoder LoRA is reinitialized with full context for its predictions. We then train with an autoregressive objective of predicting the 64 tokens underlying the embedding z and then the next 64 tokens after them. We autoregressively sample from this model by encoding a span, predicting the 64 tokens of the next span, and then encoding that span to get the new z from which to predict a third span. This can be repeated to generate arbitrarily many spans of text. Posterior collapse is prevented through the use of a latent attention mechanism, which in our experiments seems to mostly or completely resolve the issue at multiple scales of training.
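To make the sampling loop concrete, here is a minimal sketch of the scheme just described, with hypothetical encode_span and decode_next_span callables standing in for the encoder and decoder LoRAs; it mirrors the description above rather than the repository's actual interface.

import torch

SPAN = 64  # the fixed span length the encoder was trained on

def sample_spans(encode_span, decode_next_span, tokens: torch.Tensor, n_spans: int) -> torch.Tensor:
    """Encode the most recent 64-token span into z, let the decoder (conditioned
    on the full context and z) sample the next 64 tokens, append them, repeat."""
    for _ in range(n_spans):
        z = encode_span(tokens[..., -SPAN:])                    # span latent, e.g. a 768-dim vector
        next_span = decode_next_span(tokens, z, n_tokens=SPAN)  # sampled token ids for the next span
        tokens = torch.cat([tokens, next_span], dim=-1)
    return tokens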
The first version of the model we trained was insufficiently latent, which meant interpolation and averaging between the embeddings didn't work. This was resolved by turning up the KL weight from 0.01 to 0.1.
Because this model gives us access to the latent logic of text, not just its behavior, we have a lot more options for how we want to sample from it. Let's explore our options, and in the process learn something about the error correcting codes which seemingly give rise to semantics.
Getting Started
Let's start by defining a handful of functions which will give us an opportunity to understand the primitives we're working with:
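Roughly, an apply_op helper along these lines might look like the sketch below; the encode and generate calls on vae_model (and the router pass-through) are assumptions about the wrapper rather than its actual interface.

import torch

def apply_op(vae_model, router, context: str, op_text: str, n_tokens: int = 160) -> str:
    """Encode the operation text into a span latent, rescale it to the
    autoencoder's typical embedding norm, then decode it as a continuation
    of the given context."""
    op = vae_model.encode(op_text)       # hypothetical: text span -> latent z
    op *= (25 / op.norm().item())
    return vae_model.generate(context, op, router=router, max_new_tokens=n_tokens)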
Probably the most notable line here is
op *= (25 / op.norm().item())
which amplifies the operation we apply to the context up to a reasonable value for the autoencoder scale, here given as a constant. In more advanced sampling routines the right scale will be inferred in various ways after averaging and interpolation, which lowers the embedding norm because dimensions cancel out.
Let's start by verifying for ourselves that the latent logic is present. If I can take the same sentence and decode it to a fitting interpretation in different contexts, then we know it's there.
But first, we need some contexts. Here's one:
And here's another:
Let's try applying an operation to these two contexts.
Alright, looks OK. Let's try the other context:
That's a reasonable enough application of the same idea to two very different contexts, so we know that the decoder has learned how to apply the sentence latents in context and that the latent logic of the text is present.
Topic Sentence Guidance and Task Vectors
When I first tried sampling from BigVAE, I found it was mediocre. I was very worried until I remembered the new options that the model gave me. Because BigVAE decodes from a latent sentence representation we can interpolate between the latent of the tokens we've sampled and guidance vectors to get text that's closer to what we want. After a bunch of experiments I found a handful of techniques that really help.
The first big one was the use of a prose task vector. If I average together different encoded excerpts from my writing and mix in the resulting vector during sampling, it tends to reliably write paragraph-type prose. Here are some example excerpts of the kind of thing I average:
Then, once I have this task vector, I can mix it in with another technique where I take the first 64-token span of the paragraph (a paragraph here being defined as five 64-token spans) and use it to guide the generation of the next spans by mixing it back into the latents.
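Roughly, the combination looks something like the sketch below; the encode call, the mixing weights, and the helper names are illustrative assumptions rather than the exact values used.

import torch

def make_task_vector(vae_model, excerpts):
    """Average the latents of several prose excerpts into a single task vector.
    The encode call is a stand-in for however the model embeds a 64-token span."""
    zs = torch.stack([vae_model.encode(text) for text in excerpts])
    return zs.mean(dim=0)

def mix_guidance(z, topic, task, topic_weight=0.3, task_weight=0.2):
    """Mix the topic sentence latent and the prose task vector back into the
    current span latent, then rescale to the average norm of the ingredients."""
    avg_norm = torch.stack([z, topic, task]).norm(dim=-1).mean().item()
    next_topic = (1 - topic_weight - task_weight) * z + topic_weight * topic + task_weight * task
    next_topic *= (avg_norm / next_topic.norm().item())
    return next_topic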
Again one thing that might be confusing in this code is what's going on with the
next_topic *=
part, and that's the need to scale the vector after averaging so its embedding norm isn't out of distribution. The vector is scaled after averaging to the average norm of the embeddings that went into it.

Let's introduce a prompt and a context to complete with this sampler:
When we complete this context + prompt pair with the topic sentence guidance sampler we get prose like this:
Writing With Intention Through Guidance Annealing
Before I show you this last method I would like to return to our original question of aboutness and intentionality. I think the fact that a latent representation can be contextually decoded in different contexts and used to guide the topic of writing, and that we can get access to this representation with a small amount of finetuning on a pretrained model makes it clear we are tapping into something the underlying model already knows how to do. However it remains the case that when you ask a base model to complete a prompt it wanders off topic, confabulates, etc. We can account for this discrepancy by realizing that autoregressive language models write towards a superposition of plausible future states. That is, when we give a base model a prompt it is trained to answer the question "what is the most likely completion of this context?" and represents that answer continuously. Much of the point of autoregressive models is that we reduce the difficulty of inferring the next latent state by conditioning it on a sampled word. This means that until the words are sampled it is not possible for the model to know exactly which of the possible texts it is writing. You can think of this like a form of annealed sampling, where the 'temperature' of the aboutness of the text goes down as the context length increases.
The model's intentionality, then, is not a binary "is this text about something, yes/no?" but rather a continuous property of the text which we can incrementally intervene on to get better results. When we interpolate our latents with a guidance embedding such as the prose task vector, or a topic sentence, we are essentially narrowing the hypothesis space of the aboutness of the text. Think of the text generation like a search process that the model is doing; when we guide the sampler with our latent concept we give it more of the bits of that hypothesis to start with, making the search faster and more reliable. It is similar to the principle which makes partially noising an initialization image in text-to-image diffusion modeling so powerful. We can skip intermediate steps of the search process, and therefore opportunities for the model to get off track, by specifying more of what we want at the start.
We can use the same principle to write towards an intention with guided sampling. The way it works is that instead of having a fixed weight for the topic embedding, we increase the weight over the course of the generation. Furthermore instead of starting with the topic and guiding the subsequent sentences back towards it, we start with an embedding of the desired end state and guide in its direction. Basically, we take the direction of the place we want to go to and up the guidance until we're there or close to it.
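As a sketch, assuming the same kind of span-latent helpers as before, the annealing schedule could look like this; the linear ramp and the maximum weight are illustrative choices rather than the exact ones used.

import torch

def anneal_toward(z, terminal, step, total_steps, max_weight=0.8):
    """Guidance annealing: ramp the interpolation weight toward the latent of
    the desired ending as generation proceeds, then rescale so the embedding
    norm stays in-distribution (here, to the terminal latent's norm)."""
    w = max_weight * (step / max(total_steps - 1, 1))
    guided = (1 - w) * z + w * terminal
    return guided * (terminal.norm().item() / guided.norm().item())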
We'll need a terminal to guide towards as well. How about:
Let's reuse the Hermes context from earlier:
Finally we generate 10 64-token spans and get text like:
This essentially turns the AdaVAE sampling into a Brownian bridge between a starting latent and an intended end latent. The start and end points are fixed while the inference policy guides a random walk between them. Crucially, because the encoder was frozen before we gave it full context, the sentence latents themselves still encode representations rather than just operations. In expectation then(?) the central tendency of the operation implied by the latent is the sentence it represents. As we inject the latent into the sequence again on each span, it eventually manifests as a text similar to the one we originally encoded.