Hi! Co-author of the linked “exploration” here. I have some reservations about the exact request (left as a separate comment) but I’m very excited about this idea in general. I’ve been advocating for direct spending on AI research as a place with a huge ROI for alignment research for a while and it’s very exciting to see this happening.
I don’t have the time (or aptitude) to produce a really high quality dataset, but I (and EleutherAI in general) would be happy to help with training the models if that’s desired. We’d be happy to consult on model design or training set-up, or to simply train the models for you all. No compensation necessary, just excited to contribute to worthwhile alignment research.
What is the purpose of requesting such extremely long submissions? This comes out to ~600 pages of text per submission, which is extremely far beyond anything that current technology could leverage. Current NLP systems are unable to reason about more than 2048 tokens at a time, and handle longer inputs by splitting them up. Even if we assume that great strides are made in long-range attention over the next year or two, it does not seem plausible to me that SOTA systems in the near future will be able to use this dataset to its fullest. There's inherent value in a more diverse set of scenarios, given the strong propensity of language models to overfit on repeated data. While this isn't strictly a matter of repeated data, I am under the strong impression that having more diverse short scripts is going to train a much better model than less diverse long scripts, assuming that the short scripts are still at or beyond the maximum context length a language model can handle.
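(To make the context-window point concrete, here is a rough sketch of what "splitting them up" means in practice; the numbers are illustrative choices of mine, not anyone's actual pipeline.)

```python
# Rough sketch: a very long run gets cut into fixed-size windows, so no single
# training example ever sees more than ~2048 tokens of the story at once.

def split_into_windows(token_ids, max_len=2048, stride=1024):
    """Return overlapping windows of at most max_len tokens."""
    windows = []
    for start in range(0, max(1, len(token_ids) - max_len + 1), stride):
        windows.append(token_ids[start:start + max_len])
    return windows

# A ~600-page run is on the order of 300k words (very roughly 400k tokens),
# i.e. hundreds of windows, each seen by the model independently of the rest.
example_run = list(range(400_000))  # stand-in for a tokenized run
print(len(split_into_windows(example_run)))
```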
For the same reasons it is challenging to leverage, I think that this will also be very challenging to produce. I think that changing the request to 100 different 6 page (10 step) or 10 different 60 p...
Answer 1: Longer is easier to write per-step.
Fitting a coherent story with interesting stuff going on into 100 steps is something I expect to be much harder for a human author than fitting that story into 1000 steps. Novels are famously easier to write on a page-level basis than short stories.
If you take zombies attacking a magical academy for 1000 steps, you might get something that looks like a coherent quest. If you take zombies attacking a magical academy for 100 steps, I think you get something that looks like a quest that was just getting started when the dataset ran out... unless the author has somehow carefully figured out a plot that will, given unknown user actions, get somewhere interesting within 100 steps, which I imagine is much harder for the author; they can't just pick a premise, run with it, and make stuff up as they go along...
1: I expect that it's easier for authors to write longer thoughtful things that make sense;
I pretty strongly disagree. The key thing I think you are missing here is parallelism: you don't want one person to write you 100 different 600-page stories, you want one person to organize 100 people to write you one 600-page story each. And it's a lot easier to scale if you set the barrier to entry lower. There are many more people who can write 60-page stories than 600-page stories, and it's easier to find 1,000 people to write 60 pages each than it is to find 100 people to write 600 pages each. There's also much less risk on both your side and theirs. If someone drops out halfway through writing, you lose 30 pages, not 300.
Based on this comment:
I state: we'd be happy, nay, ecstatic, to get nice coherent complete shorter runs, thereby disproving my concern that short runs won't be possible to complete, and to pay for them proportionally.
I'm now under the impression that you'd be willing to pay out the 20k for 10 runs of 100 steps each (subject to reasonable quality control) and bringing that about was my main goal in commenting.
The other major worry I have about this pitch is the experimen...
In wake of the censorship regime that AI Dungeon implemented on OpenAI's request, most people moved to NovelAI, HoloAI, or the open source KoboldAI run on colab or locally. I've set up KoboldAI locally and while it's not as featureful as the others, this incident is another example of why you need to run code locally and not rely on SaaS.
For background, you could read 4chan /vg/'s /aids/ FAQ ("AI Dynamic Storytelling"). For a play-by-play of Latitude and OpenAI screwing things up, "Remember what they took from you" has the history of them leaking people's personal stories to a third-party platform.
This looks like it may be an existing collection of works annotated in a similar way: https://www.amazon.com/Star-Wars-Screenplays-Laurent-Bouzereau/dp/0345409817
How do you think this project relates to Ought? Seems like the projects share a basic objective (having AI predict human thoughts had in the course of solving a task). Ought has more detailed proposals for how the thoughts are being used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for being relevant).
So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.
As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.
This challenge seems incredibly intriguing and well put-together, and I certainly admire how a million dollars is being used to improve DnD-specific AI!
I believe a small team (3-4) of dedicated writers, coordinating with each other online, has a genuine shot at writing a quality Story Arc quickly and bagging $20,000, to be split proportionally to work. I have ideas about how to streamline the process of multiple people working on one single Annotated Dungeon Run, and think we can really expedite this process in a number of significant ways. If interested, please contact me through my account on LessWrong; we can swap drafts, talk about writing commitments, and get a good feel for fit before committing to anything.
Also, to those with the enterprise to attempt to scale the project up: I believe I can find around 10-20 people (given enough notice) with considerable storytelling ability who are willing to write annotated dungeon runs as a part-time occupation. I have unique access to exclusive TTRPG internet communities and artsy university students looking for casual work over the summer, and I would be happy to find and direct them to you. If you want to work out a deal, also message me on LessWrong.
I have some ideas about how to scale up and expedite this project, and am happy to help bounce ideas around.
For anyone who may have the executive function to go for the $1M, I propose myself as a cheap author if I get to play the dungeon master role, or the player role, but not if I have to do both. I recommend taking me for the dungeon master role. This sounds genuinely fun to me. I would happily do it for a dollar per step.
I can also help think about how to scale the operation, but I don’t think I have the executive function, management experience, or slack to pull it off myself.
I am Ronny Fernandez. You can contact me on fb.
I'm setting up a place for writers and organizers to find each other, collaborate, and discuss this; please join the Discord. More details in this comment.
I similarly offer myself as an author, in either the dungeon master or player role. I could possibly get involved in the management or technical side of things, but would likely not be effective in heading a project (for similar reasons to Brangus), and do not have practical experience in machine learning.
I am best reached through direct message or comment reply here on LessWrong, and can provide other contact information if someone wants to work with me.
I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.
Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.
When inserting this metadata for pretraining, we made sure to do so completely randomly, i.e. a book title might be inserted anywhere within a book (maybe several times in different context windows etc). We added separate <META_START>...
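(To illustrate the idea, here is a minimal sketch; the marker-token names and the insertion scheme are placeholders of mine, not necessarily the ones used in that experiment.)

```python
import random

# Hypothetical marker tokens; the actual token names in the experiment may differ.
META_START, META_END = "<META_START>", "<META_END>"

def insert_metadata(text: str, metadata: str, rng: random.Random) -> str:
    """Insert metadata (e.g. a title, URL, or blurb) at a random word boundary,
    wrapped in marker tokens, before the text is chunked for pretraining."""
    words = text.split()
    pos = rng.randrange(len(words) + 1)
    return " ".join(words[:pos] + [META_START, metadata, META_END] + words[pos:])

rng = random.Random(0)
print(insert_metadata("once upon a time in a dungeon far away ...", "Title: The Dungeon", rng))
```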
I initially tried doing post-hoc annotation and found it much more difficult than thinking my own actual thoughts, putting them down, and writing the prompt that resulted. Most of the work is in writing the thoughts, not the prompts, so adding pregenerated prompts at expense of making the thoughts more difficult is a loss.
Hey, I wanted to clarify my thoughts on the concrete AI problem that is being solved here. No comment on the fantastic grant making/give-away scheme.
I don't have much expertise on the mechanisms of GPT-3-style systems, but I wonder if there is a more efficient way of providing human-comprehensible intermediaries that expose the workings of the algorithm.
My worry is that many of the annotated thoughts imputed by authors are irrelevant to the actual design process the AI goes through to create its output. Asking the machine to produce a line of 'thoughts' alongside its final statement is fair play, although this doesn't seem to solve the problem of creating human-comprehensible intermediaries, but instead gives the AI a pattern-matching/prediction task similar to what it goes through to create the original output. Wouldn't it be the case that the 'thoughts' the machine creates have no more effect on the process of calculation than the original output (prompt)?
This process still seems to serve a rudimentary function of indirectly shedding more light on the processes of calculation, much the same as a bigger prompt would. Yet puzzlingly, we in fact want to "get differe...
I think you're essentially correct - but if I understand you, what you're suggesting is similar to Chris Olah et al's Circuits work (mentioned above in the paragraph starting "This sort of interpretability is distinct..."). If you have a viable approach aiming at that kind of transparency, many people will be eager to provide whatever resources are necessary.
This is being proposed as something different, and almost certainly easier.
One specific thought:
but my intuition suggests this would limit the complexity of the prompt by shackling its creation to an unnecessary component, the thought
To the extent that this is correct, it's more of a feature than a bug. You'd want the thoughts to narrow the probability distribution over outputs. However, I don't think it's quite right: the output can still have just as much complexity; the thoughts only serve to focus that complexity.
E.g. consider [This will be a realist novel about 15th century France] vs [This will be a surrealist space opera]. An output corresponding to either can be similarly complex.
We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:
https://docs.google.com/document/d/1Wsh8L--jtJ6y9ZB35mEbzVZ8lJN6UDd6oiF0_Bta8vM/edit
Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20 / step for partial runs that meet our quality bar (up to at least the first 5,000 total steps we're sent), so the partial run here will receive $5320 from the prize pool (though the final version will presumably be much longer and receive more; we expect a completed run to be about 1000 steps).
Vanilla_cabs is open to doing paid consultation for anyone who's working on this project. So if you want feedback from someone who understands our quality bar and can demonstrably pass it, contact Vanilla_cabs via their LessWrong profile.
I can't tell if it is purposeful that this is set up in an adversarial, winner-take-all kind of way. It's really off-putting to me, and seems to encourage everyone being out for themselves rather than collaborating, particularly for such an inherently collaborative product. Maybe Nate and Eliezer just expect cooperation to fail?
Anyways, if people DO want to attempt some kind of collaboration... EDIT- Don't join my Facebook group, join plex's Discord linked in the comment below instead
We pay out $20,000 per run for the first 10 runs, as quality runs are received, not necessarily all to one group. If more than one group demonstrates the ability to scale, we might ask more than one group to contribute to the $1M 100-run dataset. Them cooperating with each other would hardly be a problem. That said, a lot of the purpose of the 10-run trial is exactly to locate executives or groups that can scale - and maybe be employed by us again, after the prize ends - so everybody getting together to produce the first 10 runs, and then disbanding, in a process that doesn't scale to produce 100 runs, is not quite what we are hoping for here!
It seems to me that their priority is to find a pipeline that scales. Scaling competitions are frequently long-tailed, which makes them winner-take-all. A winner-take-all system has the bonus benefit of centralized control: they only have to talk to a small number of people. Working through a single distributor is easier than wrangling a hundred different authors directly.
I am very excited about finding scalable ways to collect large volumes of high-quality data on weird, specific tasks. This seems very robustly useful for alignment, and not something we're currently that good at. I'm a bit less convinced that this task itself is particularly useful.
Have you reached out to e.g. https://www.surgehq.ai/ or another one of the companies that does human-data-generation-as-a-service?
Random small note - the 'dungeon' theme is slightly... culturally off-putting? or something, for me, as someone who's never been into this kind of thing or played any of these, and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander, like a story, play, or AI assistant setting, could be better?
IDEAS THREAD:
Team up with friends who already play DnD or write glowfic. Less scalable but can grab the $20k.
Similarly, if you're unemployed/ have lots of free time just sit down and write it yourself.
Recruit from a local University. This can be very scalable if you e.g. know the creative writing professor.
Recruit from roleplaying groups or online roleplaying forums. Requires a bit more filtering than the above.
Recruit from Fiverr or similar. Requires lots of initial filtering but can end up with a low price. Create a series of increasingly less automated tasks as a filter (e.g. start with a multiple-choice quiz that's automatically graded).
Ask a person who already does this kind of thing how they would go about it.
I don't want to name names publicly here, but post on BR, or talk to MR to use his team.
Use the volunteers who are posting here.
Show this post to a whole bunch of people who you think might want to grab the $20k as individuals. Let them know that if enough of them make the $20k thing that you will all team up to churn out the $1m thing, split proportionally.
My questions are mostly about the player side, and about how deeply the DM should model the player:
When studying the provided 30-page thought-annotated sample, I thought about the <Yo be real> command a little more. In my opinion it should be applied in the training data a little differently than how it's done. Here are my thoughts:
In the sample, there are some places where the authors carefully tried to construct “AI nonsense” that matches what we regularly see in current-tech AI dungeon prompts. The player then responds with “<Yo be real>” plus some explanation of what the AI did wrong.
(obvious example: page 17 in this sample: https://docs.google.com/document/d/1PosMUaminpsR6_czFXBBlCrzMrsDGomajgLp6Y7q4Yw/edit)
This is probably intended for re-training the current models to accept such “<Yo be real>” sudo commands and deal with them correctly. You can’t include those (in a sensible way) in training data without having the DM first make mistakes.
I see a problem here, though:
A neural network learns every reoccurring feature of the training data set. If the training data often contains erroneous thoughts leading to nonsense prompts, this is what the AI will learn. You probably don’t want that. You want a model that makes such mistakes as rarely as possi...
I find this project very interesting and thought a lot about it in the last 2 weeks. The way I understand the main goal of the project is the following:
Not sure if it was suggested already or not, but one option is to look for “let's play” style videos for some game (gonna be hard to find one that's simple enough, probably) and take the spoken text the YouTuber says as thoughts. Some of them already have the transcript as subtitles.
In the same vein, look for people who explain their choices in games with very clear decisions, like chess. I once saw a booklet of chess games where the actual player explained most of his moves. If there is a way to get many of those, that might work.
The problem with commentary not made by the players themselves is that, as far as I understand it, the project wants the general thoughts of the player and not just the motivations for every specific move. Like, ideally, they want some stream of consciousness commentary style "oh look, that guy looks kind of tough, I'll go see if I can agro him. Oh no! he's way too strong... lets go hide behind this tree it looks kinda safe [...]". That's why I suggested the lets plays and not e-sports in general.
If they're ok with just noise-free motivational analysis, anything with good commentators might work, and chess is indeed a pretty clean option.
I've set up a Discord server for discussing collaborations and thinking about mechanism design for sharing out credit (current top idea is borrowing Rob Miles's Discord eigenkarma system with modifications, but liable to change), please join if you're considering becoming a run author (no commitment to being part of this effort).
I don't need the money and won't be skimming off any funds for my contributions to the project, but am very open to people turning up with a bunch of great ideas and making everything work smoother and taking a management fee as compensation, so please also join if you're interested in becoming a project leader or organizational assistant.
Practical question: Can the DM and players switch roles in the course of one "run" or does the DM have to remain the same individual? What else has to be continuous or uniform about the run? Does there have to be one overarching plot or just continuous narrative?
My coauthor and myself generated the sample run by taking turns on Action, Thought, Prompt. That is, I wrote an Action, she wrote a Thought, I wrote a Prompt, she wrote an Action, I wrote a Thought, she wrote a Prompt. This also helped show up immediately when a Thought underspecified a Prompt, because it meant the Thought and Prompt were never written by the same person.
More coherent overall plot is better - that current systems are terrible at this is all the more reason to try to show a dataset of it being done better. There doesn't necessarily need to be an advance-planned endpoint which gets foreshadowed; that is demanding a bit much of the author when they're dealing with somebody else's Actions or when people are taking turns on the Thoughts.
I have an idea for testing this approach before getting authors to write tens of thousands of pages of annotated dungeon runs.
It's hard to generate explanations of prose, but easy, for a computer, to generate explanations of particular subsets of math. For example, WolframAlpha can explain its reasoning for finding the derivative of a polynomial (click "step by step solution", then "show all steps"): Wolfram Alpha derivative example
There's a wide variety of math problems which we can programmatically solve, and can therefore programmatically generate explanations for:
(Actually, most of these are probably too hard to learn. Should focus on the really simple ones like long division.)
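(As a toy illustration of what a programmatically generated explanation for one of these simple problems might look like, here is a sketch for long division; the wording of the steps is my own invention.)

```python
# Toy generator of step-by-step "explanations" for long division.

def long_division_steps(dividend: int, divisor: int) -> str:
    """Return a natural-language trace of each step of the long division."""
    steps, remainder, quotient_digits = [], 0, []
    for digit in str(dividend):
        remainder = remainder * 10 + int(digit)
        q = remainder // divisor
        steps.append(
            f"Bring down {digit} to make {remainder}; "
            f"{divisor} goes into {remainder} {q} time(s), "
            f"leaving {remainder - q * divisor}."
        )
        quotient_digits.append(str(q))
        remainder -= q * divisor
    steps.append(f"Quotient: {int(''.join(quotient_digits))}, remainder: {remainder}.")
    return "\n".join(steps)

print(long_division_steps(7315, 6))  # quotient 1219, remainder 1
```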
The idea is to:
(cross-posting this comment from E. S. Yudkowsky's Facebook with some edits / elaboration)
Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.
Note this is similar to the "self-explaining AI" idea I explored in early 2020, which I threw together a paper on (I am hesitant to link to it because it's not that great a paper and much of the discussion there is CNN-specific, but here it is). I can see how producing "thoughts" could help us trust/determine how much a model really understands what's going on or how to make a good story.
However I also could see the "thoughts" output misleading people - people might mistake the model's explanations as mapping onto the calculations going on inside the model to produce an output. The way GPT-3 works, I suspect, is very far from how humans think. GPT-3 is very bad at a lot of common sense and physics-based reasoning, for instance, yet based on the thoughts output the user might be misled into thinking the model understands common sense notions or physics since it's spouting off a version of some stuff it...
We're guessing 1000 steps per reasonably-completed run (more or less, doesn't have to be exact) and guessing maybe 300 words per step, mostly 'thought'. Where 'thoughts' can be relatively stream-of-consciousness once accustomed (we hope) and the dungeon run doesn't have to be Hugo quality in its plotting, so it's not like we're asking for a 300,000-word edited novel.
However I also could see the "thoughts" output misleading people - people might mistake the model's explanations as mapping onto the calculations going on inside the model to produce an output.
I think the key point on avoiding this is the intervening-on-the-thoughts part:
"An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts".
So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.
Came across this today on r/mlscaling and thought I'd put it here since it's relevant: https://arxiv.org/abs/2201.11903#google
This paper explores the ability of language models to generate a coherent chain of thought—a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
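(For concreteness, here is roughly what such a chain-of-thought prompt looks like, adapted from the paper's running arithmetic example as I recall it; the exact wording in the paper may differ.)

```python
# Few-shot chain-of-thought prompt: the exemplar's answer includes intermediate
# reasoning, and the model is expected to imitate that pattern on the new question.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

# A sufficiently large model, given this prompt, tends to continue with the
# reasoning steps ("23 - 20 = 3; 3 + 6 = 9") before stating the final answer.
```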
This looks exciting! I wonder about the proposed training setup: If one model produces the thoughts, and another one takes those as input to the prompts, are we actually learning anything about the internal state of either model? What is the advantage (beyond scalability) of this training setup vs just using the second model to produce continuations conditional on thoughts?
I know next to nothing about AI, so please correct me if I'm wrong, but it seems like the thought process of a dungeon master is a difficult starting point, since they're balancing out multiple levels of considerations. They're simulating a world, but also trying to shape a story (plus modelling the players & writing decent prose). The data would seem to be simpler to understand if you're dealing with a pure simulationist DM, or player who's 100% roleplayer (or munchkin), as the chains of reasoning would be focused on maximizing a single clear metric.
I ask because if so, some of those options might be also be easier to produce than a true AI Dungeon run.
I think there are a lot of amazing people in the roleplaying games community who could help meet this project's goals. That said, I'm worried this document would be hard to understand for most of that community, which doesn't overlap that much with the AI community. I'd suggest rephrasing the ask in plain English.
"We're looking to pay dungeon masters to submit transcripts of games with a documentation of their thought process, so we can train algorithms to think the way dungeon masters do. Here's the format we need them in and steps for how to apply, and how much you can get paid".
A possible way to scale it: "collaborative fanfic dungeons":
Can we apply for consultation as a team of two? Of the resources you are offering, we only want remote consultation, because we are not based in the Bay Area.
I appreciate this post, though mostly secondhand. It's special to me because it provided me with a way to participate more-or-less directly in an alignment project: one of my glowfic buddies decided to rope me in to write a glowfic thread in this format for the project [here](https://glowfic.com/posts/5726). I'd like to hear more updates about how it's gone in the last year, though!
It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on upwork
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)
I'd be interested to hear if anyone has tried these things and run into roadblocks.
I'm also ...
Related work:
Show Your Work: Scratchpads for Intermediate Computation with Language Models
https://arxiv.org/abs/2112.00114
(from very surface-level perusal) Prompting the model resulted in
1) Model outputting intermediate thinking "steps"
2) Capability gain
I'm just "importing" my twitter thread and adding some additional thoughts.
If some model could spit out 100 of these annotated adventures, then the challenge would have already been solved.
Not sure about that 300,000-word document idea, though... A word-dump-focused "result" plays into the strengths of LLMs while providing none of the structure that is missing.
The more I work on this, the more I think you want something different. Perhaps use existing choose your own adventure books as a starting point, and work on deconstructing them; expanding...
I think the bounty amount is too low to attract skilled writers. The rate of ~3.3 cents/word is substantially less than the 6-10 cents per word most publications pay. Though it is stated in this post that a run "does not need to be published-novel-quality literature", this project is sufficiently weird that I'd imagine most skilled writers would rather write traditional short fiction, especially when considering that this project wouldn't move writers towards either career development or their passions.
Are you familiar with forum rps (role play)s? A group of people would collectively write a story, each playing the role of one character. (Looking it up now, looks like there are some Choose-Your-Own-Adventure variants akin to AI Dungeon) It was more popular back in the day, but it looks like there are some still extant.
These people are already doing something like what you're asking for, so it might be worth someone dropping in and offering to pay them in exchange for them taking copious notes.
Funnily enough, the first one I found when googlin...
I started with something more "contained" and easier to manage because actual users will go off script every chance they get, and this is basically like playing chess against yourself while reading a book on how to play chess. But, I may have found a kind of working compromise in terms of format and what needs to be captured. Will need a few days to see how it holds up, but right now, this is the basic idea:
Initial PROMPT to get the story started, followed by THOUGHTS that examine them from a gaming perspective, an ACTION, my THOUGHTS, another PROMPT...
Some naive thoughts in case useful:
A) Is the structured annotation format more useful than a gamemaster/writer thinking aloud while recording themselves (possibly with an audience)?
That could be the closest thing to a full transcript of the human process which downstream tasks could condense as needed. An adopted annotation format (prescribed or not) could potentially cause thoughts to be filtered, reinterpreted, or even steer human generation?
One key example against a fixed-format annotation, I think is that human gamemasters and writers do not spend appr...
It seems to me that the comments in code provide "visible thoughts" for what the programmer intends. What do you hope to learn from training language models on thought-annotated dungeons that you couldn't learn from language models that have already been trained on commented code?
silly idea: instead of thought-annotating AI dungeon plays, we could start by annotating thoughts for Akinator game runs.
pros: a much easier and faster way to build a dataset, with less ambiguity
cons: somewhat more restricted than the original idea.
As a photographer, I got excited at first by the inclusion of the word "visible", but I guess today is not my day. Is there any chance for me to participate in training ML models by collecting a dataset of photos? I'm in the process of relocating to Singapore, but getting a work visa takes a while so I have a lot of free time now.
I don't understand why showing the thinking of the DM/Author is important for this problem. To me it feels sufficient to show the thinking of the characters alone?
(Update Jan. 12: We released an FAQ last month, with more details. Last updated Jan. 7.)
(Update Jan. 19: We now have an example of a successful partial run, which you can use to inform how you do your runs. Details.)
We at MIRI are soliciting help with an AI-alignment project centered around building a dataset, described below. We have $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.
If this project goes well, then it may be the first of a series of prizes we offer for various projects.
Below, I’ll say more about the project, and about the payouts and interim support we’re offering.
The Project
Hypothesis: Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.
We’d like to test this hypothesis by fine-tuning/retraining a language model using a dataset composed of thought-annotated dungeon runs. (In the manner of AI dungeon.)
A normal (un-annotated) dungeon run is a sequence of steps in which the player inputs text actions and the dungeon master responds with text describing what happened in the world as a result.
We’d like a collection of such runs, that are annotated with "visible thoughts" (visible to potential operators or programmers of the system, not to players) describing things like what just happened or is about to happen in the world, what sorts of things the player is probably paying attention to, where the current sources of plot tension are, and so on — the sorts of things a human author would think while acting as a dungeon master. (This is distinct from producing thoughts explaining what happened in the dungeon; “visible thoughts” are meant to play an active role in constructing the output.)
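(To make the data format concrete, here is a minimal sketch of one step as structured data; the field names are shorthand inferred from the Action/Thought/Prompt terminology used elsewhere in this thread, not a prescribed schema.)

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    action: str    # the player's text input
    thoughts: str  # the author's visible thoughts: world state, plot tension, player attention, ...
    prompt: str    # the dungeon master's text describing what happened

@dataclass
class Run:
    steps: List[Step]  # ~1,000 steps for a reasonably complete run
```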
Once we have such a dataset, MIRI’s hope is that present or future technology will be able to train a model or models which iteratively produce visible thoughts along with storytelling, based on user actions plus previous history (including previous thoughts). The goal is to transition the state of AI dungeon technology from “An AI outputs story text in response to actions (and we have no idea how)” to “An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts”.
Here’s an example of the first couple of steps of a thought-annotated dungeon run (or “quest”), in the format MIRI currently thinks is worth trying. Some kinds of thoughts are marked with parentheses and/or brackets; see the next section for details on this.
A difficult first step in testing the hypothesis above is generating a sufficiently large dataset (suitable for language model retraining) of thought-annotated dungeon runs. This likely requires at least a moderate degree of introspective and authorial skill from the people creating the dataset. See this sample of a partial run to get a further sense of what we are looking for. More detail on the type of thing we’re looking for can hopefully be inferred from that sample, though applicants will also have a chance to ask clarifying questions.
The project of producing this dataset is open starting immediately, in a hybrid prize/grant format. We will pay $20,000 per run for the first 10 completed runs that meet our quality standard (as decided unilaterally by Eliezer Yudkowsky or his designates), and $1M total for the first batch of 100 runs beyond that.
If we think your attempt is sufficiently promising, we’re willing to cover your expenses (e.g., the costs of paying the authors) upfront, and we may also be willing to compensate you for your time upfront. You’re welcome to write individual runs manually, though note that we’re most enthusiastic about finding solutions that scale well, and then scaling them. More details on the payout process can be found below.
The Machine Learning Experiment
In slightly more detail, the plan is as follows (where the $1.2M prizes/budgets are for help with part 1, and part 2 is what we plan to subsequently do with the dataset):
1. Collect a dataset of 10, then ~100 thought-annotated dungeon runs (each run a self-contained story arc) of ~1,000 steps each, where each step contains:
It’s unclear to us how much skill is required to produce this dataset. The authors likely need to be reasonably introspective about their own writing process, and willing to try things and make changes in response to initial feedback from the project leader and/or from MIRI.
A rough estimate is that a run of 1,000 steps is around 300k words of mostly thoughts, costing around 2 skilled author-months. (A dungeon run does not need to be published-novel-quality literature, only coherent in how the world responds to characters!) A guess as to the necessary database size is ~100 runs, for about 30M words and 20 author-years (though we may test first with fewer/shorter runs).
2. Retrain a large pretrained language model, like GPT-3 or T5
A reasonable guess is that performance more like GPT-3 than GPT-2 (at least) is needed to really make use of the thought-intermediates, but in lieu of a large pretrained language model we could plausibly attempt to train our own smaller one.
Our own initial idea for the ML architecture would be to retrain one mode of the model to take (some suffix window of) the history units and predict thoughts, by minimizing the log loss of the generated thought against the next thought in the run, and to retrain a second mode to take (some suffix window of) the history units plus one thought, and produce a prompt, by minimizing the log loss of the generated prompt against the next prompt in the run.
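(A minimal sketch of this two-mode setup, assuming a generic causal language model and a dataset of (history, thought, prompt) triples; this is an illustration of the idea, not MIRI's actual implementation, and the model name and string placeholders are stand-ins.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # stand-in for a larger model
thought_model = AutoModelForCausalLM.from_pretrained("gpt2")   # mode 1: history -> thought
prompt_model = AutoModelForCausalLM.from_pretrained("gpt2")    # mode 2: history + thought -> prompt

def continuation_loss(model, context: str, target: str) -> torch.Tensor:
    """Log loss of the target continuation given the context; context tokens
    are masked out of the loss with the label value -100."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # only score the target tokens
    return model(input_ids=input_ids, labels=labels).loss

# One training step on a single (history, thought, prompt) triple.
history = "<suffix window of previous prompts, thoughts, and actions>"
thought = "<the author's next thought>"
prompt = "<the next prompt shown to the player>"

loss = continuation_loss(thought_model, history, thought) \
     + continuation_loss(prompt_model, history + "\n" + thought, prompt)
loss.backward()  # in practice: separate optimizers per mode, batching, many steps
```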
Imaginably, this could lead to the creation of dungeon runs that are qualitatively “more coherent” than those generated by existing methods. The primary goal, however, is that the thought-producing fragment of the system gives some qualitative access to the system’s internals that, e.g., allow an untrained observer to accurately predict the local developments of the story, and occasionally answer questions about why things in the story happened; or that, if we don’t like how the story developed, we can intervene on the thoughts and get a different story in a controllable way.
Motivation for this project
Many alignment proposals floating around in the community are based on AIs having human-interpretable thoughts in one form or another (e.g., in Hubinger’s survey article and in work by Christiano, by Olah, and by Leike). For example, this is implicit in the claim that humans will be able to inspect and understand the AI’s thought process well enough to detect early signs of deceptive behavior. Another class of alignment schemes is based on the AI’s thoughts being locally human-esque in some fashion that allows them to be trained against the thoughts of actual humans.
I (Nate) personally don’t have much hope in plans such as these, for a variety of reasons. However, that doesn’t stop Eliezer and me from wanting to rush ahead and start gathering empirical evidence about how possible it is in practice to get modern AI systems to factor their cognition through human-interpretable visible intermediates.
Modern AIs are notably good at crafting English text. Some are currently used to run dungeons (with modest success). If you wanted to look at the place where current AIs excel the most in crafting artifacts, among the artifacts they are best and most impressive at crafting are English paragraphs.
Furthermore, compared to many other things AIs have learned to do, if you consider the task of running a responsive text dungeon, it seems relatively possible to ask a (relatively unusually) introspective human author to write down their thoughts about how and why they would generate the next prompt from the user’s input.
So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.
This sort of interpretability is distinct from the sort of transparency work in something like Circuits (led by Chris Olah) — while Circuits is trying to “open the black box” of machine learning systems by directly looking at what is happening inside of them, the project proposed here is just attempting the less ambitious task of having black-box models output interpretable intermediates producing explanations for their behavior (but how such black box models might go about doing that internally is left unconstrained). The reason for our focus on this particular project of visible thoughts isn’t because we believe it to be better or more fruitful than Circuits-style transparency (we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it), but just because it’s a different approach where it might also be possible to push progress forward.
Note that proponents of alignment strategies that involve human-esque thoughts (such as those linked above) do not necessarily endorse this particular experiment as testing any of their key uncertainties or confusions. We welcome suggested tweaks to the experiment (in the comments of the version of this announcement as it occurs on LessWrong) from any such proponents, to render it a better test of your ideas. (Though even if it doesn’t sate your own curiosity, we expect to learn some things ourselves.)
The main thing this project needs is a dataset, so MIRI is starting on producing that dataset. It’s plausible to us that GPT-3 will prove wholly unable to make use of this dataset; even if GPT-3 can’t, perhaps GPT-4 or some other future system will be able to.
There are additional more general reasons to work on this project. Specifically, it seems to me (Nate) and to Eliezer that capacity to execute projects such as this one is the current limiting bottleneck on MIRI. By pursuing this project, we attempt to resolve that bottleneck.
We hope, through this process, to build our capacity to execute on a variety of projects — perhaps by succeeding at the stated objective of building a dataset, or perhaps by learning about what we’re doing wrong and moving on to better methods of acquiring executive talent. I’ll say more about this goal in “Motivation for the public appeal” below.
Notes on Closure
I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs. Locally these might look like increased coherence of the overall narrative arc, increased maintenance of local story tension, and increased consistency in the described world-state over the course of the run. If successful, the idiom might generalize further; it would have to, in order to play a role in later alignment of AGI.
As a matter of policy, whenever a project like this has plausible capabilities implications, we think the correct response is to try doing it in-house and privately before doing it publicly — and, of course, only then when the alignment benefits outweigh the plausible capability boosts. In this case, we tried to execute this project in a closed way in mid-2021, but work was not proceeding fast enough. Given that slowness, and in light of others publishing related explorations and results, and in light of the relatively modest plausible capability gains, we are moving on relatively quickly past the attempt to do this privately, and are now attempting to do it publicly.
Motivation for the public appeal
I (Nate) don’t know of any plan for achieving a stellar future that I believe has much hope worth speaking of. I consider this one of our key bottlenecks. Offering prizes for small projects such as these doesn’t address that bottleneck directly, and I don’t want to imply that any such projects are going to be world-saving in their own right.
That said, I think an important secondary bottleneck is finding people with a rare combination of executive/leadership/management skill plus a specific kind of vision. While we don’t have any plans that I’m particularly hopeful about, we do have a handful of plans that contain at least a shred of hope, and that I’m enthusiastic about pursuing — partly in pursuit of those shreds of hope, and partly to build the sort of capacity that would let us take advantage of a miracle if we get one.
The specific type of vision we’re looking for is the type that’s compatible with the project at hand. For starters, Eliezer has a handful of ideas that seem to me worth pursuing, but for all of them to be pursued, we need people who can not only lead those projects themselves, but who can understand the hope-containing heart of the idea with relatively little Eliezer-interaction, and develop a vision around it that retains the shred of hope and doesn’t require constant interaction and course-correction on our part. (This is, as far as I can tell, a version of the Hard Problem of finding good founders, but with an additional constraint of filtering for people who have affinity for a particular project, rather than people who have affinity for some project of their own devising.)
We are experimenting with offering healthy bounties in hopes of finding people who have both the leadership/executive capacity needed, and an affinity for some ideas that seem to us to hold a shred of hope.
If you’re good at this, we’re likely to make you an employment offer.
The Payouts
Our total prize budget for this program is $1.2M. We intend to use it to find a person who can build the dataset in a way that scales, presumably by finding and coordinating a pool of sufficiently introspective writers. We would compensate them generously, and we would hope to continue working with that person on future projects (though this is not a requirement in order to receive the payout).
We will pay $20k per run for the first 10 thought-annotated runs that we accept. We are willing to support applicants in producing these runs by providing them with resources up-front, including small salaries and budgets for hiring writers. The up-front costs a participant incurs will be deducted from their prizes, if they receive prizes. An additional $1M then goes to anyone among the applicants who demonstrates the ability to scale their run-creating process to produce 100 runs. Our intent is for participants to use some of that money to produce the 100 runs, and keep the remainder as a prize. If multiple participants demonstrate similar abilities to scale at similar quality-levels and similar times, the money may be split between them. We plan to report prize awards publicly.
In principle, all you need to do to get paid for thought-annotated dungeon runs is send us runs that we like. If your run is one of the first 10 runs, or if you’re the first to provide a batch of 100, you get the corresponding payment.
That said, whether or not we decide to pay for a run is entirely and unilaterally up to Eliezer Yudkowsky or his delegates, and will depend on whether the run hits a minimum quality bar. Also, we are willing to pay out from the $1M prize/budget upon becoming convinced that you can scale your process, which may occur before you produce a full 100 runs. We therefore strongly recommend getting in contact with us and proactively making sure that you’re on the right track, before sinking large amounts of time and energy into this project. Our senior research staff are willing to spend time on initial conversations and occasional check-ins. For more information on our support resources and how to access them, refer to the support and application sections below.
Note that we may tune or refine the bounty in response to feedback in the first week after this post goes live.
Support
We intend to offer various types of support for people attempting this project, including an initial conversation; occasional check-ins; office space; limited operational support; and certain types of funding.
We currently expect to have (a limited number of) slots for initial conversations and weekly check-ins, along with (a limited amount of) office space and desks in Berkeley, California for people working on this project. We are willing to pay expenses, and to give more general compensation, in proportion to how promising we think your attempts are.
If you’d like to take advantage of these resources, follow the application process described below.
Application
You do not need to have sent us an application in order to get payouts, in principle. We will pay for any satisfactory run sent our way. That said, if you would like any of the support listed above (and we strongly recommend at least one check-in to get a better understanding of what counts as success), complete the following process:
If we think your application is sufficiently promising, we’ll schedule a 20 minute video call with some senior MIRI research staff and work from there.