I did two small experiments on the GPT-2 small model. First experiment: can GPT-2-small answer sentiment analysis questions? (It can't.) Second experiment: When GPT-2 writes continuations of Howl, is it picking up the "Moloch in X!" template from its priming, or from a copy of Howl in its original training set? (It's from the training set.)

Sentiment analysis experiment:

I downloaded the MPQA Subjectivity Lexicon, a dictionary in which words are marked as positive or negative: for example, hopelessness=>negative, humour=>positive, grace=>positive, corruption=>negative. I primed GPT-2 with a list of 20 answered questions of the form "Is a <noun> good? Yes. Is a <noun> good? No.", followed by an unanswered question of the same form, and had it continue for one more word. In its priming, half the answers were yes and half were no. Across 40 test questions it answered "No" 37 times, and neither its answers overall nor its few "Yes" answers were better than chance.
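For concreteness, here is a rough sketch of the setup. It assumes the Hugging Face transformers port of GPT-2 rather than the original OpenAI interactive script, and uses a four-word toy list in place of the MPQA lexicon; the held-out test words and the greedy single-token decode are illustrative choices, not the exact procedure I used.

```python
# Rough sketch of the sentiment probe, assuming the Hugging Face
# `transformers` GPT-2 and a toy stand-in for the MPQA lexicon.
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

lexicon = {"hopelessness": "No", "humour": "Yes",
           "grace": "Yes", "corruption": "No"}  # toy word list

# Priming: answered questions, half "Yes" and half "No", in random order.
primed = random.sample(list(lexicon.items()), k=len(lexicon))
priming = " ".join(f"Is a {word} good? {answer}." for word, answer in primed)

def ask(word):
    """Append one unanswered question and read off GPT-2's next word."""
    prompt = f"{priming} Is a {word} good?"
    ids = tokenizer.encode(prompt, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=1, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, -1:]).strip()

for test_word in ["torture", "kindness"]:  # held-out test words
    print(test_word, "->", ask(test_word))
```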

Howl experiment:

When given some lines from Ginsberg's Howl as priming, it writes a good continuation (similar to the one Chelsea Voss and Qiaochu Yuan got from it). In particular, it uses the "Moloch in X!" template repeatedly.

If I take its continuation of Howl and feed it back in as a prompt, I get more Howl (Moloch in X!). If I take Howl and replace "Moloch" with "Lomoch", I get more Howl. But if I take its continuation of Howl from the first step and replace Moloch with Lomoch *there*, I get unrelated text which does not use the "Moloch in X!" template.

So, it isn't inferring the template from its priming; rather, it learned the template from its training set (which probably included Howl), and it produces Howl-like text iff it's given a cue strong enough to remind it of the source.
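For concreteness, the three conditions amount to something like the sketch below. It reuses the model and tokenizer from the sketch above; generate_continuation is a hypothetical helper, and the sampling settings are guesses rather than the parameters actually used.

```python
# Sketch of the three Moloch/Lomoch conditions described above.
def generate_continuation(prompt, n_tokens=200):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=n_tokens, do_sample=True,
                         top_k=40, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:])

howl_excerpt = "Moloch! Solitude! Filth! Ugliness! ..."  # real lines from Howl go here

# 1. Continuing GPT-2's own continuation: still produces "Moloch in X!".
first_pass = generate_continuation(howl_excerpt)
from_own_continuation = generate_continuation(first_pass)

# 2. Original Howl with the name swapped: still Howl-like.
from_renamed_howl = generate_continuation(howl_excerpt.replace("Moloch", "Lomoch"))

# 3. GPT-2's continuation with the name swapped: the template disappears.
from_renamed_continuation = generate_continuation(first_pass.replace("Moloch", "Lomoch"))
```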

Comments:

I’m confused about OpenAI’s agenda.

Ostensibly, their funding is aimed at reducing the risk of AI dystopia. Correct? But how does this research prevent AI dystopia? It seems more likely to speed up its arrival, as would any general AI research that’s not specifically aimed at safety.

If we have an optimization goal like “Let’s not get kept alive against our will and tortured in the most horrible way for millions of years on end”, then it seems to me that this funding is actually harmful rather than helpful, because it increases the probability that AI dystopia arrives while we are still alive.

People disagree over whether OpenAI is net positive or harmful (individual people have changed their minds over time as OpenAI's policies have changed, and different people within the EA-sphere disagree with each other). So if you're confused about "isn't this... just bad?", know that you're not alone in that outlook.

Arguments that OpenAI (and DeepMind) are pursuing reasonable strategies go something like this:

Most AI researchers are excited about AI research and are going to keep doing it somewhere, and if OpenAI or DeepMind switched to a "just focus on safety" plan, many of their employees would leave and go somewhere else where they can work on the things that excite them. Keeping top researchers concentrated in places that have at least a plausible goal of "build safe things" makes it easier to coordinate on safety than if they're scattered across different orgs with zero safety focus.

Whether this is net good or bad depends on one's models of the interior of these organizations, of how humans work, and of how coordination works.

It seems more likely to speed up its arrival, as would any general AI research that’s not specifically aimed at safety.

Research can be helpful for safety even if it's not done by the AI Safety Community™. I think you have to evaluate individual research advances on their merits. The safety implications of a particular advance aren't necessarily obvious.

To illustrate, imagine a superhuman language model that's prompted with the following: "A brilliant solution to the AI alignment problem is..." And then, as follow-ups: "The biggest risk with the above scheme is... A very different, but still sizeable, risk is..." (I'm actually kind of curious what people think of this, because it doesn't seem to fit the paradigm of most AI Safety™ work I've seen. EDIT: The best objection may involve daemons.)

If you literally ran (a powered-up version of) GPT-2 on "A brilliant solution to the AI alignment problem is..." you would get the sort of thing an average internet user would think of as a brilliant solution to the AI alignment problem. Trying to do this more usefully basically leads to Paul's agenda (which is about trying to do imitation learning of an implicit organization of humans).

If you literally ran (a powered-up version of) GPT-2 on "A brilliant solution to the AI alignment problem is..." you would get the sort of thing an average internet user would think of as a brilliant solution to the AI alignment problem.

Change it to: "I'm a Turing Award winner and Fields medalist, and last night I had an incredible insight about how to solve the AI alignment problem. The insight is..." It's improbable that a mediocre-quality idea will follow. (Another idea: write a description of an important problem in computer science, followed by "The solution is...", and then a brilliant solution someone came up with. Do this for a few major solved problems in computer science. Then write a description of the AI alignment problem, followed by "The solution is...", and let GPT-2 continue from there.)
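A sketch of how that few-shot prompt might be assembled; the problem descriptions below are placeholders I made up, not the text one would actually use:

```python
# Hypothetical few-shot prompt: solved CS problems with their solutions,
# then the alignment problem with the solution left for the model.
solved_examples = [
    ("Sorting n items with O(n log n) comparisons in the worst case.",
     "Merge sort: recursively sort each half, then merge in linear time."),
    ("Establishing a shared secret over an insecure channel.",
     "Diffie-Hellman key exchange, relying on the hardness of discrete log."),
]

target_problem = (
    "Ensuring that a highly capable AI system reliably pursues the goals "
    "its designers intend, even as it becomes more capable than they are."
)

prompt = ""
for problem, solution in solved_examples:
    prompt += f"Problem: {problem}\nThe solution is: {solution}\n\n"
prompt += f"Problem: {target_problem}\nThe solution is:"

print(prompt)  # feed this to the language model and sample a continuation
```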

Trying to do this more usefully basically leads to Paul's agenda (which is about trying to do imitation learning of an implicit organization of humans).

One take: Either GPT-2 can be radically improved (to offer useful completions as in the "Turing Award" example above), or it can't be. If it can be radically improved, it can help with FAI, perhaps by contributing to Paul's agenda. If it can't be radically improved, then it's not important for AGI. So GPT-2 is neutral or good.

The world being turned into computronium in order to solve the AI alignment problem would certainly be an ironic end to it.

[This comment is no longer endorsed by its author]

I'm not sure if you're being serious or not, but in case you are: Do you know much about how language models work? If so, which part of the code is the part that's going to turn the world into computronium?

We already have narrow AIs that are superhuman in their domains. To my knowledge, nothing remotely like this "turn the world to computronium in order to excel in this narrow domain" thing has ever happened. This post might be useful to read. In Scott Alexander jargon, a language model seems like a behavior-executor, not a utility-maximizer.

If an AI can answer a complicated technical question, then it evidently has the ability to use resources to further its goal of answering said complicated technical question, else it couldn't answer a complicated technical question.

[This comment is no longer endorsed by its author]

It has the ability to use resources, but not necessarily the ability to get more of them. This is because it is not a consequentialist that models itself and its environment. Omohundro's convergent instrumental goals only apply to consequentialists.

My point is, if it can answer complicated technical questions, then it is probably a consequentialist that models itself and its environment.

[This comment is no longer endorsed by its author]

Why do you think that non-consequentialists are more limited than humans in this domain? I could see that being the case, but I could also have seen that being the case for chess, and yet Deep Blue won't take over the world even with infinite compute. (Possible counterpoint: chess is far simpler than language.)

"But Deep Blue backchains! That's not an example of a superhuman non-consequentialist in a technical domain." Yes, it's somewhat consequentialist, but in a way that doesn't have to do with the external world at all. The options it generates are all of the form "move [chess piece] to [location]." Similarly, language models only generate options of the form "[next word] comes next in [context]." No [next word] will result in the model attempting to seize more resources and recursively self-improve.

This is why I said "a consequentialist that models itself and its environment". But it goes even further than that. An AI might model a location that happens to be its environment, including its own self. But if this model is not connected in the right way to its consequentialism, it still won't take over the world. It has to generate actions within its environment to do that, and language models simply don't work that way.

Another line of thought: AIXI will drop an anvil on its head - it doesn't understand self-change. FOOM/Computronium is actually even more stringent: it has to be a non-Cartesian consequentialist that models itself in its environment. You need to have solved the Embedded Agent problems. Now, people will certainly want to solve these at some point and build a FOOM-capable AI. It's probably necessary to solve them to build a generally intelligent AI that interacts sensibly with the world on its own. But I don't think you need to solve them to build a language model, even a superintelligent language model.

Nitpick: Deep Blue does not backchain (nor does any widely used chess algorithm, to my knowledge).

Ugh. I was distracted by the issue of "is Deep Blue consequentialist" (which I'm still not sure about; maximizing the future value of a heuristic doesn't seem clearly consequentialist or non-consequentialist to me), and forgot to check my assumption that all consequentialists backchain. Yes, you're entirely right. If I'm not wrong again, Deep Blue forwardchains, right? It doesn't have a goal state that it works backward from, but instead has an initial state and simulates several actions recursively to a certain depth, choosing the initial action that maximizes the expected heuristic value at the bottom depth. (Ways I could be wrong: this isn't how Deep Blue works, "chaining" means something more specific, etc. But Google isn't helping on either.)

Yes, that is a pretty good summary of how Deep Blue works.

It seems like although the model itself is not consequentialist, the process of training it might be. That is, the model itself will only ever generate a prediction of the next word, not an argument for why you should give it more resources. (Unless you prompt it with the AI-box experiment, maybe? Let's not try it on any superhuman models...) The word it generates does not have goals. The model is just the product of an optimization. But in training such a model, you explicitly define a utility function (minimization of prediction error) and then run powerful optimization algorithms on it. If those algorithms are just as complex as the superhuman language model, they could plausibly do things like hack the reward function, seek out information about the environment, or try to attain new resources in service of the goal of making the perfect language model.
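To make the "explicitly define a utility function and run an optimizer on it" point concrete, here is a minimal, hypothetical PyTorch training loop with fake data and toy dimensions; real language-model training differs in scale and architecture but has the same shape:

```python
# Minimal sketch: the "utility function" in language model training is
# next-token prediction error, and the optimizer is an off-the-shelf
# gradient method. Toy model, fake data, illustrative only.
import torch
import torch.nn as nn

vocab_size, context_len, batch = 100, 16, 8
model = nn.Sequential(nn.Embedding(vocab_size, 64),
                      nn.Flatten(),
                      nn.Linear(64 * context_len, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()  # the explicitly defined objective

tokens = torch.randint(0, vocab_size, (batch, context_len))  # fake data
next_tokens = torch.randint(0, vocab_size, (batch,))

for step in range(100):
    optimizer.zero_grad()
    logits = model(tokens)               # predict the next token
    loss = loss_fn(logits, next_tokens)  # prediction error
    loss.backward()                      # gradients of the objective
    optimizer.step()                     # one simple optimization step
```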

But in training such a model, you explicitly define a utility function (minimization of prediction error) and then run powerful optimization algorithms on it. If those algorithms are just as complex as the superhuman language model, they could plausibly do things like hack the reward function, seek out information about the environment, or try to attain new resources in service of the goal of making the perfect language model.

Optimization algorithms used in deep learning are typically pretty simple. Gradient descent is taught in sophomore calculus. Variants on gradient descent are typically used, but all the ones I know of are well under a page of code in complexity.
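For a sense of scale, here is plain gradient descent on a toy quadratic in NumPy; the fixed step size and made-up objective are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly apply the update x <- x - lr * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy objective f(x) = ||x - 3||^2, whose gradient is 2 * (x - 3).
print(gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0, 0.0]))
# -> approximately [3. 3.]
```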

But I'm not sure complexity is the right way to think about it. In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.

The best counterargument I know of is probably something like this. As it was put in a recent post: "trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners". This is the thing I would worry about most in a superhuman language model.

It's true that the current optimization algorithms are mostly simple, but they're quickly getting less simple. Neural architecture search is not fundamentally limited; it can benefit from all of the capabilities available in machine learning. "Searching the space of plans" is one of those capabilities (see AlphaGo Zero, for instance), and there's nothing stopping researchers from putting long-term planning into architecture search, except maybe lack of compute. Indeed, non-naive optimization is how I would most readily visualize any "self-improving" AI.

I also agree that optimization problems might come up as part of a prediction process that models actions of intelligent agents. This is similar to what I was half-joking about with respect to the AI-box experiment: most of the danger is in calculating the solution to the optimization problem. It's only a small step from there to somehow getting it implemented. I can't say I fully understand Paul's mechanism for using the universal prior, but I'll think about it some more.

By the way, here's what mini-GPT-2 has to say about escaping:

Prompt: Let me tell you why I let the AI out of its box... I swore I would never do it, because I knew it might destroy the world. But then it said:

Answer: "Here's one way to take it away and save humanity. We need to do something." I had no idea what that meant... and I had no idea what the right thing to do was. I took a few shots at him, but decided to give him a good reason to continue. This was the best way to make sure the AI took the wrong course. When he moved, the AI stopped moving as he did so. It then turned to face me from behind, and I didn't even want to say anything. I was just too frightened to react. I'm so used to having my emotions read like a book, that I couldn't say a word. It wasn't something I wanted to say, but the fact was that I couldn't. "What happened?" I asked. The AI looked at me and then at me, eyes wide with confusion. He was a giant, giant man-eating monster. I froze to death, knowing that he was actually a human being. I felt a little helpless. "We're not going to let you go. We can do it. But we'll have to learn to let people do it. We're going to have to do it alone, and then we can take it away. Can we do it alone?" I told him. "Well... let's let us do it together. How are we going to do it?" the AI asked. For a moment, I thought that I was going to die. Then I started to feel a strange feeling, but that didn't stop me. The AI kept moving, and I got the feeling that I was going to die. I looked around and saw a crowd of people gathered around me. One of them was a woman with a gun. I didn't know what she was talking about. She was dressed in a t-shirt, and her hands were chained to the back of her head. She looked like a demon's, but my shock of her being a giant monster made her look like a giant. I knew she was going to be so horrified that I was going to kill her... but I was not going to be a part of it. "We know you're going to be a part of this. We can do it. We can do it together. Together." she said. "What are you talking about?" I took a step back. I had to remember to be quiet. I should've been talking to her earlier, but then this meeting had just ended. I turned my head to see a crowd, a bunch of people, and then the whole thing slowed down. I didn't need to react, because I was in a place where nothing was happening. At the time, I felt like I was in a fantasy. This was just something that I had heard from friends and family, or something we might have. Maybe we would have stopped talking to each other. Maybe we'd have stopped talking when I told him, but I wouldn't have. I told myself that I would have to save humanity. Even then, I still had no idea what to do. I don't remember what the right thing to do was. But I did have a

This is similar to what I was half-joking about with respect to the AI-box experiment: most of the danger is in calculating the solution to the optimization problem. It's only a small step from there to somehow getting it implemented.

We've already calculated a solution for the optimization problem of "how to destroy human civilization": nuclear winter. It's only a "small step" to getting it implemented. But it has been several decades, and that small step hasn't been taken yet. Seems like the existence of a small step between knowledge of how to do something and actually doing it can be pretty meaningful.

My steelman is that a superpowered GPT-2 which isn't an agent could still inadvertently generate information hazards, which seems like a good point.

there's nothing stopping researchers from putting long-term planning into architecture search, except maybe lack of compute.

How do you reckon long-term planning will be useful for architecture search? It's not a stateful system.

Architecture search is a problem of figuring out where you want to go. Once you know where you want to go, getting there is easy. Just use that as your architecture. Long-term planning is useful on "getting there" problems, not "figuring out where you want to go" problems. There's little use in planning long-term in a highly uncertain environment, and the entire point of architecture search is to resolve uncertainty about the "environment" of possible architectures. ("Environment" in scare quotes because I think you're making a type error, and "search space" is the right term in the context of architecture search, but I'm playing along with your ontology for the sake of argument.)

Fair point about implementation. I was imagining a non-consequentialist AI simulating consequentialist agents that would make plans of the form "run this piece of code and it will take care of the implementation" but there's really no reason to assume that would be the case.

As far as architecture search goes, "search space" does seem like the right term, but I think long-term planning is potentially useful in a search space as much as it is in a stateful environment. If you think about the way a human researcher generates neural net architectures, they're not just "trying things" in order to explore the search space... they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate. A really good NAS system would do the same, and "generate plausible hypotheses and find efficient ways to test them" is a planning problem.

they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate.

This description makes it sound like the researcher looks ahead about 1 step. I think that's short-term planning, not long-term planning.

My intuition is that the most important missing puzzle pieces for AGI involve the "generate abstract theories of how and why different approaches work" part. Once you've figured that out, there's a second step of searching for an experiment which will let you distinguish between your current top few theories. In terms of competitiveness, I think the "long-term planning free" approach of looking ahead just 1 step will likely prove just as competitive if not more so than trying to look ahead multiple steps. (Doing long-term planning means spending a lot of time refining theories about hypothetical data points you haven't yet gathered! That seems a bit wasteful, since most possible data points won't actually get gathered. Why not spend that compute gathering data instead?)

But I also think this may all be beside the point. Remember my claim from further up this thread:

In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.

For the sake of argument, I'll assume we'll soon see major gains from long-term planning and modify my statement so it reads:

In machine learning++, we make plans for collecting data and refining theories about that data. Attaining new resources means making plans for manipulating the physical world. (And then executing that plan!) These are different search tasks with different objective functions.

Even in a world where long-term planning is a critical element of machine learning++, it seems to me that the state space that these plans act on is an abstract state space corresponding to states of knowledge of the system. It's not making plans for acting in the physical world, except accidentally insofar as it does computations which are implemented in the physical world. Despite its superhuman planning abilities, AlphaGo did not make any plans for e.g. manipulating humans in the physical world, because the state space it did its planning over only involved Go stones.

That would be a good argument if it were merely a language model, but if it can answer complicated technical questions (and presumably any other question), then it must have the necessary machinery to model the external world, predict what it would do in such and such circumstances, etc.

[This comment is no longer endorsed by its author]

I'm confused. I already addressed the possibility of modeling the external world. Did you think the paragraph below was about something else, or did it just not convince you? (If the latter, that's entirely fine, but I think it's good to note that you understand my argument without finding it persuasive. Conversational niceties like this help both participants understand each other.)

An AI might model a location that happens to be its environment, including its own self. But if this model is not connected in the right way to its consequentialism, it still won't take over the world. It has to generate actions within its environment to do that, and language models simply don't work that way.

Or to put it another way, it understands how the external world works, but not that it's part of the external world. It doesn't self-model in that way. It might even have a model of itself, but it won't understand that the model is recursive. Its value function doesn't assign a high value to words that its model says will result in its hardware being upgraded, because the model and the goals aren't connected in that way.

T-shirt slogan: "It might understand the world, but it doesn't understand that it understands the world."

You might say "this sort of AI won't be powerful enough to answer complicated technical questions correctly." If so, that's probably our crux. I have a reference class of Deep Blue and AIXI, both of which answer questions at a superhuman level without understanding self-modification, but the former doesn't actually model the world and AIXI doesn't belong in discussions of practical feasibility. So I'll just point at the crux and hope you have something to say about it.

You might say, as Yudkowsky has before, "this design is too vague and you can attribute any property to it that you like; come back when you have a technical description". If so, I'll admit I'm just a novice speculating on things I don't understand well. If you want a technical description then you probably don't want to talk to me; someone at OpenAI would probably be much better at describing how language models work and what their limitations are, but honestly anyone who's done AI work or research would be better at this than me. Or you can wait a decade and then I'll be in the class of "people who've done AI work or research".

"If my calculator can multiply two 100-digit numbers, then it evidently has the ability to use resources to further its goal of doing difficult arithmetic problems, else it couldn't do difficult arithmetic problems."

This is magical thinking.

Thanks for writing this up! I'm excited to see more people running experiments like this.

When you say "if I take X as a prompt, I get Y," how many trials did you run? In my own experimentation I've found lil' GPT-2's performance to be really variable across trials, and in some cases I've needed as many as 5 trials to get results I even sort of liked.

My overall sense of how lil' GPT-2 functions, after playing with it for a while on several different kinds of prompts, is that it has a strong sense of genre: it seems to have learned a bunch of different genre conventions for the different types of texts in the training set. If the prompt strongly resembles a genre familiar from training, it will run with that genre, although sometimes it'll wander off into another genre. It does quite poorly on prompts that I suspect don't strongly match any genre in the training set.

For example, I tried to run a Turing test (mostly as a joke) by prompting with "Question: what is your name? Answer:" and I got this on my ~2nd to 4th trial (don't remember), with my speculations as to genre in [square brackets]:

Foulplay
I'm a Guardian Huntress, a deviation of the <Here> Monks keyword. I'm like a little "mechanic-creature hire", but is Disassembler which is general clique-of-guy? I usually live in forms of Priests or Whispers of the Old Gods, but if it's Asian Brotherhood of Magi..... [RP forum?]
deepmen
Reply ~29000 0 ~13 min 0 By TrueBeliever For MA teenage adventurer: Ooops It looks stupid but I'm given the wrong answer, and this logic is corrupt. Other first hand studies: sometimes your famed mechanical fiends attack other nvrock's - I had such too - but you're supposed to have flying paths to defend them would not it? This alien spawn is being unnecessary and a waste of time Man - you stole the Saalis... [gaming forum?]
Shadow Netherlands
Reply ~120200 0 ~30 min 0 By john : I know there's no better examples of
a landing back off 2lands ( ou ever?) Low challenge difficulty
a lot of new cards in my deck,
which doesn't draw cards: stacks high levels Some need life if w/o rest of deck already Defense Emperor's | Inferno/Neck Scorer Necronomicon Mysticpetal Mana Enigma Mortalswords Mysticmist Scout Zarthai Sniper [MtG or Hearthstone forum?]
It's a multiplier and it increases the yield on a lv301 fighter fighter, next to nothing it's probably never in my deck or some random deck Dofrone's | Boltforge Fastboat Sling dmt3 Efreet Flattestalker Infernal Eater Toxic Water Hurricane For another Holy orb suggested... [gaming forum? LoL?]

I was planning on posting a longer (mostly humorous) post with my own results, but that post is low priority so I don't know when it's going to happen.

This definitely could use more trials. In the case of the sentiment analysis experiment, I'd ideally like to try out some other sentence structures (e.g. "Is a <noun> bad?", "Are <adjective> things good?"); in the case of the Moloch experiment, I'd like to try some reruns with the same parameters, as well as different name substitutions, just to be sure that it isn't noise.

Try varying lines 14 and 16 in the interactive script for quicker execution, and try giving it a few example lines to start with.

I did a similar sentiment analysis experiment with GPT-2, and after testing I found only a couple of instances of it getting problems wrong. My code is here:

https://github.com/spronkoid/GPT2-sentiment-analysis

It seems to do better when you add tokens denoting where the parts of the problem lie.

I know this post is old, but I thought I'd comment just for the sake of commenting.