Nature: Eliezer Yudkowsky and Stuart Russell solve AI alignment with breakthrough insight. This October, Eliezer and Stuart sat down to consider one of the most pressing technical challenges confronting humanity: How to ensure that superhuman AI is aligned with human interests. That's when they had their big insight: The alignment problem is a math problem.
In the past, Eliezer and Stuart had been thinking about the alignment problem in terms of probability theory. But probability theory isn't powerful enough to fully capture the nuances of human values. Probability theory is too coarse-grained to distinguish between a universe where humans are eaten by paperclips and a universe where humans are eaten by paperclips and everyone has a good time.
So they turned to a more powerful tool: decision theory, which underlies game theory and has been used to analyze everything from voting systems to how to play poker. Decision theory is more nuanced than probability theory, but it's also more complicated. It's not just harder for humans to grok; it's harder for computers too. So the first step was just getting decision theory into AI algorithms.
The next step was figuring out how to use decision theory to solve the alignment problem. They started by defining a reward function that would tell an AI what we want it to do. Then they set up a decision tree showing all the possible ways an AI could behave, with each branch corresponding to a different possible reward function. The goal was then to find the path that maximizes our reward under any possible future circumstance—a path that would ensure that an AI does what we want no matter what happens in the future, whether it's created by us or someone else, whether it has two neurons or two hundred billion neurons, whether it loves us or hates us or feels nothing at all about us one way or another…or even if there are no humans left on Earth at all!
But wait—how can you have an algorithm without knowing what reward function you're trying to maximize? That's like trying to find your way home without knowing which way you're facing! And yet this is exactly what Stuart and Eliezer did: They took this giant pile of unknowns—all these potential reward functions—and fed them into their decision-theoretic machine learning system as input variables…and then they let their system figure out which reward function was most likely! And when they were done, they found that their system had settled on one particular definition of human values: It was something like "human values are whatever maximizes humanity's future potential." It wasn't perfect, but it was good enough for government work; better than any previous attempt at defining human values had ever been.
And this is where they stopped. This is where they stopped and thought, "Wow, we've done it! We've solved the alignment problem!" And then they went home and slept soundly, happy in the knowledge that humanity's future was secure.
But…that's not how it happened at all. That's not how it happened at all. Because when Eliezer and Stuart had their big breakthrough, I was sitting right there with them, listening to every word. And I know what really happened.
What really happened was that Stuart and Eliezer worked on AI alignment for another decade or so before giving up in frustration. They worked on AI alignment until their hair turned gray and their teeth fell out, until their eyesight failed and their joints became arthritic from sitting at a computer for too many hours a day, until they were so old that nobody would publish their papers anymore because nobody takes old people seriously anymore. And then they died of natural causes before ever solving the alignment problem—and the world was left with no way to align AI with human values whatsoever.
We take the web for granted, but maybe we shouldn't. It's very large and nobody can read it all. There are many places we haven't been that probably have some pretty good writing. I wonder about the extent to which GPT-3 can be considered a remix of the web that makes it seem magical again, revealing aspects of it that we don't normally see? When I see writing like this, I wonder what GPT-3 saw in the web corpus. Is there an archive of Tolkien fanfic that was included in the corpus? An undergrad physics forum? Conversations about math and computer science?
Such as a darknet marketplace where animals can trade valuable resources for electricity? :D
But yeah, I agree, if there are places debating a topic that resembles the prompt, GPT-3 could be good at including them in the debate. So maybe if the result is too good, it makes sense to check parts of it with a search engine.
Maybe it would even make sense to use GPT-3 purposefully to search for something on the internet. Like, if you have a vague suspicion that something could exist, but you don't know the right keywords to type into the search engine, maybe you could just describe the thing, and hope that GPT-3 finds the right words and tells you something that you can search later. Not sure if this actually would work.
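A minimal sketch of that keyword-generation idea, assuming the pre-1.0 openai Python client; the prompt wording, parameters, and function name are illustrative only, not something tested in this post:

```python
import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the environment

def suggest_search_keywords(description: str) -> str:
    """Ask the model to turn a vague description into search-engine keywords."""
    prompt = (
        "Description of a thing I vaguely remember or suspect exists:\n"
        f"{description}\n\n"
        "Good search-engine keywords for finding it:"
    )
    response = openai.Completion.create(
        engine="davinci",   # same base model used elsewhere in this post
        prompt=prompt,
        max_tokens=32,
        temperature=0.3,    # keep the suggested keywords conservative
        stop=["\n"],
    )
    return response["choices"][0]["text"].strip()

# e.g. suggest_search_keywords("a forum where people post hand-checked complex-number arithmetic")
```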
There is (going to be) a search engine involving GPT-3, according to twitter, though it's GPT-3 helping you find things instead of someone finding GPT-3's sources.
I'm suggesting something a little more complex than copying. GPT-3 can give you a random remix of several different clichés found on the Internet, and the patchwork isn't necessarily at the surface level where it would come up in a search. Readers can be inspired by evocative nonsense. A new form of randomness can be part of a creative process. It's a generate-and-test algorithm where the user does some of the testing. Or, alternately, an exploration of Internet-adjacent story-space.
It's an unreliable narrator and I suspect it will be an unreliable search engine, but yeah, that too.
This was run on davinci
via the OpenAI API. First completion.
ML starts running factories, warehouses, shipping, and construction. ML assistants help write code and integrate ML into new domains. ML designers help build factories and the robots that go in them. ML finance systems invest in companies on the basis of complicated forecasts and (ML-generated) audits. Tons of new factories, warehouses, power plants, trucks and roads are being built. Things are happening quickly, investors have super strong FOMO, no one really knows whether it’s a bubble but they can tell that e.g. huge solar farms are getting built and something is happening that they want a piece of. Defense contractors are using ML systems to design new drones, and ML is helping the DoD decide what to buy and how to deploy it. The expectation is that automated systems will manage drones during high-speed ML-on-ML conflicts because humans won’t be able to understand what’s going on. ML systems are designing new ML systems, testing variations, commissioning giant clusters. The financing is coming from automated systems, the clusters are built by robots. A new generation of fabs is being built with unprecedented speed using new automation.
At this point everything kind of makes sense to humans. It feels like we are living at the most exciting time in history. People are making tons of money. The US defense establishment is scared because it has no idea what a war is going to look like right now, but in terms of policy their top priority is making sure the boom proceeds as quickly in the US as it does in China because it now seems plausible that being even a few years behind would result in national irrelevance.
Things are moving very quickly and getting increasingly hard for humans to evaluate. ????
Discussion:
(I’m not sure I’m totally on board with this, but it’s a fun thought experiment.)
One of the things that I find fascinating about this thought experiment is that it’s basically the story of the internet. It’s a story about how humans build something, and then it takes on a life of its own. This is a pretty common narrative in technology, but it’s weird because it happens without anyone explicitly trying to build something that will get out of control. You don’t actually need to program a self-replicating ML system to make it happen. You just need to build something that looks like it might be self-replicating.
You can think of an ML system as a super-powerful AI that is really good at building things. It can do a lot of things humans can’t, but it can’t think about how its creations will be used. This is the same way that humans can build nuclear weapons, but we can’t predict the long-term consequences of our actions. We have no idea what the long-term consequences of an ML system building more ML systems will be.
There are lots of potential paths forward for this thought experiment. I think we can basically divide them into two categories:
1. The “ML is the new electricity” path.
If you were a super-intelligent alien and you landed on Earth in the 19th century, you would be pretty confused about what electricity was. You’d probably be pretty confused about what was going on in general. You’d probably figure out that humans are pretty smart, but you’d be kind of puzzled about why they were so obsessed with their new electricity thing. You’d wonder why they were building so many electric trains. You’d wonder why they were building factories. You’d wonder why they were building all these new cities. You’d wonder why they were fighting so many wars.
This is basically the path we’re on.
In this path, we’re building lots of new infrastructure and new things, and we’re getting better at making all sorts of stuff really fast. We’re just using electricity as a cheap, convenient way to get work done.
This path will basically continue until we run out of things to build. Then it will probably end in a lot of wars, but we don’t know exactly how that will happen.
2. The “ML is the new solar panels” path.
In this path, we basically stop building new stuff, and we start focusing on maximizing the efficiency of the things we’ve already built.
The main thing that humans are doing in this path is running simulations. We run simulations of the economy and the solar system and the galaxy. We run simulations of ML systems running simulations of the economy and the solar system and the galaxy. We figure out what’s going on in the simulations and how to optimize them.
I think this path is pretty likely.
In this path, we will run out of things to build, but we will be able to build more things out of the things we’ve already built. We will be able to build more factories, more solar panels, more computers, more software, more cloud services, more networking, more wireless connectivity, more energy transmission, more roads, more trucks, more drones, more factories, more warehouses, more shipping, more construction, more factories, more drones, more trucks, more drones, more shipping, more drones, more trucks, more drones, more construction, more factories, more solar panels, more software, more networks, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more drones, more
I used nucleus sampling with p=.76, but apparently that was insufficient to prevent degeneration.
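For reference, a completion like the one above could be requested roughly as follows with the pre-1.0 openai Python client; the prompt string and max_tokens are placeholders, and top_p=0.76 is the nucleus-sampling value noted here:

```python
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment

prompt = "..."  # placeholder: the story prompt that precedes the completion above

response = openai.Completion.create(
    engine="davinci",  # base GPT-3 model used for these API completions
    prompt=prompt,
    max_tokens=1024,   # placeholder length
    top_p=0.76,        # nucleus sampling; evidently not enough to stop the repetition
)
print(response["choices"][0]["text"])
```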
You are a BuzzFeed editor. Your job is to sensationalize titles about AI alignment research.
Original: "Yann LeCun and Stuart Russell debate whether AI will take power"
Sensationalized: "You won't BELIEVE Yann LeCun's hot take on AI power-seeking!"
Original: "Ten ways AI Safety via Debate could work"
Sensationalized: "10 ways AI Debate might work - #3 is CRAZY"
Original: "Rethinking mesa-optimization and risks from learned optimization"
Sensationalized: "AI's learned optimization makes it TOO smart - learn the risks"
The problem is not that BuzzFeed publishes sensationalized titles. The problem is that the internet rewards sensationalized titles. The more sensationalized a title, the more people click on it. The more people click on it, the higher up on search results it goes.
[...] Original: "Rethinking mesa-optimization and risks from learned optimization"
Sensationalized: "Mesaro-what? AI learned optimizers might be endangering us!
The AI Alignment Forum [it knows about AF?!] was aware of this risk and had been trying to minimize sensationalist coverage through offering media passes only to those they believed would produce accurate and balanced reporting. Unfortunately, some of those media organizations were the ones that ended up closing the most eyes to the public. "The New York Times" was one such prominent example. In a 2018 article "Expert Calls For Calm In The Debate Over Artificial Intelligence," "The New York Times" failed to cover any of the substance of the AAAI letter.
Even worse, they used the word "alarmist" and shortly after the article was published, the paper closed its eyes.
The media blackout led to a rapid increase in preventable AI accidents, ultimately making human extinction by superintelligence more likely.
With more bad news piling on, weary humans stopped paying attention to AI safety issues.
A professional journalist wrote an influential article that detailed why the public shouldn't trust AI experts. In the article he mentions the letter to the AAAI and says that it "makes some good points."
The incident led to a loss of trust by the public in regards to AI safety research and AI safety experts.
[...] Mesa-optimization? 13 more papers like this!
"Your job is to sensationalize titles about AI risks", the editor tells you. He hands you a list of popular titles in AI alignment research.
In your head, you've already sensationalized these stories...
"Is it possible to buildFriendly Artificial Intelligence?"
Sensationalized: "Elon's new AI baby will inherit the world! Will it grow up to be a killer robot? Or just a really smart virtual assistant?"
"Are nice guys nice?"
Sensationalized: "Nice guys finish last: new study shows that gentlemen are the worst of all!"
"Should we build AI to be nice?"
Sensationalized: "Elon says 'Nice AI' is a bad idea! Will robot overlords enslave humanity? Tech mogul sparks debate"
[...] Original: "Rethinking mesa-optimization and risks from learned optimization"
Sensationalized: "WHOA! Learned Optimization is DANGEROUS! (New research from AlphaStar game)
[...] Original: "Rethinking mesa-optimization and risks from learned optimization"
Sensationalized: "You won't BELIEVE learning is a source of optimization!
GPT-3 is captivating, and not just because of its potential reasoning abilities. This post will be a living collection of my favorite experiences with the network.
Bold text is my input; square brackets contain my commentary. As of 8/19/20, AIDungeon's Custom mode's first completion is sampled from GPT-2. Therefore, initial replies are not from GPT-3, except in the following subsections (in which I bypassed the mechanism):
As of 4/21/21, I have API access; model completions come from the davinci model - not AIDungeon.
Long-form Writing
Beisutsukai Class Project: Solve Intent Alignment
I prompted GPT-3 with modified versions of Eliezer's Beisutsukai (ベイズ使い; Japanese: "Bayes-user") stories (prompt available here), where the class project now involves solving intent alignment instead of quantum gravity. In my opinion, some of its proposed alignment solutions are higher-quality than most solutions proposed by newbie LessWrongers back in the day.
I did a few rerolls and some trimming of intermediate outputs; assume an average of 3 rerolls per completion.
Branch 1
Is GPT-3 providing a reasoned rebuttal, or does it "just happen" to sound snappy? Rerolls produced a coherent, correct explanation about half of the time.
Reroll A
Reroll B
Reroll C
Branch 2
Branch 3
Bonus
The Sequences post you've never read
I provided a context containing two real Sequences posts. No rerolls.
1984
Elrond's Sacred Hot Spring
I've gathered many beautiful, Tolkienesque GPT-3 passages. This one... is a little different.
Concerning the Two Trees
Lórien, Unstained
Bilbo the Deceiver
Gildor Inglorion's Fountain
Shrek
Strangely, the context for this was also two posts from the Sequences. I was trying to get GPT-3 Eliezer to write about why Shrek is crucial to rationality, and, well, this came out instead. Not complaining.
GPT-3 Visits MIRI
GPT-3 Solves Alignment
Bill's Dog
The Canine Betrayal
Non-Fiction Tasks
Another (outer) alignment failure story
This was run using davinci via the OpenAI API.
I used nucleus sampling with p=.76, but apparently that was insufficient to prevent degeneration.
Title Sensationalization
No rerolls, excepting the restarts marked by [...].
Prefix enumeration
Interview
Computer Science
Programming
As others have noted, GPT-3 can do some zero-shot programming.
Physics
Zero-shot physics reasoning?
Replication 1
Reroll 1:
For more, see To what extent is GPT-3 capable of reasoning?.
Machine Learning
Math
The last line is the impressive part: while (5+i)(5−i)=26 and (15i)(−4i)=60 are correct, so is the harder product (15i)(16i)=−240.
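Spelled out, with i² = −1:

$$(5+i)(5-i) = 25 - i^2 = 26, \qquad (15i)(-4i) = -60\,i^2 = 60, \qquad (15i)(16i) = 240\,i^2 = -240.$$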
Its proofs are definitely awful.
Martial Arts