This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialogue, Eliezer explores and counters common questions about the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.
MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."
This post is written for a somewhat more general audience than the modal LessWrong reader, but it gets at my actual thoughts on the topic.
In 2018 OpenAI defeated the world champions of Dota 2, a major esports game. This was hot on the heels of DeepMind’s AlphaGo performance against Lee Sedol in 2016, achieving superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like TensorFlow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.
Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds...
buildings are not typically built by arsonists
Vernor Vinge is a legendary and recently deceased sci-fi author. I’ve just finished listening to the first two books in the Zones of Thought trilogy. Both books are entertaining and culturally influential. The audio versions are high-quality.
A Deepness in the Sky is about two spacefaring human civilizations with clashing societal structures converging on a mysterious solar system where a third, alien spider civilization is undergoing an industrial revolution. This combination of high-tech spacefaring and low-tech steampunk or fantasy shows up in both books and is one of the most compelling and unique parts of Vinge’s work. It allows him to focus on depicting rapid technological change. This is surprisingly rare in sci-fi which mostly takes the Star Wars or Trek route of depicting advanced but static technology.
Vinge...
This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.
This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal.
We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.
Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."
We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
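To make the described intervention concrete, here is a minimal sketch, assuming hypothetical cached activations rather than a real model; the tensor names, sizes, and use of random data are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): given residual-stream activations
# collected on harmful vs. harmless prompts, estimate a "refusal direction" as a
# difference of means, then project it out ("directional ablation") or add it back in.
import torch

torch.manual_seed(0)
d_model = 512
# Hypothetical cached activations at one layer/position: [n_prompts, d_model].
# Random tensors stand in for real model activations here.
harmful_acts = torch.randn(100, d_model) + 0.5
harmless_acts = torch.randn(100, d_model)

# Difference-in-means direction, normalized to a unit vector.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_direction(x: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each row of x along `direction`."""
    coeffs = x @ direction                      # [n]
    return x - coeffs[:, None] * direction[None, :]

def add_direction(x: torch.Tensor, direction: torch.Tensor, scale: float) -> torch.Tensor:
    """Add `scale` times `direction` to each row of x (activation addition)."""
    return x + scale * direction[None, :]

# After ablation, activations have ~zero component along the refusal direction.
ablated = ablate_direction(harmful_acts, refusal_dir)
print((ablated @ refusal_dir).abs().max())      # ~0 up to float error
```

In the real setting these operations would be applied to the model's residual stream at inference time (e.g. via forward hooks) rather than to standalone tensors.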
...New tricks and tools can make cognitively demanding tasks less painful.
As a college student two decades ago, Andrew Westbrook struggled to stay focused in class. He certainly had the capacity to concentrate; he could do it intensely when he really loved a topic. “When you’re totally engrossed in a good book, thinking can feel effortless, or even magnetic,” he says. But when it came to his other work, thinking was difficult or even painful. “I would fail an exam and then face the shame and guilt that I felt about that, especially since I knew I could do better,” he recalls.
Westbrook achieved academic success in the end, becoming a neuroscientist. Today he leads a laboratory at Rutgers University, where his research is helping to upend long-held beliefs
This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.
You have the excellent fortune to live under the governance of The People's Glorious Free Democratic Republic of Earth, giving you a Glorious life of Freedom and Democracy.
Sadly, your cherished values of Democracy and Freedom are under attack by...THE ALIEN MENACE!
Faced with the desperate need to defend Freedom and Democracy from The Alien Menace, The People's Glorious Free Democratic Republic of Earth has been forced to redirect most of its resources into the Glorious Free People's Democratic War...
Thanks for running this when mine was going to be late, and thanks for checking with me beforehand.
(Also, thanks for the scenario, like, in general: it looks like a fun one!)
If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running on the order of millions of concurrent instances of the model.
Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.
I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).
I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.
Due to practical reasons, the compute requirements for training LLMs...
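As a back-of-envelope illustration of the "millions of concurrent instances" claim, here is a sketch with made-up but representative numbers; the model size, token count, run length, and generation speed below are my assumptions rather than figures from the post, and training/inference utilization differences are ignored.

```python
# Rough estimate: how many copies of a model could the hardware that trained it run?
n_params = 1e12          # assumed model size: 1T parameters
train_tokens = 2e13      # assumed training data: 20T tokens
train_days = 90          # assumed length of the training run

train_flops = 6 * n_params * train_tokens                  # standard ~6ND estimate
cluster_flops_per_sec = train_flops / (train_days * 86400)

inference_flops_per_token = 2 * n_params                    # ~2N FLOPs per token
tokens_per_sec_per_instance = 10                            # assumed generation speed

flops_per_instance = inference_flops_per_token * tokens_per_sec_per_instance
concurrent_instances = cluster_flops_per_sec / flops_per_instance

print(f"{concurrent_instances:,.0f} concurrent instances")  # ~770,000 with these numbers
```

With these particular assumptions the answer lands just under a million; the point is only that the order of magnitude is large, not the exact figure.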
On the other hand, the world already contains over 8 billion human intelligences. So I think you are assuming that a few million AGIs, possibly running at several times human speed (and able to work 24/7, exchange information electronically, etc.), will be able to significantly "outcompete" (in some fashion) 8 billion humans? This seems worth further exploration / justification.
Good point, but a couple of thoughts:
I don't know if this is a well known argument that has already been responded to. If it is, just delete the post.
An implicit assumption most people make when discussing takeover risk is that any misaligned agent will hold off on enacting takeover plans that have less than a very high probability of success. If the agent has an immediate takeover plan with a 20% chance of success, but waiting a month would allow it to improve its position to the point where the plan it could enact then had a 99% chance of success, it would wait.
This assumption seems to break in multi-polar scenarios where you have many AI labs at similar levels of capability. As a toy example: if you have 20 AI labs, each having developed...
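To illustrate the shape of this argument (the post's own toy example is cut off above), here is a sketch with assumed numbers: waiting improves your own plan's success probability, but risks a rival lab's agent moving first.

```python
# Illustrative expected-value comparison; all numbers are assumptions for illustration.
n_other_labs = 19        # rival labs with comparably capable misaligned agents
p_rival_moves = 0.10     # assumed chance per month that a given rival agent attempts takeover
p_now = 0.20             # success probability of acting immediately
p_later = 0.99           # success probability after waiting one month

# Probability that no rival preempts you during the month you wait.
p_no_preemption = (1 - p_rival_moves) ** n_other_labs

value_act_now = p_now
value_wait = p_no_preemption * p_later

print(f"act now: {value_act_now:.2f}, wait: {value_wait:.2f}")
# With these numbers, acting now (0.20) beats waiting (~0.13),
# so the "wait for near-certainty" assumption breaks down.
```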
Epistemic Status: Musing and speculation, but I think there's a real thing here.
When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground.
Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder....
In less serious (but not fully unserious) citation of that particular site, it also contains an earlier depiction of literally pulling up ladders (as part of a comic based on treating LOTR as though it were a D&D campaign) that shows off what can sometimes result: a disruptive shock from the ones stuck on the lower side, in this case via a leap in technology level.
The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.
To...
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
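For context, the statement being referenced is, to my recollection of Watanabe's free-energy asymptotics (a sketch for orientation, not a claim about either commenter's position):

$$
F_n = n L_n(w_0) + \lambda \log n + O_p(\log \log n),
\qquad
\zeta(z) = \int_W K(w)^{z}\,\varphi(w)\,dw,
$$

where $K(w)$ is the KL divergence to the true distribution, $\varphi$ is the prior, and $-\lambda$ (the negative of the RLCT) is the largest pole of $\zeta$. The nonvanishing condition is, roughly, that $\varphi$ be continuous and strictly positive on the set $\{w : K(w) = 0\}$; any such prior yields the same $\lambda$.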
I don't think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised correctly if the parameter-function map were one-to-one.
It's a kind of prior we like to use a lot, but that doesn't make it a sane choice.
A well-normalised prior for a regular model probably doesn't look very continuous or dif...