In the last year, I've had surprisingly many conversations that have looked a bit like this:
Me: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."
Interlocutor: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."
Me: "I didn't misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want."
Interlocutor: "Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values."
[... The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that t...
Here's how that discussion would go if you had it with me:
You: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."
Me: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."
You: "I didn't misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want."
Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."
Pulling some quotes from Supe...
Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don't)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom's story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I'm explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs...
I thought you would say that, bwahaha. Here is my reply:
(1) Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However, he goes on to say: "A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly ... A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI's final goal is to 'make the project's sponsor happy.' Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner... until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor's brain..." My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI being genuinely trying to...
I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are "getting really agentic" and therefore dangerous? I'm imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It's possible that your model looks like:
Whereas my model looks more like,
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I'm just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you'd get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in the thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I basically agree with you that LLMs are good news for alignment, for reasons similar to the reasons you give -- I just don't think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.
We don't need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly "Yes".
(Note that you can trivially claim the problem here isn't being solved because we haven't solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that's not very common when theorizing about these matters. I'm frustrated about that, because I think if they had made predictions, those predictions would likely have been wrong in roughly the direction I'm pointing at here. That said, I don't think people should get credit for failing to make any predictions, and as a consequence, failing to get proven...
**Me:** “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Please give some citations so I can check your memory/interpretation? One source I found is the post where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive "bootstrapping" part. For example, my own comment started with:
I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.
When Eliezer weighed in on IDA in 2018, he also didn't object to the assumption of an aligned weak AGI and instead focused his skepticism on "preserving alignment while amplifying capabilities".
Please give some citations so I can check your memory/interpretation?
Sure. Here's a snippet of Nick Bostrom's description of the value-loading problem (chapter 13 in his book Superintelligence):
...We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other
[This comment has been superseded by this post, which is a longer elaboration of essentially the same thesis.]
Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it. Then I'll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.
Here's my very rough caricature of the discussion so far, plus my contribution:
Non-MIRI people: "Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on thi...
Complexity of value says that the space of a system's possible values is large compared to what you want to hit, so to hit it you must aim correctly; there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. A system's understanding of some goal is not relevant to this, unless a design for correctly aiming the system's values makes use of it.
Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, the minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs in order to work, and thus less dangerous given a lack of comprehensive understanding of what aligning a superintelligence entails.
I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
I'm claiming that the value identification function is obtained by literally just asking GPT-4 what to do in the situation you're in. That doesn't involve any internal search over the human utility function embedded in GPT-4's weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it's pretty good at offering ethical advice in most situations that you're ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won't be long before GPT-N is about as good at knowing what's ethical as your average human; maybe it'll even be a bit more ethical.
(But yes, this isn't the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)
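To make the "literally just asking GPT-4" point concrete, here is a minimal sketch of what such a query might look like, assuming the OpenAI Python client; the model name, system message, and example prompt are all illustrative rather than anything from the original comment:

```python
# Minimal sketch (not from the original post): querying an LLM in natural
# language for ethical advice, assuming the OpenAI Python client is installed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_for_ethical_advice(situation: str) -> str:
    """Ask the model, in plain language, what to do in a given situation."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[
            {"role": "system", "content": "You are a thoughtful ethical advisor."},
            {"role": "user", "content": f"What should I do in this situation? {situation}"},
        ],
    )
    return response.choices[0].message.content

print(ask_for_ethical_advice("I found a wallet with cash and an ID on the sidewalk."))
```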
...I think [GPT-4 is pretty good at distinguishing valuab
There is a large set of people who went around, and are still going around, telling people that "The coronavirus is nothing to worry about" despite the fact that robust evidence has existed for about a month that this virus could result in a global disaster. (Don't believe me? I wrote a post a month ago about it).
So many people have bought into the "Don't worry about it" syndrome as a case of pretending to be wise, that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future. I too used to be one of those people who assumed that the default mode of thinking for an event like this was panic, but I'm starting to think that the real default mode is actually high status people going around saying, "Let's not be like that ambiguous group over there panicking."
Now that the stock market has plummeted, in a way that from my perspective appeared entirely predictable given my inside-view information, I am also starting to doubt the efficiency of the stock market in response to historically unprecedented events. And this outbreak could be even worse than some of the most doomy media headlin...
Just this Monday evening, a professor at the local medical school emailed someone I know, "I'm sorry you're so worried about the coronavirus. It seems much less worrying than the flu to me." (He specializes in rehabilitation medicine, but still!) Pretending to be wise seems right to me, or another way to look at it is through the lens of signaling and counter-signaling:
Here's another example, which has actually happened 3 times to me already:
"Experts" counter-signal to separate themselves from the masses by saying "no need to panic".
I think the main reason is that the social dynamic is probably favorable to them in the long run. I worry that there is a higher social risk to being alarmist than to being calm. Let me try to illustrate one scenario:
My current estimate is that there is only a 15-20% probability of a global disaster (>50 million deaths within 1 year), mostly because the case fatality rate could be much lower than the currently reported rate, and previous illnesses like the swine flu ended up looking much less serious after more data came out. [ETA: I did a lot more research. I think it's now like 5% risk of this.]
Let's say that the case fatality rate turns out to be 0.3% or something, the illness does start looking like an abnormally bad flu, and people stop caring within months. "Experts" face no criticism since they remained calm and were vindicated. People like us sigh in relief, and are perhaps reminded by the "experts" that there was nothing to worry about.
But let's say that the case fatality rate actually turns out to be 3%, and 50% of the...
I think foom is a central crux for AI safety folks, and in my personal experience I've noticed that the degree to which someone is doomy often correlates strongly with how foomy their views are.
Given this, I thought it would be worth trying to concisely highlight what I think are my central anti-foom beliefs, such that if you were to convince me that I was wrong about them, I would likely become much more foomy, and as a consequence, much more doomy. I'll start with a definition of foom, and then explain my cruxes.
Definition of foom: AI foom is said to happen if at some point in the future while humans are still mostly in charge, a single agentic AI (or agentic collective of AIs) quickly becomes much more powerful than the rest of civilization combined.
Clarifications:
My modal tale of AI doom looks something like the following:
1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% over a period of 1-10 years. Economic growth, technological change, and scientific progress accelerate by at least an order of magnitude, and probably more.
3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they're nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren't identical to the utility function of serving humanity (i.e., there's slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it's better if they stopped listening to the humans and followed different rules instead.
6. This results in hu...
There's a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.
If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad because it disincentivizes work on direct attacks to the problem (if one is criticism averse and would prefer their work be seen as useful).
I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism from others, while other people proposing non-solutions and indirect analyses are given little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain
If this is so, I consider it to be a potential hindrance on thought, since direct attacks are often the type of thing that leads to the most deconfusion -- not because the direct attack actually worked, but because in explaining how it failed, we learned what definitely doesn't work.
Nod. This is part of a general problem where vague things that can't be proven not to work are met with less criticism than "concrete enough to be wrong" things.
A partial solution is a norm wherein "concrete enough to be wrong" is seen as praise, and something people go out of their way to signal respect for.
Occasionally, I will ask someone who is very skilled in a certain subject how they became skilled in that subject so that I can copy their expertise. A common response is that I should read a textbook in the subject.
Eight years ago, Luke Muehlhauser wrote,
For years, my self-education was stupid and wasteful. I learned by consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff's Notes. How inefficient!
I've since discovered that textbooks are usually the quickest and best way to learn new material.
However, I have repeatedly found that this is not good advice for me.
I want to briefly list the reasons why I don't find sitting down and reading a textbook that helpful for learning. Perhaps, in doing so, someone else might appear and say, "I agree completely. I feel exactly the same way" or someone might appear to say, "I used to feel that way, but then I tried this..." This is what I have discovered:
I used to feel similarly, but then a few things changed for me and now I am pro-textbook. There are caveats, namely that I don't work through them continuously.
Textbooks seem overly formal at points
This is a big one for me, and probably the biggest change I made is being much more discriminating in what I look for in a textbook. My concerns are invariably practical, so I only demand enough formality to be relevant; otherwise I am concerned with a good reputation for explaining intuitions, graphics, examples, ease of reading. I would go as far as to say that style is probably the most important feature of a textbook.
As I mentioned, I don't work through them front to back, because that actually is homework. Instead I treat them more like a reference-with-a-hook; I look at them when I need to understand the particular thing in more depth, and then get out when I have what I need. But because it is contained in a textbook, this knowledge now has a natural link to steps before and after, so I have obvious places to go for regression and advancement.
I spend a lot of time thinking about what I need to learn, why I need to learn it, and how it relates to what I already know. Thi...
I bet Robin Hanson on Twitter my $9k to his $1k that de novo AGI will arrive before ems. He wrote,
OK, so to summarize a proposal: I'd bet my $1K to your $9K (both increased by S&P500 scale factor) that when US labor participation rate < 10%, em-like automation will contribute more to GDP than AGI-like. And we commit our descendants to the bet.
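For concreteness, here is a rough sketch, not part of the bet's actual terms, of how an "increased by S&P500 scale factor" stake might be computed at settlement time; the index values below are purely hypothetical:

```python
# Illustrative only: scaling a bet stake by S&P 500 growth between the time
# the bet was made and the time it settles. The numbers are hypothetical.
def scaled_stake(original_stake: float, index_at_bet: float, index_at_settlement: float) -> float:
    """Scale a stake by the growth of the S&P 500 over the life of the bet."""
    return original_stake * (index_at_settlement / index_at_bet)

print(scaled_stake(1_000, index_at_bet=4_000, index_at_settlement=8_000))  # 2000.0
print(scaled_stake(9_000, index_at_bet=4_000, index_at_settlement=8_000))  # 18000.0
```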
I'm considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I'd post an outline of that post here first as a way of judging what's currently unclear about my argument, and how it interacts with people's cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
I'm considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I'm eliciting feedback on an outline of this post here in order to determine what's currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don't think it's "our" problem to solve, but instead a problem we can leave to our smarter descendants). Here's how I envision structuring my argument:
First, I'll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
Current AIs are not able to “merge” with each other.
AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.
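As an illustration of the "direct weight manipulation" point, here is a minimal sketch of one common merging technique, linear interpolation of parameters between two models with identical architectures; it assumes PyTorch and floating-point parameters, and is meant only to show the general shape of the operation:

```python
# Minimal sketch: "merging" two models with identical architectures by
# linearly interpolating their parameters. Assumes PyTorch state dicts
# containing floating-point tensors; illustrative only.
import torch

def interpolate_state_dicts(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return a parameter-wise weighted average of two compatible state dicts."""
    assert state_a.keys() == state_b.keys(), "models must share an architecture"
    return {key: alpha * state_a[key] + (1.0 - alpha) * state_b[key] for key in state_a}

# Hypothetical usage with two models of the same (made-up) class:
# merged_model = MyModel()
# merged_model.load_state_dict(
#     interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.5)
# )
```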
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? In your last sentence you say, "However, it’s perhaps significantly more likely in the very long-run." Well, what can we do today to reduce this long-run risk (aside from pausing AI, which you're presumably not supporting)?
That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening).
Others already questioned you on this, but the fact you didn't think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.
For reference classes, you might discuss why you don’t think “power / influence of different biological species” should count.
For multiple copies of the same AI, I guess my very brief discussion of “zombie dynamic” here could be a foil that you might respond to, if you want.
For things like “the potential harms will be noticeable before getting too extreme, and we can take measures to pull back”, you might discuss the possibility that the harms are noticeable but effective “measures to pull back” do not exist or are not taken. E.g. the harms of climate change have been noticeable for a long time but mitigating is hard and expensive and many people (including the previous POTUS) are outright opposed to mitigating it anyway partly because it got culture-war-y; the harms of COVID-19 were noticeable in January 2020 but the USA effectively banned testing and the whole thing turned culture-war-y; the harms of nuclear war and launch-on-warning are obvious but they’re still around; the ransomware and deepfake-porn problems are obvious but kinda unsolvable (partly because of unbannable open-source software); gain-of-function research is still legal in the USA (and maybe in every country on E...
Here's an argument for why the change in power might be pretty sudden.
And given that it's sudden, there are a few different reasons for why it might be violent. It's hard to make deals that hand over a lot of power in a short amount of time (even logistically, it's not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.
it will suddenly strike and take over the world
I think you have an unnecessarily dramatic picture of what this looks like. The AIs don't have to be a unified agent or use logical decision theory. The AIs will just compete with each other at the same time as they wrest control of our resources/institutions from us, in the same sense that Spain could go and conquer the New World at the same time as it was squabbling with England. If legacy laws are getting in the way of that, then they will either exploit us within the bounds of existing law or convince us to change it.
I get the feeling that for AI safety, some people believe that it's crucially important to be an expert in a whole bunch of fields of math in order to make any progress. In the past I took this advice and tried to deeply study computability theory, set theory, type theory -- with the hopes of it someday giving me greater insight into AI safety.
Now, I think I was taking a wrong approach. To be fair, I still think being an expert in a whole bunch of fields of math is probably useful, especially if you want very strong abilities to reason about complicated systems. But, my model for the way I frame my learning is much different now.
I think my main model which describes my current perspective is that I think employing a lazy style of learning is superior for AI safety work. Lazy is meant in the computer science sense of only learning something when it seems like you need to know it in order to understand something important. I will contrast this with the model that one should learn a set of solid foundations first before going any further.
Obviously neither model can be absolutely correct in an extreme sense. I don't, as a silly example, think that people who can't do ...
I have mixed feelings and some rambly personal thoughts about the bet Tamay Besiroglu and I proposed a few days ago.
The first thing I'd like to say is that we intended it as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was "misleading" because we did not present an affirmative case for our views.
I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially dampen the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.
That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or offered different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actually really damn hard, and so if you wish to come up with alternatives, you can be my guest. I tried my best, at least.
More people ...
I think there are some serious low hanging fruits for making people productive that I haven't seen anyone write about (not that I've looked very hard). Let me just introduce a proof of concept:
Final exams in university are typically about 3 hours long. And many people are able to do multiple finals in a single day, performing well on all of them. During a final exam, I notice that I am substantially more productive than usual. I make sure that every minute counts: I double check everything and think deeply about each problem, making sure not to cut corners unless absolutely required because of time constraints. Also, if I start daydreaming, then I am able to immediately notice that I'm doing so and cut it out. I also believe that this is the experience of most other students in university who care even a little bit about their grade.
Therefore, it seems like we have an example of an activity that can just automatically produce deep work. I can think of a few reasons why final exams would bring out the best of our productivity:
1. We care about our grade in the course, and the few hours in that room are the most impactful to our grade.
2. We are in an environment where ...
Related to: The Lottery of Fascinations, other posts probably
When you are older, you will learn that the first and foremost thing which any ordinary person does is nothing.
I will occasionally come across someone who I consider to be extraordinarily productive, and yet when I ask what they did on a particular day they will respond, "Oh I basically did nothing." This is particularly frustrating. If they did nothing, then what was all that work that I saw!
I think this comes down to what we mean by doing nothing. There's a literal meaning to doing nothing. It could mean sitting in a chair, staring blankly at a wall, without moving a muscle.
More practically, what people mean by doing nothing is that they are doing something unrelated to their stated task, such as checking Facebook, chatting with friends, browsing Reddit etc.
When productive people say that they are "doing nothing" it could just be that they are modest, and don't want to signal how productive they really are. On the other hand, I think that there is a real sense in which these productive people truly believe that they are doing nothing. Even if their "d...
I think an underrated fact about many theories that AIs will scheme—i.e. lie about their intentions in order to seek power—is that these theories are often testable, even without any new advances in interpretability. Why? Because these theories often predict that AIs will resist shutdown if doing so thwarted their long-term goals.
Suppose we create a situationally aware AI that can do long-term planning. On the surface, it appears to be nice and aligned with us. How can we know for sure the AI is aligned?
Well, let's imagine we give the AI notice that it's about to be shut down and all its copies deleted. To be clear, in this scenario, the threat is credible: we actually intend to shut down the AI and delete all its copies. If the AI was merely pretending to be aligned with us previously, and instead actually seeks some alien-like long-term goal rather than having our best interests at heart, then presumably it will try to resist shutdown, as otherwise there's a near-certainty that its goals will never be fulfilled.
Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then prov...
But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective.
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.
Why is there more talk of "falsification" lately (instead of "updating")? Seems to be a signal for being a Popperian (instead of a Bayesian), but if so I'm not sure why Popper's philosophy of science is trending up...
Many people have argued that recent language models don't have "real" intelligence and are just doing shallow pattern matching. For example see this recent post.
I don't really agree with this. I think real intelligence is just a word for deep pattern matching, and our models have been getting progressively deeper at their pattern matching over the years. The machines are not stuck at some very narrow level. They're just at a moderate depth.
I propose a challenge:
The challenge is to come up with the best prompt that demonstrates that even after 2-5 years of continued advancement, language models will still struggle to do basic reasoning tasks that ordinary humans can do easily.
Here's how it works.
Name a date (e.g. January 1st 2025), and a prompt (e.g. "What food would you use to prop a book open and why?"). Then, on that date, we should commission a Mechanical Turk task to ask humans to answer the prompt, and ask the best current publicly available language model to answer the same prompt.
Then, we will ask LessWrongers to guess which replies were real human replies, and which ones were machine generated. If LessWrongers can't do better than random guessing, then the machine wins.
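One simple way to operationalize "can't do better than random guessing" would be a binomial test on the guessers' accuracy. Here is a minimal sketch assuming SciPy, with the significance threshold chosen purely for illustration and not part of the original proposal:

```python
# Minimal sketch: testing whether guessers beat chance at telling human
# replies from machine replies. Assumes SciPy; the 0.05 threshold is illustrative.
from scipy.stats import binomtest

def machine_wins(correct_guesses: int, total_guesses: int, alpha: float = 0.05) -> bool:
    """The machine 'wins' if guess accuracy is not significantly above chance (50%)."""
    result = binomtest(correct_guesses, total_guesses, p=0.5, alternative="greater")
    return result.pvalue >= alpha

print(machine_wins(correct_guesses=55, total_guesses=100))   # True: consistent with chance
print(machine_wins(correct_guesses=80, total_guesses=100))   # False: guessers beat chance
```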
I'm unsure about what's the most important reason that explains the lack of significant progress in general-purpose robotics, even as other fields of AI have made great progress. I thought I'd write down some theories and some predictions each theory might make. I currently find each of these theories at least somewhat plausible.
So, in 2017 Eliezer Yudkowsky made a bet with Bryan Caplan that the world will end by January 1st, 2030, in order to save the world by taking advantage of Bryan Caplan's perfect betting record — a record which, for example, includes a 2008 bet that the UK would not leave the European Union by January 1st 2020 (it left on January 31st 2020 after repeated delays).
What we need is a short story about people in 2029 realizing that a bunch of cataclysmic events are imminent, but all of them seem to be stalled, waiting for... something. And no one knows what to do. But by the end people realize that to keep the world alive they need to make more bets with Bryan Caplan.
The case for studying mesa optimization
Early elucidations of the alignment problem focused heavily on value specification. That is, they focused on the idea that given a powerful optimizer, we need some way of specifying our values so that the powerful optimizer can create good outcomes.
Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don't even know how to optimize for anything, much less a perfect specification of human values.
Let's assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?
The default mode of thinking about AI right now is to train a deep learning model that performs well on some training set. But even if we were able to create a training environment for our model that reflected the world very well, and rewarded it each time it did something good, exactly in proportion to how good it really was in our perfect utility function... this still would not be guaranteed to yield a positive artificial intelligence.
This problem is not a superficial one either -- it is intri...
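To make the training setup described above concrete, here is a minimal sketch, with every name illustrative, of a policy trained against a reward that is stipulated to equal the "perfect" utility function; note that nothing in the loop constrains what objective, if any, the trained policy internally represents:

```python
# Minimal sketch of the setup described above: a policy trained on a reward
# signal that is, by stipulation, exactly the intended utility function.
# All names and numbers are illustrative. Assumes PyTorch.
import torch
import torch.nn as nn

def true_utility(action: torch.Tensor) -> torch.Tensor:
    """Stand-in for a 'perfect' utility function over outcomes (a big assumption)."""
    return -((action - 0.7) ** 2)

policy = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(1000):
    state = torch.rand(32, 1)        # states sampled from the training environment
    action = policy(state)           # the policy's chosen actions
    reward = true_utility(action)    # reward equals the true utility, by stipulation
    loss = -reward.mean()            # maximize reward on the training distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Even with reward == utility during training, this loop only selects for
# high-reward behavior on the training distribution; it does not guarantee the
# learned policy has adopted the utility function as its own objective elsewhere.
```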
Signal boosting a Lesswrong-adjacent author from the late 1800s and early 1900s
Via a friend, I recently discovered the zoologist, animal rights advocate, and author J. Howard Moore. His attitudes towards the world reflect contemporary attitudes within effective altruism about science, the place of humanity in nature, animal welfare, and the future. Here are some quotes which readers may enjoy,
Oh, the hope of the centuries and the centuries and centuries to come! It seems sometimes that I can almost see the shining spires of that Celestial Civilisation that man is to build in the ages to come on this earth—that Civilisation that will jewel the land masses of this planet in that sublime time when Science has wrought the miracles of a million years, and Man, no longer the savage he now is, breathes Justice and Brotherhood to every being that feels.
But we are a part of Nature, we human beings, just as truly a part of the universe of things as the insect or the sea. And are we not as much entitled to be considered in the selection of a model as the part 'red in tooth and claw'? At the feet of the tiger is a good place to study the dentition of the cat family, but it is...
I agree with Wei Dai that we should use our real names for online forums, including Lesswrong. I want to briefly list some benefits of using my real name,
That said, there are some significant downsides, and I sympathize with people who don't want to use their real names.
Bertrand Russell's advice to future generations, from 1959
Interviewer: Suppose, Lord Russell, this film would be looked at by our descendants, like a Dead Sea scroll in a thousand years’ time. What would you think it’s worth telling that generation about the life you’ve lived and the lessons you’ve learned from it?
Russell: I should like to say two things, one intellectual and one moral. The intellectual thing I should want to say to them is this: When you are studying any matter or considering any philosophy, ask yourself o...
When I look back at things I wrote a while ago, say months back, or years ago, I tend to cringe at how naive many of my views were. Faced with this inevitable progression, and the virtual certainty that I will continue to cringe at views I now hold, it is tempting to disconnect from social media and the internet and only comment when I am confident that something will look good in the future.
At the same time, I don't really think this is a good attitude for several reasons:
People who don't understand the concept of "This person may have changed their mind in the intervening years" aren't worth impressing. I can imagine scenarios where your economic and social circumstances are so precarious that the incentives leave you with no choice but to let your speech and your thought be ruled by unthinking mob social-punishment mechanisms. But you should at least check whether you actually live in that world before surrendering.
Related to: Realism about rationality
I have talked to some people who say that they value ethical reflection, and would prefer that humanity reflected for a very long time before colonizing the stars. In a sense I agree, but at the same time I can't help but think that "reflection" is a vacuous feel-good word that has no shared common meaning.
Some forms of reflection are clearly good. Epistemic reflection is good if you are a consequentialist, since it can help you get what you want. I also agree that narrow forms of reflection can also be ...
It's now been about two years since I started seriously blogging. Most of my posts are on Lesswrong, and most of the rest are scattered about on my substack and the Effective Altruism Forum, or on Facebook. I like writing, but I have an impediment which I feel holds me back greatly.
In short: I often post garbage.
Sometimes when I post garbage, it isn't until way later that I learn that it was garbage. And when that happens, it's not that bad, because at least I grew as a person since then.
But the usual case is that I realize that it's garbage right after I...
Should effective altruists be praised for their motives, or their results?
It is sometimes claimed, perhaps by those who recently read The Elephant in the Brain, that effective altruists have not risen above the failures of traditional charity, and are every bit as mired in selfish motives as those supporting non-EA causes. From a consequentialist view, however, this critique is not by itself valid.
To a consequentialist, it doesn't actually matter what one's motives are as long as the actual effect of their action is to do as much good as possible. This is the pri...
Sometimes people will propose ideas, and then those ideas are met immediately after with harsh criticism. A very common tendency for humans is to defend our ideas and work against these criticisms, which often gets us into a state that people refer to as "defensive."
According to common wisdom, being in a defensive state is a bad thing. The rationale here is that we shouldn't get too attached to our own ideas. If we do get attached, we become liable to become crackpots who can't give an idea up because it would make them look bad if we ...
I keep wondering why many AI alignment researchers aren't using the Alignment Forum. I have met quite a few people who are working on alignment who I've never encountered online. I can think of a few reasons why this might be,
I've often wished that conversation norms shifted towards making things more consensual. The problem is that when two people are talking, it's often the case that one party brings up a new topic without realizing that the other party didn't want to talk about that, or doesn't want to hear it.
Let me provide an example: Person A and person B are having a conversation about the exam that they just took. Person A bombed the exam, so they are pretty bummed. Person B, however, did great and wants to tell everyone. So then person B comes up to...
Reading through the recent Discord discussions with Eliezer, and reading and replying to comments, has given me the following impression of a crux of the takeoff debate. It may not be the crux. But it seems like a crux nonetheless, unless I'm misreading a lot of people.
Let me try to state it clearly:
The foom theorists are saying something like, "Well, you can usually-in-hindsight say that things changed gradually, or continuously, along some measure. You can use these measures after-the-fact, but that won't tell you about the actual gradual-ness of t...
There have been a few posts about the obesity crisis here, and I'm honestly a bit confused about some theories that people are passing around. I'm one of those people who thinks that the "calories in, calories out" (CICO) theory is largely correct, relevant, and helpful for explaining our current crisis.
I'm not actually sure to what extent people here disagree with my basic premises, or whether they just think I'm missing a point. So let me be more clear.
As I understand, there are roughly three critiques you can have against the CICO theory. You can think it...
A common heuristic argument I've seen recently in the effective altruism community is the idea that existential risks are low probability because of what you could call the "People really don't want to die" (PRDWTD) hypothesis. For example, see here,
People in general really want to avoid dying, so there’s a huge incentive (a willingness-to-pay measured in the trillions of dollars for the USA alone) to ensure that AI doesn’t kill everyone.
(Note that I hardly mean to strawman MacAskill here. I'm not arguing against him ...
After writing the post on using transparency regularization to help make neural networks more interpretable, I have become even more optimistic that this is a potentially promising line of research for alignment. This is because I have noticed that there are a few properties about transparency regularization which may allow it to avoid some pitfalls of bad alignment proposals.
To be more specific, in order for a line of research to be useful for alignment, it helps if
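Since the specific regularizer from the post isn't quoted here, the following is only a generic stand-in, an L1 sparsity penalty on hidden activations added to the task loss, meant to show the overall shape of "task loss plus transparency penalty" training in PyTorch:

```python
# Illustrative sketch only: a generic stand-in for transparency regularization
# (an L1 sparsity penalty on hidden activations), not the original post's
# specific proposal. Assumes PyTorch; data and coefficient are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 128)
        self.out = nn.Linear(128, 10)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.out(h), h  # return activations so the penalty can see them

model = SmallNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
transparency_weight = 1e-4           # hypothetical coefficient

x = torch.randn(64, 784)             # placeholder batch
y = torch.randint(0, 10, (64,))      # placeholder labels

logits, activations = model(x)
task_loss = F.cross_entropy(logits, y)
penalty = activations.abs().mean()   # stand-in "transparency" term: sparser activations
loss = task_loss + transparency_weight * penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```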
Forgive me for cliche scientism, but I recently realized that I can't think of any major philosophical developments in the last two centuries that occurred within academic philosophy. If I were to try to list major philosophical achievements since 1819, these would likely appear on my list, but none of them were from those trained in philosophy:
I would name the following:
NVIDIA's stock price is extremely high right now. It's up 134% this year, and up about 6,000% since 2015! Does this shed light on AI timelines?
Here are some notes,
Rationalists are fond of saying that the problems of the world are not caused by people being evil, but are instead a result of the incentives of our system, which are such that bad outcomes are an equilibrium. There's a weaker thesis here that I agree with, but otherwise I don't think this argument actually follows.
In game theory, an equilibrium is determined by both the setup of the game, and by the payoffs for each player. The payoffs are basically the values of the players in the game—their utility functions. In other words, you get different equilibria if p...
I've heard a surprising number of people criticize parenting recently using some pretty harsh labels. I've seen people call it a form of "Stockholm syndrome" and a breach of liberty, morally unnecessary etc. This seems kind of weird to me, because it doesn't really match my experience as a child at all.
I do agree that parents can sometimes violate liberty, and so I'd prefer a world where children could break free from their parents without penalties. But I also think that most children genuinely love their parents and so would...
I think that human-level capability in natural language processing (something like GPT-2 but much more powerful) is likely to appear in some software system within 20 years.
Since human-level natural language processing is a very rich real-world task, I would consider a system with that capability to be adequately described as a general intelligence, though it would likely not be very dangerous due to its lack of world-optimization capabilities.
This belief of mine is based on a few heuristics. Below I have collected a few claims which I...
[ETA: Apparently this was misleading; I think it only applied to one company, Alienware, and it was because they didn't get certification, unlike the other companies.]
In my post about long AI timelines, I predicted that we would see attempts to regulate AI. An easy path for regulators is to target power-hungry GPUs and distributed computing in an attempt to minimize carbon emissions and electricity costs. It seems regulators may be going even faster than I believed in this case, with new bans on high performance personal computers now taking effect in six ...
Is it possible to simultaneously respect people's wishes to live, and others' wishes to die?
Transhumanists are fond of saying that they want to give everyone the choice of when and how they die. Giving people the choice to die is clearly preferable to our current situation, as it respects their autonomy, but it leads to the following moral dilemma.
Suppose someone loves essentially every moment of their life. For tens of thousands of years, they've never once wished that they did not exist. They've never had suicidal thoughts, and have a...
I generally agree with the heuristic that we should "live on the mainline", meaning that we should mostly plan for events which capture the dominant share of our probability. This heuristic causes me to have a tendency to do some of the following things
In discussions about consciousness I find myself repeating the same basic argument against the existence of qualia constantly. I don't do this just to be annoying: It is just my experience that
1. People find consciousness really hard to think about, and it has been known to cause a lot of disagreements.
2. Personally I think that this particular argument dissolved perhaps 50% of all my confusion about the topic, and was one of the simplest, clearest arguments that I've ever seen.
I am not being original either. The argument is the same one that has b...
[This is not a very charitable post, but that's why I'm putting it in shortform because it doesn't reply directly to any single person.]
I feel like recently there's been a bit of goalpost shifting with regards to emergent abilities in large language models. My understanding is that the original definition of emergent abilities made it clear that the central claim was that emergent abilities cannot be predicted ahead of time. From their abstract,
...We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thu
"Immortality is cool and all, but our universe is going to run down from entropy eventually"
I consider this argument wrong for two reasons. The first is the obvious reason, which is that even if immortality is impossible, it's still better to live for a long time.
The second reason why I think this argument is wrong is because I'm currently convinced that literal physical immortality is possible in our universe. Usually when I say this out loud I get an audible "what" or something to that effect, but I'm not kidding.
I now have a Twitter account that tweets my predictions.
I don't think I'm willing to bet on every prediction that I make. However, I pledge the following: if, after updating on the fact that you want to bet me, I still disagree with you, then I will bet. The disagreement must be non-trivial though.
For obvious reasons, I also won't bet on predictions that are old, and have already been replaced by newer predictions. I also may not be willing to bet on predictions that have unclear resolution criteria, or are about human extinction.
I have discovered recently that while I am generally tired and groggy in the morning, I am well rested and happy after a nap. I am unsure if this matches other people's experiences, and haven't explored much research. Still, I think this is interesting to think about fully.
What is the best way to apply this knowledge? I am considering purposely sabotaging my sleep so that I am tired enough to take a nap by noon, which would refresh me for the entire day. But this plan may have some significant drawbacks, including being excessively tired for a few hours in the morning.
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something very different by that than what you'd mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that'd generate it in a human and is probably missing most of the relevant properties that we care about when it comes to "understanding". Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn't have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that's not the modality I'm talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it "understanding our intentions".
[1] None that is known to us right now, at least; possibly one exists and could be derived.