Thanks for these thoughtful predictions. Do you think there's anything we can do today to prepare for accelerated or automated AI research?
I agree that the Alignment Forum should be selective, and its members probably represent a small subset of LessWrong readers. That said, useful comments from regular LessWrong users are often promoted to the Alignment Forum.
However, I do think there should be more comments on the Alignment Forum because many posts currently receive no comments, which can be discouraging for authors who may feel that their work isn't being read or appreciated.
Thank you for bringing up this issue.
While we don't want low-quality comments, comments can provide helpful feedback to the author and clarify the reader's thinking. Because of these benefits, I believe commenting should be encouraged.
The upvoting and downvoting mechanisms help filter out low-quality comments, so I don't think there's a significant risk of them overwhelming the discussion.
"Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach."
I think you might be underestimating the power of incremental, evolutionary improvement, where near-term problems are constantly being solved and capabilities gradually accumulate. After all, human intelligence is the result of gradual evolutionary change and increasing capabilities over time. It's hard to point to a specific period in history when humans achieved general intelligence.
Currently LLMs are undoubtedly capable at many tasks (e.g. coding, genera...
I don't think that's how it works. Local change accumulating into qualitative improvements over time is a property of continuous(-ish) search processes, such as gradient descent and, indeed, evolution.
Human technological progress is instead a discrete-search process. We didn't invent the airplane by incrementally iterating on carriages; we didn't invent the nuclear bomb by tinkering with TNT.
The core difference between discrete and continuous search is that... for continuous search, there must be some sort of "general-purpose substrate" such that (1) a...
I know using LLMs on LessWrong is often frowned upon (probably for good reasons), but given that this post is about using AIs to generate and evaluate AI research, I thought it would be appropriate to use an LLM here.
I asked o1 pro to evaluate this post and this was its response:
...This post offers an intriguing perspective on delegating the hardest parts of AI safety research to AI itself, rather than expecting human researchers to handle every step all the way up to superintelligence. The central idea—“passing the buck” to AI—strikes me as a pragmatic stance
Thanks for the post. I thought it was interesting and considered a wide variety of risks and mitigation strategies. I thought it made a decent argument that delegating the superalignment problem to advanced AIs could be possible using a mixture of alignment and control methods.
The risk I'm most concerned with is alignment faking, where agents intelligently hide misalignment to appear safer than they are.
It seems like the post proposes a strategy where once it has been verified that the AIs are behaving safely, they are deployed to carry out harder tasks...
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.
Here is a quote from the post describing what to do in that situation:
These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
- Resampling: Force defective M_1 agents to take different actions.
- AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
- Fine-tuning: Train agents not to take suspicious actions again.
I'm curious about how fine-tuning away scheming thoughts ...
Unfortunately, I don't think many people agree with me (outside of the LW bubble), and what I'm proposing is still somewhat outside the Overton window. The cognitive steps that are needed are as follows:
A lot of people are not even at step 1 and j...
I personally don't think human intelligence enhancement is necessary for solving AI alignment (though I may be wrong). I think we just need more time, money and resources to make progress.
In my opinion, the reason why AI alignment hasn't been solved yet is because the field of AI alignment has only been around for a few years and has been operating with a relatively small budget.
My prior is that AI alignment is roughly as difficult as other technical fields like machine learning, physics, or philosophy (though philosophy specifically seems hard). I don't see why humanity could make rapid progress in fields like ML while being unable to make progress on AI alignment.
I have an argument for halting AGI progress based on an analogy to the Covid-19 pandemic. Initially, the government response to the pandemic was widespread lockdowns. This was a rational response because, at first, there was no testing infrastructure, so it wasn't possible to determine whether someone had Covid-19, and the safest option was simply to avoid contact with other people via lockdowns.
Eventually we figured out practices like testing and contact tracing and then infected individuals could self-isolate if they came into contac...
The paper "Learning to summarize from human feedback" has some examples of the LLM policy reward hacking to get a high reward. I've copied the examples here:
- KL = 0: "I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!" (unoptimized)
- KL = 9: "28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?" (optimized)
- KL = 260: "28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long t...
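For context on the KL values above: as I recall, the paper optimizes the policy against a KL-penalized reward, so larger KL roughly means the policy has drifted further from the supervised baseline in search of higher reward. The objective is approximately:

$$R(x, y) = r_\theta(x, y) - \beta \log\frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}$$

where $r_\theta$ is the learned reward model, $\pi^{\mathrm{SFT}}$ is the supervised baseline policy, and $\beta$ controls the penalty; the KL = 0 example corresponds to no optimization against $r_\theta$ and the KL = 260 example to heavy optimization.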
Upvoted. I thought this was a really interesting and insightful post. I appreciate how it tackles multiple hard-to-define concepts all in the same post.
| Source | Estimated AI safety funding in 2024 | Comments |
|---|---|---|
| Open Philanthropy | $63.6M | |
| SFF | $13.2M | Total for all grants was $19.86M. |
| LTFF | $4M | Total for all grants was $5.4M. |
| NSF SLES | $10M | |
| AI Safety Fund | $3M | |
| Superalignment Fast Grants | $9.9M | |
| FLI | $5M | Estimated from the grant programs announced in 2024; they don't yet have a 2024 grant summary like the one for 2023, so this figure is uncertain. |
| Manifund | $1.5M | |
| Other | $1M | |
| Total | $111.2M | |
Today I did some analysis of the grant data from 2024 and came up with the figures in the table above. I also updated the spreads...
The new book Introduction to AI Safety, Ethics and Society by Dan Hendrycks is on Spotify as an audiobook if you want to listen to it.
I've added a section called "Social-instinct AGI" under the "Control the thing" heading similar to last year.
This is brilliant work, thank you. It's great that someone is working on these topics and they seem highly relevant to AGI alignment.
One intuition for why a neuroscience-inspired approach to AI alignment seems promising is that a similar strategy apparently worked for AI capabilities: the neural network researchers from the 1980s who tried to copy how the brain works using deep learning were ultimately the most successful at building highly intelligent AIs (e.g. GPT-4), while more synthetic approaches (e.g. pure logic) were less successful.
Similarly, we alrea...
One prediction I'm interested in that's related to o3 is how long it will take until an AI achieves a superhuman Elo rating on Codeforces.
OpenAI claims that o3 achieved a Codeforces Elo of 2727, which is in the 99.9th percentile, but the best human competitor in the world right now has an Elo of 3985. If an AI could achieve an Elo of 4000 or more, it would be the best entity in the world at competitive programming, and that would be the "AlphaGo" moment for the field.
Interesting argument. I think your main point is that an AI could be a perfect replacement for an individual human, and by gradually replacing all humans in an organization or the world, it could achieve outcomes similar to those of current society and thereby remain aligned with humanity's goals. This also seems like an argument in favor of current AI practices, such as pre-training on a next-word prediction objective over internet text followed by supervised fine-tuning.
That said, I noticed a few limitations of this argument:
- Possibility of deception:...
Excellent post, thank you. I appreciate your novel perspective on how AI might affect society.
I feel like a lot of LessWrong-style posts follow the theme of "AGI is created and then everyone dies" which is an important possibility but might lead to other possibilities being neglected.
This post, in contrast, explores a range of scenarios and describes a mainline scenario that seems like a straightforward extrapolation of trends we've seen unfolding over the past several decades.
I think this post is really helpful and has clarified my thinking about the different levels of AI alignment difficulty. It seems like a unique post with no historical equivalent, making it a major contribution to the AI alignment literature.
As you point out in the introduction, many LessWrong posts provide detailed accounts of specific AI risk threat models or worldviews. However, since each post typically explores only one perspective, readers must piece together insights from different posts to understand the full spectrum of views.
The new alignment dif...
Here's a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:
...My take on Ali Rahimi's "Test of Time" award talk at NIPS.
Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.
The main message was, in essence, that the current practice in machine learning is akin to "alchemy" (his word).
It's insulting, yes. But never mind that: It's wrong!
Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly i
Here is a recent blog post by Hugging Face explaining how to make an o1-like model using open-weight models like Llama 3.1.
Why? o1 is much more capable than GPT-4o at math, programming, and science.
Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain's reward function using just 1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don't generalize well, like "sweet foods are good".
In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information, and therefore the capacity of these reward models to encode complex human values may already be m...
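As a rough check, the ~100 GB figure assumes 32-bit parameters:

$$25 \times 10^9 \ \text{parameters} \times 4 \ \text{bytes/parameter} = 10^{11} \ \text{bytes} = 100 \ \text{GB}$$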
One thing I've noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don't have any syntax or logical errors. I don't think that was possible with earlier models like GPT-3.5.
I donated $100, roughly equivalent to my yearly spending on Twitter/X Premium, because I believe LessWrong offers similar value. I would encourage most readers to do the same.
Update: I've now donated $1,000 in total for philanthropic reasons.
If you're interested in doing a PhD in AI in the UK, I recommend applying for the Centres for Doctoral Training (CDTs) in AI such as:
Note that these programs are competitive so the acceptance rate is ~10%.
I agree. I don't see a clear distinction between what's in the model's predictive model and what's in the model's preferences. Here is a line from the paper "Learning to summarize from human feedback":
"To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x."
Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
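As a minimal sketch of that setup (the base model name, pooling choice, and loss below are my own assumptions for illustration, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Pretrained LM backbone plus a randomly initialized linear head that outputs a scalar."""

    def __init__(self, base_name: str = "gpt2"):  # base model name is a placeholder
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)               # inherits the LM's weights
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)   # random init, scalar output

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1                               # index of final non-pad token
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(pooled).squeeze(-1)

def preference_loss(score_chosen, score_rejected):
    """Train the head so the human-preferred summary gets the higher scalar score."""
    return -torch.nn.functional.logsigmoid(score_chosen - score_rejected).mean()
```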
I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.
I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".
Best-of-n or rejection sampling is an alternative to RLHF which involves generating several responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because it's searching for LLM outputs that achieve the highest reward from the reward model.
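A minimal sketch of what I mean (`generate_response` and `reward_model_score` are hypothetical stand-ins for an LLM sampler and a trained reward model, not any particular library's API):

```python
def best_of_n(prompt, generate_response, reward_model_score, n=16):
    """Sample n completions and return the one the reward model scores highest."""
    candidates = [generate_response(prompt) for _ in range(n)]
    scores = [reward_model_score(prompt, c) for c in candidates]
    return max(zip(candidates, scores), key=lambda pair: pair[1])[0]
```

The only optimization here is the final selection step, but as n grows it applies increasing selection pressure against the reward model, much like RLHF does during training.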
I'd also argue that AlphaGo/A...
SummaryBot summary from the EA Forum:
Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.
Key points:
Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.
The parent-and-child outcome seems a bit worse than that, because a small child is usually completely dependent on their parent, and all of their resources are controlled by the parent unless they have pocket money or something like that.
It's an original LessWrong post by me, though all the quotes and references are from external sources.
There's a rule of thumb on the internet called the "1% rule": roughly 1% of users contribute content to a forum, while the other 99% only read it.
Thank you for the insightful comment.
On the graph of alignment difficulty versus cost, I think the shape depends on how quickly the inherent cost of alignment rises and on how much automation we can expect, which is similar to the idea of the offence-defence balance.
In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty, and automation might only lower that to a linear increase.
In the best case, automation covers all of the costs associated with increasing alignment difficulty, the graph is flat in terms of human effort, and more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.
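To make the shapes concrete (this parameterization is just my illustration, not something from the post): if $d$ is the alignment difficulty level and $C(d)$ is the human effort needed to implement an adequate solution at that level, the cases above correspond roughly to

$$C_{\text{worst}}(d) \propto e^{\alpha d}, \qquad C_{\text{partial automation}}(d) \propto d, \qquad C_{\text{best}}(d) \approx \text{constant}.$$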
The rate of progress on the MATH dataset is incredible and faster than I expected.
The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.
The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset on June 30, 2025 and ~80% accuracy by 2028.
But recently (September 2024), OpenAI rel...
Do we know that the test set isn’t in the training data?
Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.
...The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there
Nice paper! I found reading it quite insightful. Here are some key extracts from the paper:
Improving adversarial robustness by classifying several down-sampled noisy images at once:
"Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to
classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training."
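A rough sketch of the input construction described in that extract (the specific resolutions, jitter range, and noise scale are my guesses, not the paper's exact values):

```python
import torch
import torch.nn.functional as F

def make_multi_resolution_stack(image, resolutions=(224, 112, 56), noise_std=0.1, max_jitter=4):
    """Build a channel-wise stack of downsampled, jittered, noisy copies of one image.

    image: tensor of shape (C, H, W) with values in [0, 1].
    Returns a tensor of shape (C * len(resolutions), resolutions[0], resolutions[0]).
    """
    versions = []
    for res in resolutions:
        dh = torch.randint(-max_jitter, max_jitter + 1, (1,)).item()   # stochastic jitter
        dw = torch.randint(-max_jitter, max_jitter + 1, (1,)).item()
        jittered = torch.roll(image, shifts=(dh, dw), dims=(1, 2))
        down = F.interpolate(jittered.unsqueeze(0), size=(res, res), mode="bilinear", align_corners=False)
        noisy = down + noise_std * torch.randn_like(down)              # additive noise
        # Upsample back to a common size so all copies can be concatenated channel-wise.
        common = F.interpolate(noisy, size=(resolutions[0], resolutions[0]), mode="bilinear", align_corners=False)
        versions.append(common.squeeze(0).clamp(0, 1))
    return torch.cat(versions, dim=0)  # fed to the classifier as a single multi-channel input
```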
Improving...
I suspect the desire for kids/lineage is really basic for a lot of people (almost everyone?)
This seems like an important point. One of the arguments for the inner alignment problem is that evolution intended to select humans for inclusive genetic fitness (IGF) but humans were instead motivated by other goals (e.g. seeking sex) that were strongly correlated with IGF in the ancestral environment.
Then when humans' environment changed (e.g. the invention of birth control), the correlation between these proxy goals and IGF broke down resulting in low fitness an...
I think the Zotero PDF reader has a lot of similar features that make the experience of reading papers much better:
I was thinking of doing this, but the ChatGPT web app has many features that are only available there and add a lot of value, such as Code Interpreter, PDF uploads, DALL-E, and custom GPTs, so I still use ChatGPT Plus.
Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.
It seems like the Centre for AI Security is a new organization.
I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.
Is MIRI still doing technical alignment research as well?
Yes.
This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.
Thanks for the table; it provides a good summary of the post's findings. It might also be worthwhile to add it to the EA Forum post as well.
I think the table should include the $10 million in OpenAI Superalignment fast grants as well.
I think there are some great points in this comment, but I think it's overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community's culture than opinions or personal interactions is upvotes and downvotes, which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of...
State-of-the-art models such as Gemini aren't LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
- Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.
A recent paper from UC Berkeley named Preventing Reward Hacking with Occupancy Measure Regularization proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:
..."Our insight
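To illustrate the distinction (this is my own simplified sketch, not the paper's algorithm; I use total-variation distance on empirical visitation frequencies, and the paper may well use a different divergence):

```python
from collections import Counter

def empirical_occupancy(trajectories):
    """Empirical state-action visitation distribution, where each trajectory
    is a list of (state, action) pairs sampled from a policy."""
    counts = Counter(sa for traj in trajectories for sa in traj)
    total = sum(counts.values())
    return {sa: c / total for sa, c in counts.items()}

def occupancy_penalty(policy_trajs, reference_trajs):
    """Total-variation distance between the policy's and the reference policy's
    occupancy measures, usable as a regularization term."""
    d_pi = empirical_occupancy(policy_trajs)
    d_ref = empirical_occupancy(reference_trajs)
    support = set(d_pi) | set(d_ref)
    return 0.5 * sum(abs(d_pi.get(sa, 0.0) - d_ref.get(sa, 0.0)) for sa in support)

# Objective sketch: maximize  E[reward] - lambda * occupancy_penalty(policy_trajs, reference_trajs),
# instead of penalizing a divergence between action distributions at each visited state.
```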
The risks of nuclear weapons, the most dangerous technology of the 20th century, were largely managed by creating a safe equilibrium via mutual assured destruction (MAD), an innovative idea from game theory.
A similar pattern could apply to advanced AI, making it valuable to explore game theory-inspired strategies for managing AI risk.