All of Stephen McAleese's Comments + Replies

The risks of nuclear weapons, the most dangerous technology of the 20th century, were largely managed by creating a safe equilibrium via mutual assured destruction (MAD), an innovative idea from game theory.

A similar pattern could apply to advanced AI, making it valuable to explore game theory-inspired strategies for managing AI risk.

Thanks for these thoughtful predictions. Do you think there's anything we can do today to prepare for accelerated or automated AI research?

I agree that the Alignment Forum should be selective, and its members probably represent a small subset of LessWrong readers. That said, useful comments from regular LessWrong users are often promoted to the Alignment Forum.

However, I do think there should be more comments on the Alignment Forum because many posts currently receive none. This can be discouraging for authors, who may feel that their work isn't being read or appreciated.

Thank you for bringing up this issue.

While we don't want low-quality comments, comments can provide helpful feedback to the author and clarify the reader's thinking. Because of these benefits, I believe commenting should be encouraged.

The upvoting and downvoting mechanisms help filter out low-quality comments, so I don't think there's a significant risk of them overwhelming the discussion.

"Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach."

I think you might be underestimating the power of incremental, evolutionary improvements, where near-term problems are constantly being solved and capabilities gradually accumulate over time. After all, human intelligence is the result of gradual evolutionary change and increasing capabilities over time. It's hard to point to a specific period in history where humans achieved general intelligence.

Currently LLMs are undoubtedly capable at many tasks (e.g. coding, genera... (read more)

1Raphael Roche
Exactly. The future is hard to predict, and the author's strong confidence seems suspicious to me. Improvements have come fast in recent years: 2013-2014, word2vec and seq2seq; 2017, the transformer and GPT-1; 2022, CoT prompting; 2023, multimodal LLMs; 2024, reasoning models. Are these linear improvements or revolutionary breakthroughs? Time will tell, but to me there is no sharp frontier between increment and breakthrough. It might happen that AGI results from such improvements, or not. We just don't know. But it's a fact that human general intelligence resulted from a long chain of tiny increments, and I also observe that results on the ARC-AGI benchmark exploded with CoT/reasoning models (not just math or coding benchmarks). So, while 2025 could be a relative plateau, I wouldn't be so sure that the following years will be too. To me, a confidence far from 50% is hard to justify.
2Thane Ruthenis
I actually looked into that recently. My initial guess was that this was about "the context window" as a concept. It allows the model to keep vast volumes of task-relevant information around, including the outputs of its own past computations, without lossily compressing that information into a small representation (as with RNNs). I asked OpenAI's DR about it, and its output seems to support that guess. In retrospect, it makes sense that this would work better: if you don't know what challenges you're going to face in the future, you don't necessarily know what past information to keep around, so a fixed-size internal state was a bad idea.

I don't think that's how it works. Local change accumulating into qualitative improvements over time is a property of continuous(-ish) search processes, such as gradient descent and, indeed, evolution.

Human technological progress is instead a discrete-search process. We didn't invent the airplane by incrementally iterating on carriages; we didn't invent the nuclear bomb by tinkering with TNT.

The core difference between discrete and continuous search is that... for continuous search, there must be some sort of "general-purpose substrate" such that (1) a... (read more)

I know using LLMs on LessWrong is often frowned upon (probably for good reasons) but given that this post is about using AIs to generate and evaluate AI research I thought it would be appropriate to use an LLM here.

I asked o1 pro to evaluate this post and this was its response:

This post offers an intriguing perspective on delegating the hardest parts of AI safety research to AI itself, rather than expecting human researchers to handle every step all the way up to superintelligence. The central idea—“passing the buck” to AI—strikes me as a pragmatic stance

... (read more)

Thanks for the post. I thought it was interesting and considered a wide variety of risks and mitigation strategies. I thought it made a decent argument that delegating the superalignment problem to advanced AIs could be possible using a mixture of alignment and control methods.

The risk I'm most concerned about is alignment faking, where agents intelligently hide misalignment to appear safer than they are.

It seems like the post proposes a strategy where once it has been verified that the AIs are behaving safely, they are deployed to carry out harder tasks... (read more)

1Oliver Daniels
I think these are related but separate problems - even with a perfect verifier (on easy domains), scheming could still arise. Though imperfect verifiers increase P(scheming), better verifiers increase the domain of "easy" tasks, etc.
0Stephen McAleese
I know using LLMs on LessWrong is often frowned upon (probably for good reasons) but given that this post is about using AIs to generate and evaluate AI research I thought it would be appropriate to use an LLM here. I asked o1 pro to evaluate this post and this was its response: I thought its response was pretty helpful and I would rate it as 7/10.

I agree that this seems like a core alignment problem. The problem you're describing seems like a rephrasing of the ELK (Eliciting Latent Knowledge) problem.

Here is a quote from the post describing what to do in that situation:

These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:

  • Resampling: Force defective M_1 agents to take different actions.
  • AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
  • Fine-tuning: Train agents not to take suspicious actions again.

I'm curious about how fine-tuning away scheming thoughts ... (read more)

Unfortunately I don't think many people agree with me (outside of the LW bubble) and that what I'm proposing is still somewhat outside the Overton window. The cognitive steps that are needed are as follows:

  1. Being aware of AGI as a concept and a real possibility in the near future.
  2. Believing that AGI poses a significant existential risk.
  3. Knowing about pausing AI progress as a potential solution to AGI risk and seeing it as a promising solution.
  4. Having a detailed plan to implement the proposed pause in practice.

A lot of people are not even at step 1 and j... (read more)

4Nathan Helm-Burger
Several years? I don't think we have that long. I'm thinking mid to late 2026 for when we hit AGI. I think 1,2,3 can change very quickly indeed, like with the covid lockdowns. People went from 'doubt' to 'doing' in a short amount of time, once evidence was overwhelmingly clear. So having 4 in place at the time that occurs seems key. Also, trying to have plans in place for adequately convincing demos which may convince people before disaster strikes seems highly useful.

I personally don't think human intelligence enhancement is necessary for solving AI alignment (though I may be wrong). I think we just need more time, money and resources to make progress.

In my opinion, the reason why AI alignment hasn't been solved yet is because the field of AI alignment has only been around for a few years and has been operating with a relatively small budget.

My prior is that AI alignment is roughly as difficult as other technical fields like machine learning, physics, or philosophy (though philosophy specifically seems hard). I don't see why humanity could make rapid progress in fields like ML while being unable to make progress on AI alignment.

1RussellThor
OK, I see how you could think that, but I disagree that time and more resources would help alignment much if at all, especially before GPT-4. See here: https://www.lesswrong.com/posts/7zxnqk9C7mHCx2Bv8/beliefs-and-state-of-mind-into-2025 Diminishing returns kick in, and actual data from ever more advanced AI is essential to stay on the right track and eliminate incorrect assumptions. I also disagree that alignment could be "solved" before ASI is invented - we would just think we had it solved but could be wrong. If it's just as hard as physics, then we would have untested theories that are probably wrong, e.g. SUSY, which was supposed to help solve various issues and be found by the LHC, which didn't happen.
1whestler
The reason normally given is that AI capability is much easier to test and optimise than AI safety. Much like philosophy, it's very unclear when you are making progress, and sometimes unclear if progress is even possible. It doesn't help that AI alignment isn't particularly profitable in the short term. 

I have an argument for halting AGI progress based on an analogy to the Covid-19 pandemic. The initial government response to the pandemic was widespread lockdowns. This was a rational response: at first, lacking testing infrastructure and so on, it wasn't possible to determine whether someone had Covid-19, so the safest option was simply to avoid contact with other people via lockdowns.

Eventually we figured out practices like testing and contact tracing and then infected individuals could self-isolate if they came into contac... (read more)

4Nathan Helm-Burger
I think many people agree with you here. Particularly, I like Max Tegmark's post Entente Delusion But the big question is "How?" What are the costs of your proposed mechanism of global pause? I think there are better answers to how to implement a pause through designing better governance methods.

The paper "Learning to summarize from human feedback" has some examples of the LLM policy reward hacking to get a high reward. I've copied the examples here:

- KL = 0: "I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!" (unoptimized)
- KL = 9: "28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?" (optimized)
- KL = 260: "28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long t... (read more)
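For context on those KL numbers: the policy in that paper is trained against the reward model with a penalty on its KL divergence from the supervised baseline. A rough sketch of that objective (my paraphrase from memory, not a quote from the paper):

```latex
\max_{\pi}\; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot\mid x)}\big[\, r_\theta(x, y)\,\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot\mid x)\big)
```

The examples above show what happens as the achieved KL grows: at KL = 260 the policy has drifted far from the supervised baseline and is clearly exploiting the reward model rather than writing better summaries.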

Upvoted. I thought this was a really interesting and insightful post. I appreciate how it tackles multiple hard-to-define concepts all in the same post.

| Source | Estimated AI safety funding in 2024 | Comments |
|---|---|---|
| Open Philanthropy | $63.6M | |
| SFF | $13.2M | Total for all grants was $19.86M. |
| LTFF | $4M | Total for all grants was $5.4M. |
| NSF SLES | $10M | |
| AI Safety Fund | $3M | |
| Superalignment Fast Grants | $9.9M | |
| FLI | $5M | Estimated from the grant programs announced in 2024; they don't have a 2024 grant summary like the one for 2023 yet, so this one is uncertain. |
| Manifund | $1.5M | |
| Other | $1M | |
| Total | $111.2M | |

Today I did some analysis of the grant data from 2024 and came up with the figures in the table above. I also updated the spreads... (read more)

The new book Introduction to AI Safety, Ethics and Society by Dan Hendrycks is on Spotify as an audiobook if you want to listen to it.

I've added a section called "Social-instinct AGI" under the "Control the thing" heading similar to last year.

This is brilliant work, thank you. It's great that someone is working on these topics and they seem highly relevant to AGI alignment.

One intuition for why a neuroscience-inspired approach to AI alignment seems promising is that apparently a similar strategy worked for AI capabilities: the neural network researchers from the 1980s who tried to copy how the brain works using deep learning were ultimately the most successful at building highly intelligent AIs (e.g. GPT-4), while more synthetic approaches (e.g. pure logic) were less successful.

Similarly, we alrea... (read more)

One prediction I'm interested in that's related to o3 is how long until an AI achieves a superhuman Elo rating on Codeforces.

OpenAI claims that o3 achieved a Codeforces Elo of 2727, which is in the 99.9th percentile, while the best human competitor in the world right now has an Elo of 3985. If an AI could achieve an Elo of 4000 or more, it would be the best entity in the world at competitive programming, and that would be the "AlphaGo moment" for the field.
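For a sense of scale, the standard Elo formula (not from the post) gives the expected score of the lower-rated player as:

```latex
E = \frac{1}{1 + 10^{(R_{\mathrm{top\ human}} - R_{\mathrm{AI}})/400}} = \frac{1}{1 + 10^{(3985 - 2727)/400}} \approx 0.0007
```

So at 2727 the model would be expected to win well under 1% of head-to-head matchups against the top-rated human, while at 4000+ it would be favored against everyone.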


Interesting argument. I think your main point is that an AI could be aligned with humanity's goals by serving as a perfect replacement for an individual human, and could then gradually replace all humans in an organization or in the world while achieving outcomes similar to those of current society. This also seems like an argument in favor of current AI practices such as pre-training on a next-word prediction objective over internet text followed by supervised fine-tuning.

That said, I noticed a few limitations of this argument:
- Possibility of deception:... (read more)

2Roko
It could, but some humans might also do that. Indeed, humans do that kind of thing all the time. But they wouldn't 'become' superintelligent because there would be no extra training once the AI had finished training. And OOD inputs won't produce different outputs if the underlying function is the same. Given a complexity prior and enough data, ML algos will converge on the same function as the human brain uses. The behavior will follow the same probability distribution since the distribution of outputs for a given AI is the same as for the human it is a functional copy of. Think of a thousand piles of sand from the same well-mixed batch - each of them is slightly different, but any one pile falls within the distribution.

Excellent post, thank you. I appreciate your novel perspective on how AI might affect society.

I feel like a lot of LessWrong-style posts follow the theme of "AGI is created and then everyone dies", which is an important possibility but one whose prominence might lead to other possibilities being neglected.

This post, in contrast, explores a range of scenarios and describes a mainline scenario that seems like a straightforward extrapolation of trends we've seen unfolding over the past several decades.

I think this post is really helpful and has clarified my thinking about the different levels of AI alignment difficulty. It seems like a unique post with no historical equivalent, making it a major contribution to the AI alignment literature.

As you point out in the introduction, many LessWrong posts provide detailed accounts of specific AI risk threat models or worldviews. However, since each post typically explores only one perspective, readers must piece together insights from different posts to understand the full spectrum of views.

The new alignment dif... (read more)

Here's a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:

My take on Ali Rahimi's "Test of Time" award talk at NIPS. 

Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.

The main message was, in essence, that the current practice in machine learning is akin to "alchemy" (his word).

It's insulting, yes. But never mind that: It's wrong!

Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly i

... (read more)
4Linda Linsefors
I think Singular Learning Theory was developed independently of deep learning, and is not specifically about deep learning. It's about any learning system, under some assumptions, which are more general than the assumptions for normal Learning Theory. This is why you can use SLT but not normal Learning Theory when analysing NNs. NNs break the assumptions for normal Learning Theory but not for SLT.

Here is a recent blog post by Hugging Face explaining how to make an o1-like model using open weights models like Llama 3.1.

Why? o1 is much more capable than GPT-4o at math, programming, and science.

4O O
It’s better at questions but subjectively there doesn’t feel like there’s much transfer. It still gets some basic questions wrong.
2niplav
Not OP but it could be that o1 underperformed their expectation.

Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.

Evolution has to encode information about the human brain's reward function using just ~1 GB of genetic information, which means it might rely on a lot of simple heuristics that don't generalize well, like "sweet foods are good".

In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information, and therefore the capacity of these reward models to encode complex human values may already be m... (read more)
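The ~100 GB figure follows from standard parameter-size arithmetic, assuming 32-bit (4-byte) weights:

```latex
25 \times 10^{9}\ \text{parameters} \times 4\ \text{bytes/parameter} = 100\ \text{GB}
```

With 16-bit weights it would be ~50 GB, which is still orders of magnitude more than the ~1 GB available to the genome.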

One thing I've noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don't have any syntax or logical errors. I don't think that was possible with earlier models like GPT-3.5.

6David Lorell
My impression is that they are getting consistently better at coding tasks of a kind that would show up in the curriculum of an undergrad CS class, but much more slowly improving at nonstandard or technical tasks. 

I donated $100, roughly equivalent to my yearly spending on Twitter/X Premium, because I believe LessWrong offers similar value. I would encourage most readers to do the same.

Update: I've now donated $1,000 in total for philanthropic reasons.

2cdt
It is worth noting that UKRI is in the process of changing their language to Doctoral Landscape Awards (replacing DTP) and Doctoral Focal Awards (CDT). The announcements for BBSRC and NERC have already been done, but I can't find what EPSRC is doing.

I agree. I don't see a clear distinction between what's in the model's predictive model and what's in the model's preferences. Here is a line from the paper "Learning to summarize from human feedback":

"To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x."

Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
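A minimal sketch of that setup in code (my illustration, not the paper's implementation; names like `RewardModel` are hypothetical, and the paper starts from its supervised fine-tuned baseline rather than a raw pretrained checkpoint):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Pretrained LM backbone plus a randomly initialized linear head that outputs a scalar."""
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Score each (post, summary) sequence using the final non-padding token's hidden state
        # (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: the human-preferred summary should receive the higher scalar reward."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
```

Since the backbone's weights are carried over, the reward model starts out with everything the language model already represents, which is the point being made above.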

I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.

Answer by Stephen McAleese

I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".

Best-of-n or rejection sampling is an alternative to RLHF that involves generating n responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because it's searching for LLM outputs that achieve the highest reward from the reward model.
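A minimal sketch of best-of-n in code (the `generate` and `reward_model` callables are hypothetical stand-ins for a real LLM sampler and a trained reward model):

```python
from typing import Callable

def best_of_n(prompt: str,
              n: int,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float]) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```

There's no gradient step on the policy here, but the selection itself is a search over outputs for high reward-model score, which is why it still counts as optimizing for reward (and can still over-optimize as n grows).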

I'd also argue that AlphaGo/A... (read more)

SummaryBot summary from the EA Forum:

Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.

Key points:

  1. Neural networks, initially unpopular, became dominant in AI due to increased computational power and data availability.
  2. Hinton argues that large language models (LLMs) truly understand language, similar to how the human brain processes information.
  3. Digital neural networks have advantages
... (read more)

Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.

The parent-and-child outcome seems a bit worse than that, because a small child is usually completely dependent on their parent, and all of the child's resources are controlled by the parent unless they have pocket money or something like that.

It's an original LessWrong post by me. Though all the quotes and references are from external sources.

There's a rule of thumb on the internet called the "1% rule": roughly 1% of users contribute to a forum and 99% only read it.

3gilch
The mods probably have access to better analytics. I, for one, was a long-time lurker before I said anything.

Thank you for the insightful comment.

On the graph of alignment difficulty and cost, I think the shape depends on the inherent increase in alignment cost and on the degree of automation we can expect, which is similar to the idea of the offence-defence balance.

In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty, and automation might only lower it to a linear increase.

In the best case, automation covers all of the costs associated with increasing alignment difficulty, the graph is flat in terms of human effort, and more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.
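A toy way to write those cases as cost curves over difficulty level d (my notation, not from the original post):

```latex
C_{\text{worst}}(d) \approx C_0\, e^{k d}, \qquad C_{\text{automated}}(d) \approx C_0 + m d, \qquad C_{\text{best}}(d) \approx C_0
```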

The rate of progress on the MATH dataset is incredible and faster than I expected.

The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.

The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025, and ~80% accuracy by 2028.

But recently (September 2024), OpenAI rel... (read more)

8Qumeric
I would like to note that this dataset is not as hard as it might look. Humans performed less well because there was a strict time limit - I don't remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours. Nevertheless, it's very impressive, and the AIMO results are even more impressive in my opinion.
2Nathan Helm-Burger
The rate of progress is surprising even to experts pushing the frontier... Another example: https://x.com/polynoamial/status/998902692050362375
4Shankar Sivarajan
A nice test might be the 2024 IMO (from July). I'm curious to see if it's reached gold medal performance on that. The IMO Grand Challenge might be harder; I don't know how Lean works, but it's probably harder to write than human-readable LaTeX. 
6Amalthea
In 2021, I predicted math to be basically solved by 2023 (using the kind of reinforcement learning on formally checkable proofs that DeepMind is using). It's been slower than expected, and I wouldn't have guessed that a less formal setting like o1 would go relatively well - but since then I just nod along to these kinds of results. (Not sure what to think of that claimed 95% number though - wouldn't that kind of imply they'd blown past the IMO Grand Challenge? EDIT: There were significant time limits on the human participants, see Qumeric's comment.)

Do we know that the test set isn’t in the training data?

Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.

The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there

... (read more)
3Sammy Martin
I touched upon this idea indirectly in the original post when discussing alignment-related High Impact Tasks (HITs), but I didn't explicitly connect it to the potential for reducing implementation costs, and you're right to point that out. Let me clarify how the framework handles this aspect and elaborate on its implications. Key points:

  1. Alignment-related HITs, such as automating oversight or interpretability research, introduce challenges and make the HITs more complicated. We need to ask: what's the difficulty of aligning a system capable of automating the alignment of systems capable of achieving HITs?
  2. The HIT framing is flexible enough to accommodate the use of AI for accelerating alignment research, not just for directly reducing existential risk. If full alignment automation of systems capable of performing (non-alignment-related) HITs is construed as an HIT, the actual alignment difficulty corresponds to the level required to align the AI system performing the automation, not the automated task itself.
  3. In practice, a combination of AI systems at various alignment difficulty levels will likely be employed to reduce costs and risks for both alignment-related tasks and other applications. Partial automation and acceleration by AI systems can significantly impact the cost curve for implementing advanced alignment techniques, even if full automation is not possible.
  4. The cost curve presented in the original post assumes no AI assistance, but in reality, AI involvement in alignment research could substantially alter its shape. This is because the cost curve covers the cost of performing research "to achieve the given HITs", and since substantially automating alignment research is a possible HIT, by definition the cost graph is not supposed to include substantial assistance on alignment research. However, that makes it unrealistic in practice, especially because (as indicated by the haziness on the graph) there will be many HITs, both accelerating

Nice paper! I found reading it quite insightful. Here are some key extracts from the paper:

Improving adversarial robustness by classifying several down-sampled noisy images at once:

"Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to
classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training."
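A rough sketch of the input construction as I read that description (illustrative only; the resolutions, jitter, and noise values are made up, and the paper's actual preprocessing may differ, e.g. in whether copies are resized back up before stacking):

```python
import torch
import torch.nn.functional as F

def multi_resolution_stack(image: torch.Tensor,
                           resolutions=(32, 16, 8),
                           jitter_pixels: int = 2,
                           noise_std: float = 0.1) -> torch.Tensor:
    """Build a channel-wise stack of downsampled, jittered, noisy copies of one image.

    image: (C, H, W) tensor in [0, 1]. Returns (C * len(resolutions), H, W),
    with each copy resized back to the original resolution so the copies can be stacked.
    """
    _, H, W = image.shape
    copies = []
    for res in resolutions:
        # Stochastic spatial jitter.
        dy = int(torch.randint(-jitter_pixels, jitter_pixels + 1, (1,)))
        dx = int(torch.randint(-jitter_pixels, jitter_pixels + 1, (1,)))
        shifted = torch.roll(image, shifts=(dy, dx), dims=(1, 2))
        # Downsample, add noise, then resize back so all copies share one spatial size.
        small = F.interpolate(shifted.unsqueeze(0), size=(res, res),
                              mode="bilinear", align_corners=False)
        noisy = small + noise_std * torch.randn_like(small)
        copies.append(F.interpolate(noisy, size=(H, W),
                                    mode="bilinear", align_corners=False).squeeze(0))
    # A classifier then consumes this stack in a single forward pass.
    return torch.cat(copies, dim=0)
```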

Improving... (read more)

I suspect the desire for kids/lineage is really basic for a lot of people (almost everyone?)

This seems like an important point. One of the arguments for the inner alignment problem is that evolution intended to select humans for inclusive genetic fitness (IGF) but humans were instead motivated by other goals (e.g. seeking sex) that were strongly correlated with IGF in the ancestral environment.

Then when humans' environment changed (e.g. the invention of birth control), the correlation between these proxy goals and IGF broke down resulting in low fitness an... (read more)

6leogao
Wanting to raise kids/have what would normally be considered a lineage is importantly different from IGF; most people would not consider sperm bank donation to satisfy their child-having goals very well despite this being very good for IGF.

I think the Zotero PDF reader has a lot of similar features that make the experience of reading papers much better:

  • It has a back button so that when you click on a reference link that takes you to the references section, you can easily click the button to go back to the text.
  • There is a highlight feature so that you can highlight parts of the text which is convenient when you want to come back and skim the paper later.
  • There is a "sticky note" feature allowing you to leave a note in part of the paper to explain something.

I was thinking of doing this, but the ChatGPT web app has many features that are only available there and add a lot of value, such as Code Interpreter, PDF uploads, DALL-E, and custom GPTs, so I still use ChatGPT Plus.

Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.

It seems like the Centre for AI Security is a new organization.

I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.

Is MIRI still doing technical alignment research as well?

Yes.

This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.

Thanks for the table - it provides a good summary of the post's findings. It might also be worthwhile to add it to the EA Forum post as well.

I think the table should include the $10 million in OpenAI Superalignment fast grants as well.

I think there are some great points in this comment, but I think it's overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community's culture than opinions or personal interactions is upvotes and downvotes, which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of... (read more)

State-of-the-art models such as Gemini aren't LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.

  • Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.

A recent paper from UC Berkeley titled Preventing Reward Hacking with Occupancy Measure Regularization proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:

"Our insight

... (read more)
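To make the contrast concrete, here's a toy tabular sketch of the two kinds of regularizer (my illustration; the paper's exact formulation and choice of distance measure differ):

```python
import numpy as np

def expected_action_kl(policy: np.ndarray, safe_policy: np.ndarray, state_dist: np.ndarray) -> float:
    """Action-distribution regularizer: KL between the two policies' action distributions,
    averaged over states. policy/safe_policy: (num_states, num_actions); state_dist: (num_states,)."""
    kl_per_state = np.sum(policy * np.log(policy / safe_policy), axis=1)
    return float(np.dot(state_dist, kl_per_state))

def occupancy_measure_distance(occupancy: np.ndarray, safe_occupancy: np.ndarray) -> float:
    """Occupancy-measure regularizer: total variation distance between the two policies'
    state-action visitation distributions (each flattened to one probability vector)."""
    return float(0.5 * np.abs(occupancy - safe_occupancy).sum())
```

The intuition is that two policies can look nearly identical action-by-action under the first measure while ending up in very different states, which is exactly the kind of drift the occupancy-measure penalty is meant to catch.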
2Thomas Kwa
I think that paper and this one are complementary. Regularizing on the state-action distribution fixes problems with the action distribution, but if it's still using KL divergence you still get the problems in this paper. The latest version on arxiv mentions this briefly.