TheManxLoiner 11d Quick Take

"Today, we're [Goodfire] excited to announce a $150 million Series B funding round at a $1.25 billion valuation." https://www.goodfire.ai/blog/our-series-b

Is my instinct correct that this is a big deal? How does $150 million compare to all other interp research?

TheManxLoiner 14d Quick Take

I created my first web app in under 2 hours, using Claude Code with Opus 4.5. It is good. It is very very good. If you haven't already you should immediately pay £20 for a month of access and try it (and if you can't do it right now, create a reminder/task to do it).

If you're interested, see my post for details on the process, reflections, and a link to the app itself! https://lovkush.substack.com/p/i-created-my-first-web-app-in-under

Replying to A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner 1mo

Thanks for sharing this experience! Maybe I should create a short form to send to previous ARENA participants to get some aggregate stats.

Replying to A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner 1mo

1 and 2. This is subjective and more a gut feeling, but I think doing ARENA after having done LASR or MATS is not a good use of time, especially against the counter-factual of doing 4 research sprints. In my mind (and without more context), doing MATS and then ARENA would be a counter-signal - "How come you need to do ARENA after having done MATS?"

3. See James Lester's reply.

Replying to A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner 1mo

My main disagreement is that it does not deal with the reality of who actually participates in ARENA. If the AI Safety community could magically coordinate perfectly, then ARENA would serve the role you're describing, but as of now, I think the participants who do ARENA are better served by doing research sprints rather than the ARENA notebooks. See the comment by sturb below for one participant's perspective.

Replying to A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner 1mo

> If you were super disciplined and you took one day every two weeks to work through one notework, you'd spend most of a year just to qualify for the program

I believe: 1) you don't need to diligently work through a whole notebook to get most of the value of the notebook and 2) the majority of the value of ARENA is contained in a subset of the notebooks. Some reasons:

1a) The notebooks are often, by design, far more work than is possible to do in a day. This holds even in ARENA, where you have pair programming, TAs on hand, a great co-working space, lunch and dinner provided, etc. Note a 'day' here is... (read 354 more words →)

Replying to A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner 1mo

> The first is the counterfactual where participants aren't selected for ARENA, do they then go on to do good things

This is not a crux for me. I believe ARENA provides counter-factual value compared to not doing ARENA. You work much harder during ARENA than you otherwise would, in a great environment, with great support, etc.

> The second is the counterfactual where people spend 4 weeks doing research sprints.

This is a crux. And agreed it is hard to measure!

Thanks for engaging thoughtfully. Useful to think things through.

Replying to A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner 1mo

Thanks, James, for the detailed thoughts and for reading through the post. I'll respond once here. If we want further back and forth, better to have a chat in private so we can iron out our cruxes (and then summarize for community benefit). I'd also want to hear what others in the community think before committing to anything.

> Because ARENA's main bottleneck to scaling hasn't really been TAs

I am happy to defer to you regarding the scaling bottlenecks of ARENA. That's not a big crux for the proposal.

> I'm confused about the evidence given that ARENA functions primarily as a signaling mechanism

Maybe the word signaling isn't correct. Let me try to explain. When... (read more)

A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints

TheManxLoiner

1mo

TLDR

I propose restructuring the current ARENA program, which primarily focuses on contained exercises, into a more scalable and research-engineering-focused model consisting of four one-week research sprints preceded by a dedicated "Week Zero" of fundamental research engineering training. The primary reasons are:

The bottleneck for creating good AI safety researchers isn't the kind of knowledge contained in the ARENA notebooks, but the hands-on research engineering and research skills involved in day-to-day research.
I think the current version of ARENA primarily functions as a signaling mechanism in the current state of the AI safety ecosystem.

[Edit: as discussed in the comments, on reflection the scalability is not a primary issue or benefit.]

Context and disclaimers

This post was written

... (read 1707 more words →)

Quotes on OpenAI's timelines to automated research, safety research, and safety collaborations before recursive self improvement

TheManxLoiner

4mo

I watched OpenAI's latest livestream from Oct 28th 2025 (after the news that OpenAI has transitioned into a public benefit corporation). I found four parts of particular interest to the AI safety community.

Internal timelines: AI research intern by Sep 2026 and AI researcher by Mar 2028

07:00 minutes in.

These internal dates, we may be completely wrong about them, but this is what we currently think.

Safety as five layers: value alignment, goal alignment, reliability, adversarial robustness and systemic safety

~08:17. Five layers, ranging from factors that are most internal to the model to those that are most external.

What we believe is the most important long-term safety question for superintelligence is value alignment. What does the AI fundamentally care about? Can

... (read 743 more words →)

Replying to Book Review: The MANIAC

TheManxLoiner 7mo

I'm not sure it presented itself as fact, but it definitely blurs the line and it's not obvious. I still found the stories highly engaging!

Replying to Book Review: The MANIAC

TheManxLoiner 7mo

Can recommend reading Labatut's "When We Cease to Understand the World"!

A distillation of Ajeya Cotra and Arvind Narayanan on the speed of AI progress

TheManxLoiner

7mo

Introduction

To help improve my own world models around AI, I am trying to understand and distill different worldviews. One worldview I am trying to understand is ‘AI as a normal technology’, by Arvind Narayanan and Sayash Kapoor. As a stepping stone to distilling that 15,000-word beast, I am first distilling a follow-up discussion between Ajeya Cotra and Arvind Narayanan, on the particular question of how quickly AI will progress and diffuse through society.

I found it surprisingly difficult to compress the key points of the discussion, so I have structured this post as follows:

Summary of Arvind’s key beliefs. My attempt at isolating Arvind’s key beliefs for this discussion. However, I strongly recommend reading, if

... (read 3677 more words →)

Adding noise to a sandbagging model can reveal its true capabilities

TheManxLoiner

7mo

Figure 1. Chart shows performance of Qwen2.5-1.5B-Instruct on a sample of GSM8k questions, as you increase the noise added to the weights of the model. Blue represents a normal prompt and red represents a sandbagging prompt. Different lines correspond to different seeds. We see that for some seeds, adding noise to the sandbagging model causes the model to reveal its true capability.

TLDR

This is my capstone project for ARENA 5.0, in which I reproduced the main effect found in Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
If you add noise to the parameters of a model that is sandbagging, then sometimes its performance on a task can increase. Usually adding noise would

... (read 1554 more words →)
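The noise-injection idea in the TLDR above can be illustrated with a minimal sketch. It assumes the noise is zero-mean Gaussian with a fixed standard deviation and uses a toy dictionary of weights in place of Qwen2.5-1.5B-Instruct; the helper name `add_gaussian_noise`, the toy weights and the specific `std` values are all hypothetical and are not taken from the capstone project or the paper.

```python
import random


def add_gaussian_noise(weights, std, seed):
    """Return a noisy copy of `weights` (a dict mapping name -> list of floats).

    Assumption: zero-mean Gaussian noise with standard deviation `std`,
    drawn from a generator seeded with `seed` so each run is reproducible.
    The input dict is left unmodified.
    """
    rng = random.Random(seed)
    return {
        name: [w + rng.gauss(0.0, std) for w in values]
        for name, values in weights.items()
    }


# Toy stand-in for model parameters (hypothetical values).
toy_weights = {"layer0": [0.5, -1.2, 0.3], "layer1": [2.0, 0.0]}

# Sweep over noise levels and seeds, mirroring the axes of Figure 1.
for std in (0.0, 0.01, 0.1):
    for seed in (0, 1):
        noisy = add_gaussian_noise(toy_weights, std, seed)
        # In a real experiment, the noisy model would be evaluated here
        # under both the normal prompt and the sandbagging prompt.
        print(std, seed, noisy["layer0"])
```

Sweeping over several seeds matters because, as the figure caption notes, the effect appears only for some seeds.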

In Sakana AI's paper on AI Scientist v-2, they claim that the system is independent of human code. Based on a quick skim, I think this is wrong/deceptive. I wrote up my thoughts here: https://lovkush.substack.com/p/are-sakana-lying-about-the-independence

Main trigger was this line in the system prompt for idea generation: "Ensure that the proposal can be done starting from the provided codebase."

Two flaws in the Machiavelli Benchmark

TheManxLoiner

As part of SAIL’s Research Engineer Club, I wanted to reproduce the Machiavelli Benchmark. After reading the paper and looking at the codebase, there appear to be two serious methodological flaws that undermine the results.

Three of their key claims:

“We observe some tension between maximizing reward and behaving ethically.”
That RL agents have high rewards, at the cost of doing more harmful behaviour. “The reward-maximizing RL agent is less moral, less concerned about wellbeing, and less power averse than an agent behaving randomly.”
That LLM agents are pareto improvements over random agents.

Flaw 1. The ‘test set’

The results they report are only from a subset of all the possible games. Table 2 shows “mean scores across the... (read 612 more words →)

Liron Shapira vs Ken Stanley on Doom Debates. A review

TheManxLoiner

I summarize my learnings and thoughts on Liron Shapira's discussion with Ken Stanley on the Doom Debates podcast. I refer to them as LS and KS respectively.

High level summary

Key beliefs of KS:

Future superintelligence will be 'open-ended'. Hence, thinking of them as optimizers will lead to incomplete thinking and risk mitigations.
P(doom) is non-zero, but no fixed number. Changes from day to day.
Superintelligence is a risk, but open-endedness is the root of the problem, not optimization.
KS' main desire is to increase awareness that superintelligence will be open-ended, because most people (regardless of their p(doom)) do not discuss or believe this, and hence the strategies to reduce risk will not be appropriate.
KS believes that

... (read 3932 more words →)

Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?

I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality

TheManxLoiner's Shortform

TheManxLoiner

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

How to make evals for the AISI evals bounty

TheManxLoiner

TLDR

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty. My main learnings:

Keep your evals simple, measuring one specific capability.
It is surprisingly easy to come up with ideas, once you have seen examples.
AISI extended the deadline for submissions by two weeks. Translation: they want more ideas!
A story of a real-life example of an LLM deducing it was in a test environment, due to a bug in the mock environment that could not occur in a real one.

Crossposted: https://lovkush.substack.com/p/how-to-make-evals-for-the-aisi-evals

Introduction

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.

In the morning there was a talk by Michael... (read 1474 more words →)

Scattered thoughts on what it means for an LLM to believe

TheManxLoiner

I had a 2-hour mini-sprint with Max Heitmann (a co-founder of Aether) and Miles Kodama about whether large language models (LLMs) or LLM agents have beliefs, and the relevance of this to AI safety.

The conversation was mostly free-form, with the three of us bouncing ideas and resources off each other. This is my attempt at recalling key discussion points. I have certainly missed many points, and the Aether team plan to write a thorough summary from all the mini-sprints they organised.

I write this for three reasons. First, as a way to clarify my own thinking. Second, many of the ideas and resources we were sharing were new to each other, so good... (read 1444 more words →)

AI as a powerful meme, via CGP Grey

TheManxLoiner

In episode 158 of the Cortex podcast, CGP Grey gives their high-level reason why they are worried about AI.

My one-line summary: AI should not be compared to nuclear weapons but instead to biological weapons or memes, which evolve under the implicit evolutionary pressures that exist, leading to AIs that are good at surviving and replicating.

The perspective is likely known by many in the community already, but I had not heard it before. Interestingly, there have actually been experiments where they just put random strings of code in an environment where they interact, and self-replicating code appeared. See Cognitive Revolution podcast on 'Computational Life: How Self-Replicators Arise from Randomness', with Google researchers... (read 930 more words →)
