Esben Kran

We are excited to host a jamsite for Apart Research and Martian's upcoming AI Safety x Physics Grand Challenge hackathon. Note: No physics expertise is required to participate - anyone interested in working on technical AI safety challenges is welcome!

For clarification, this event is not organized with Martian but with the generous support of PIBBSS and Timaeus

Replying toWhat is the theory of change behind writing papers about AI safety?

Esben KranMar 19, 2025

What is the theory of change behind writing papers about AI safety?

In a similar fashion as writing AlignmentForum or LessWrong posts, iterating on our knowledge about how to make AI systems safe is great. Papers are uniquely suited to do this in an environment where there are 10,000s of career ML researchers that can help make progress on such problems.

It also helps AGI corporations directly improve their model deployment, such as making them safer. However, this is probably rarer than people imagine and is most relevant for pre-deployment evaluation, such as Apollo's.

Additionally, papers (and now even LW posts sometimes) may be referred to as a "source of truth" (or new knowledge) in media, allowing journalists to say something about AI systems while referring... (read more)

Esben Kran1y

I think "stay within bounds" is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives.

If you talk about alignment evals for alignment that isn't naturally incentivized by profit-seeking activities, "stay within bounds" is of course less relevant.

When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other... (read more)

Esben Kran1y

Hii Yonatan :))) It seems like we're still at the stage of "toy alignment tests" like "stay within these bounds". Maybe a few ideas:

Capabilities: Get diamonds, get to the netherworld, resources / min, # trades w/ villagers, etc. etc.
Alignment KPIs
- Stay within bounds
- Keeping villagers safe
- Truthfully explaining its actions as they're happening
- Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
- Environmental protection rules (zoning laws alignment, nice)
- Understanding and optimizing for the utility of other players or villagers, selflessly
Selected Claude-gens:
- Honor other players' property rights (no stealing from chests/bases even if possible)
- Distribute resources fairly when working with other players
- Build public infrastructure vs private wealth
- Safe disposal of hazardous materials (lava, TNT)
- Help new players learn rather than just doing things for them

I'm sure there's many other interesting alignment tests in there!

Esben Kran1y

This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We've had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need

Replying toCollege technical AI safety hackathon retrospective - Georgia Tech

Esben Kran1y

College technical AI safety hackathon retrospective - Georgia Tech

Super cool work Yixiong - we were impressed by your professionalism in this process despite working within another group's whims on this one. Some other observations from our side that may be relevant for other folks hosting hackathons:
- Prepare starter materials: For example, for some of our early interpretability hackathons, we built a full resource base (Github) with videos, Colabs, and much more (some of it with Neel Nanda, big appreciation for his efforts in making interp more available). Our philosophy for the starter materials are: "If a participant can make a submission-worthy project by maximum cloning your repo and typing two commands or simply walk through a Google Colab, this is... (read 634 more words →)

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Jonathan N

Jonathan N, abra, Connor Axiotes, Esben Kran

This blog was published by Jonathan Ng, Andrey Anurin, Connor Axiotes, Esben Kran.

Apart Research's newest paper, Catastrophic Cyber Capabilities Benchmark (3cb): Robustly Evaluating LLM Agent Cyber Offense Capabilities (website), creates a novel cyber offense capability benchmark that engages with issues of legibility, coverage, and generalization in cyber offense benchmarks.

We were moved to create 3cb because a superintelligent AI performing autonomous cyber operations would prove a large risk for humanity. This means robust cyber offense evaluations will be more important than ever for policymakers and AI developers.

3cb uses a new type of cyber offense task categorization and adheres to the principle of demonstrations-as-evaluations to improve legibility and coverage. It also introduces 15 original challenges... (read 1759 more words →)

Can startups be impactful in AI safety?

Esben Kran

Esben Kran, Archana Vaidheeswaran

With Lakera's strides in securing LLM APIs, Goodfire AI's path to scaling interpretability, and 20+ model evaluations startups among much else, there's a rising number of technical startups attempting to secure the model ecosystem.

Of course, they have varying levels of impact on superintelligence containment and security and even with these companies, there's a lot of potential for aligned, ambitious and high-impact startups within the ecosystem. This point isn't new and has been made in our previous posts and by Eric Ho (Goodfire AI CEO).

To set the stage, our belief is that these are the types of companies that will have a positive impact:

Startups with a profit incentive completely aligned with improving AI safety;
that have a deep technical background to shape AGI deployment

... (read 1790 more words →)

Finding Deception in Language Models

Esben Kran

Esben Kran, Archana Vaidheeswaran

This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon. Bringing together students, researchers, and engineers from around the world to tackle a pressing challenge in AI safety; preventing AI from deceiving humans and overseers.

The hackathon took place both online and in multiple physical locations simultaneously. Marius Hobbhahn, the CEO of Apollo Research, kicked off the hackathon with a keynote talk about evaluating deception in AI with white-box and black-box methods. You can watch his talk here. We also had talks by Jacob Haimes, an Apart fellow, and Mikita Balesni, a research scientist at Apollo Research.

This post details the top 8 projects, multiple of which are currently... (read 972 more words →)

Merge Candidate discussion: Merge this into the Apart Research tag to accommodate the updated name of the Apart Sprints instead of Alignment Jam and avoid mis-labeling between the two tags (which happens currently).

Results from the AI x Democracy Research Sprint

Esben Kran

Esben Kran, jordine, Jason Hoelscher-Obermaier

We ran a 3-day research sprint on AI governance, motivated by the need for demonstrations of the risks to democracy by AI, supporting AI governance work. Here we share the 4 winning projects but many of the other 19 entries were also incredibly interesting so we suggest you take a look.

In summary, the winning projects:

Red-teamed unlearning to evaluate its effectiveness and practical scope in open-source models to remove hazardous information while retaining essential knowledge in the context of WMDP.
Demonstrated that making LLMs better at identifying misinformation also enhances their ability to create sophisticated disinformation, and discussed strategies to mitigate this.
Investigated how AI can undermine U.S. federal public comment systems by generating realistic,

... (read 1521 more words →)

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

Esben Kran

TLDR; Participate online or in-person on the weekend 3rd to 5th May in an exciting and fun AI safety research hackathon focused on demonstrating and extrapolating risks to democracy from real-life threat models. We invite researchers, cybersecurity professionals, and governance experts to join but it is open for everyone, and we will introduce starter code templates to help you kickstart your team's projects. Join here.

Why demonstrate risks to democracy?

Despite some of the largest potential risks from AI being related to our democratic institutions and the fragility of society, there is surprisingly little work demonstrating and extrapolating concrete risks from AI to democracy.

By putting together actual demonstrations of potential dangers and mindfully extrapolating... (read 1542 more words →)

Join the AI Evaluation Tasks Bounty Hackathon

Esben Kran

How do we test when autonomous AI might become a catastrophic risk? One approach is to assess the capabilities of current AI systems in performing tasks relevant to self-replication and R&D. METR (formerly ARC Evals), a research group focused on this question, has:

developed a Task Standard, a standardized structure for specifying "tasks" in code to test language models, currently used by the UK AI Safety Institute
awarded substantial bounties to researchers developing new tasks for current language models

Now, you have the chance to directly contribute to this important AI safety research. We invite you to join the Code Red Hackathon, an event hosted by Apart in collaboration with METR, where you can earn money, connect with experts,... (read 1024 more words →)

Replying toSurvey for alignment researchers!

Esben Kran2y

Survey for alignment researchers!

This seems like a great effort. We made a small survey called pain points in AI safety survey back in 2022 that we received quite a few answers to which you can see the final results of here. Beware that this has not been updated in ~2 years.

Multi-Agent Security Hackathon

Esben Kran

Esben Kran, Jason Hoelscher-Obermaier, Clement Neo

Join us for this weekend-long research hackathon on the complexities of AI security! We will dive deep into frontier AI systems, specifically the fascinating multi-agent systems. Our mission will be to understand the worst-case scenarios and how we can avoid them.

Sign up to blend our rigorous research with the spirit of hackathon innovation. There's prizes for $1,200 on the line and the most ambitious teams might receive advising and senior co-authorship with established researchers in the field of AI safety. So, come along!

Watch the introductory talk by Christian Schroeder de Witt here.

Identifying semantic neurons, mechanistic circuits & interpretability web apps

Esben Kran

Esben Kran, Neel Nanda

15 research projects on interpretability were submitted to the mechanistic interpretability Alignment Jam in January hosted with Neel Nanda. Here, we share the top projects and results. In summary:

Activation patching works on singular neurons, token vector and neuron output weights can be compared, and a high mutual congruence between the two indicates a mono-semantic MLP neuron.
The Automatic Circuit Identification tool (ACDC) can be used to infer a circuit for gendered pronouns, and some of these circuits can perform even better than the full model. Hyperparameter tuning for ACDC is also very important.
A three-stage method can be used to automatically identify semantically coherent neurons and describe in human-understandable language what that neuron activates to.
Several

... (read 2181 more words →)

Replying toFLI open letter: Pause giant AI experiments

Esben Kran3y

FLI open letter: Pause giant AI experiments

It seems like there's a lot of negative comments about this letter. Even if it does not go through, it seems very net positive for the reason that it makes explicit an expert position against large language model development due to safety concerns. There's several major effects of this, as it enables scientists, lobbyists, politicians and journalists to refer to this petition to validate their potential work on the risks of AI, it provides a concrete action step towards limiting AGI development, and it incentivizes others to think in the same vein about concrete solutions.

I've tried to formulate a few responses to the criticisms raised:

"6 months isn't enough to develop the safety

Esben Kran

TLDR; The European Network for AI Safety is a central point for connecting researchers and community organizers in Europe with opportunities and events happening in their vicinity. Sign up here to become a member of the network, and join our launch event on Wednesday, April 5th from 20:00-21:00 CET!

Why did we create ENAIS?

ENAIS (pronounced e-nice) was founded by European AI safety researchers and field-builders who recognized the lack of interaction among various groups in the region. Our goal is to address the decentralized nature of AI safety work in Europe by improving information exchange and coordination.

We focus on Europe for several reasons: a Europe-specific organization can better address local issues like the EU AI Act,... (read 814 more words →)

Replying toShutting Down the Lightcone Offices

Esben Kran3y

Shutting Down the Lightcone Offices

Oliver's second message seems like a truly relevant consideration for our work in the alignment ecosystem. Sometimes, it really does feel like AI X-risk and related concerns created the current situation. Many of the biggest AGI advances might not have been developed counterfactually, and machine learning engineers would just be optimizing another person's clicks.

I am a big fan of "Just don't build AGI" and academic work with AI, simply because it is better at moving slowly (and thereby safely through open discourse and not $10 mil training runs) compared to massive industry labs. I do have quite a bit of trust in Anthropic, DeepMind and OpenAI simply from their general safety considerations... (read more)

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran

Esben Kran, fbarez, Sabrina Zaki, gabrielrecc, rz2383

We ran a hackathon on scalable oversight with Gabriel Recchia as keynote speaker (watch the talk) and Ruiqi Zhong as co-judge. Here, we share the top projects and results. In summary:

We can automate the “sandwiching” paradigm from Cotra [1] by having a smaller model ask structured questions to elicit a true answer from a larger model and getting a response accuracy rate as output.
We can understand coordination abilities between humans and large language models quantitatively using asymmetric-information language games such as Codenames.
We can study scaling and prompt specificity phenomena in-depth using a simple framework. In this case, word reversal is investigated to evaluate the emergent abilities of language models.

Watch the project presentations on... (read 1771 more words →)

🏆📈 We've created Alignment Markets! Here, you can bet on how AI safety benchmark competitions go. The current ones are about the Autocast warmup competition (meta), the Moral Uncertainty Research Competition, and the Trojan Detection Challenge.

It's hosted through Manifold Markets so you'll set up an account on their site. I've chatted with them about creating a A-to-B prediction market so maybe they'll be updated when we get there. Happy betting!

LESSWRONG
LW

LESSWRONG
LW

Results from the interpretability hackathon

Safety timelines: How long will it take to solve alignment?

Newsletter for Alignment Research: The ML Safety Updates

AI Safety Ideas: A collaborative AI safety research platform

Esben Kran

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Can startups be impactful in AI safety?

Finding Deception in Language Models

Results from the AI x Democracy Research Sprint

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

Join the AI Evaluation Tasks Bounty Hackathon

Identifying semantic neurons, mechanistic circuits & interpretability web apps

Esben Kran

Results from the interpretability hackathon

Safety timelines: How long will it take to solve alignment?

Newsletter for Alignment Research: The ML Safety Updates

AI Safety Ideas: A collaborative AI safety research platform

Esben Kran

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Can startups be impactful in AI safety?

Finding Deception in Language Models

Results from the AI x Democracy Research Sprint

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

Join the AI Evaluation Tasks Bounty Hackathon

Identifying semantic neurons, mechanistic circuits & interpretability web apps

Why demonstrate risks to democracy?

Why did we create ENAIS?