Replying to The case for AGI safety products

kaivu 20d

n=2, but I didn't observe any investor caring that we (Fulcrum Research) were a PBC when raising at the seed stage, and I think the same is true of Theorem Labs (another safety startup).

Replying to An Aphoristic Overview of Technical AI Alignment proposals

kaivu 1mo

Before reading your disclaimer that Claude helped with the aphorisms, I thought the post read a bit like AI slop.

One example is the line “principles must survive power”. Granted, that’s a true description of a goal one might have in alignment, but the spirit of Constitutional AI is that the model is capable of judging whether it acts in accordance with a principle even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.

I still like the motivating question, and I will check out Epictetus now!

Replying to The Weakest Model in the Selector

kaivu 2mo

Do you have concrete examples of this working? Seems plausible, just curious about the current attack surface.
My intuition is reasoning fixes this well before models are robust in other ways—the OODness of the previous acceptance gets mitigated by model-generated CoT. RL on traces with more context switching would probably also help. I don’t think you need models with self-awareness and stable goals to solve this; seems like a more mundane training distribution issue.
Interesting though, and seems like a more general version of this problem is pretty hard (e.g. for agent decision making).

Replying to Unknown Knowns: Five Ideas You Can't Unsee

kaivu 2mo

Interesting post! I am a bit confused about the application of the local linearity principle to the tax bracket issue.

The content of "differentiable functions are locally linear" is something like: if you nudge an input by ε, the output changes by ~f'(x)·ε. If you double your nudge, you approximately double your effect (to first order). I enjoyed the small goods insurance and altruism examples, since local curvature is negligible at the relevant scale (for altruist utility, and relatively unimportant decisions).

But the tax bracket example doesn't really make sense to me. The feared bad outcome ("being pushed into a higher bracket makes me worse off") is compatible with local linearity (in the sense that the segments are linear), since you could just have f'(x) < 0. What rules out the bad outcome is that marginal tax rates are positive and less than 100%, not anything about smoothness or local approximation.
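
A minimal sketch of this point, assuming a made-up three-bracket schedule (the thresholds, rates, and function names are hypothetical, not taken from the post or any real tax code). Take-home pay is piecewise linear with slope 1 minus the marginal rate, so it keeps rising across a bracket boundary only because every marginal rate is below 100%:

```python
# Hypothetical schedule: (lower threshold, marginal rate). Illustrative numbers only.
BRACKETS = [(0, 0.10), (10_000, 0.20), (40_000, 0.40)]


def tax_owed(income, brackets=BRACKETS):
    """Marginal taxation: each rate applies only to the income inside its bracket."""
    total = 0.0
    for i, (lower, rate) in enumerate(brackets):
        # Upper edge of this bracket is the next threshold, or unbounded for the last one.
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if income > lower:
            total += (min(income, upper) - lower) * rate
    return total


def take_home(income, brackets=BRACKETS):
    return income - tax_owed(income, brackets)


# Crossing the 40,000 threshold: take-home still rises, just more slowly.
assert take_home(40_001) > take_home(39_999)

# A schedule with a marginal rate above 100% is just as piecewise linear,
# yet earning more now leaves you worse off.
BAD = [(0, 0.10), (10_000, 1.50)]
assert take_home(10_001, BAD) < take_home(10_000, BAD)
```

Both schedules are equally "locally linear"; only the sign of the slope, 1 minus the marginal rate, differs between them.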

Automated real time monitoring and orchestration of coding agents

zef

zef, kaivu, leni

4mo

Fulcrum is excited to open-source two new tools:

Quibbler is a critic for your coding agent.
Orchestra is a multi-agent coding system: it uses a designer agent that spawns and coordinates your parallel coding agents.

Motivation

Human attention is a scarce resource.

In the best case, coding with agents allows that attention to be spent on the “right” parts of your code: its functionality, architecture, and its failure-modes. In the worst case, your attention is spent on the behavior of your agents: preventing them from taking unsafe actions, asking them not to reward hack, and understanding if their outputs are trustworthy.

Can we use agents to help us out?

Quibbler

Quibbler is a background agent that monitors and critiques your coding agent’s...

AI agents and painted facades

leni

leni, zef, kaivu

6mo

In 1787, Catherine the Great sailed down the Dnieper to inspect its banks. Her trusted advisor, Governor Potemkin, set out to present those war-torn lands to her in the best possible light. Legend has it^[1] Potemkin set up painted facades along the riverbank, so that, from her barge, Catherine would see beautiful villages – each just a couple of inches thick.

The rise of AI agents makes the Potemkin problem commonplace. Research agents cite experiments that never took place. Coding agents often write fake tests and mock solutions as they cause catastrophes behind the scenes.

We're moving towards a world of Potemkin villages – where our understanding of reality drifts farther and farther from...

Replying to Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

kaivu 2y

Thanks for bringing this up: this was a pretty confusing part of the evaluation.

Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well).

You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.

The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans, Marius Hobbhahn

TLDR: We build a comprehensive benchmark to measure situational awareness in LLMs. It consists of 16 tasks, which we group into 7 categories and 3 aspects of situational awareness (self-knowledge, situational inferences, and taking actions).

We test 19 LLMs and find that all perform above chance, including the pretrained GPT-4-base (which was not subject to RLHF finetuning). However, the benchmark is still far from saturated, with the top-scoring model (Claude-3.5-Sonnet) scoring 54%, compared to a random chance of 27.4% and an estimated upper baseline of 90.7%.

This post has excerpts from our paper, as well as some results on new models that are not in the paper.

Links: Twitter thread, Website (latest results + code),...

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Tony Wang

Tony Wang, Miles Wang, kaivu

Overview

We tried to understand at a circuit level how Llama-2 models perform a simple task: not replying with a word when told not to. We mostly failed at fully reverse engineering the responsible circuit, but in the process we learned a few things about mechanistic interpretability. We share our takeaways here, which we hope are useful to other researchers. We are very interested in critiques of our views.

Our major takeaways:

Some model behaviors might be computationally irreducible.
“Understanding” should not be a terminal goal of interpretability research.

More minor takeaways:

Log-odds is a great metric.
Tooling is important and should not be taken for granted.

More technical details are available in our paper, which serves as the appendix to...

Update on Harvard AI Safety Team and MIT AI Alignment

Xander Davies

Xander Davies, Sam Marks, kaivu, tlevin, leni, maxnadeau, Naomi Bashkansky

We help organize the Harvard AI Safety Team (HAIST) and MIT AI Alignment (MAIA), and are excited about our groups and the progress we’ve made over the last semester.

In this post, we’ve attempted to think through what worked (and didn’t work!) for HAIST and MAIA, along with more details about what we’ve done and what our future plans are. We hope this is useful for the many other AI safety groups that exist or may soon exist, as well as for others thinking about how best to build community and excitement around working to reduce risks from advanced AI.

Important things that worked:

Well-targeted outreach, which (1) focused on the technically interesting parts of alignment (rather ...

LESSWRONG

kaivu

Introducing Lunette: auditing agents for evals and environments

Automated real time monitoring and orchestration of coding agents

AI agents and painted facades

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Update on Harvard AI Safety Team and MIT AI Alignment