Replying to Irrationality as a Defense Mechanism for Reward-hacking

Executive Summary

Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post,^[1] and we are excited for more in the field to embrace pragmatism! In brief, we think that:
- It is crucial to have empirical feedback on your ultimate goal with good proxy tasks.^[2]
- We do not need near-complete understanding to have significant impact.
- We can perform good focused projects by starting with a theory of change, and good exploratory projects by starting with a robustly useful setting.
But that’s pretty abstract. So how

... (read 3981 more words →)

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

2mo

Executive Summary

The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
- Trying to directly solve problems on the critical path to AGI going well^[1]
- Carefully choosing problems according to our comparative advantage
- Measuring progress with empirical feedback on proxy tasks
We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
- Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech

... (read 7963 more words →)

Replying to Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

bilalchughtai, 4mo

just checked and think it is still up?

Replying to A Comprehensive Guide to Running

bilalchughtai, 6mo*

I took up running about a year ago, shortly after starting to wear barefoot shoes. I only run in barefoot shoes and have had no problems. I suspect it helped that I "learned to run" in barefoot shoes; ordinarily one would have to change their stride to fit the shoe better, e.g. by landing midfoot instead of heel striking. As a new runner, I also started off with extremely low volume (~10km a week), which probably also helped build foot strength.

I'm pretty happy overall, though am now tempted to get some proper running shoes for races, as the zero energy return is sad if you care about speed.

Replying to A Comprehensive Guide to Running

bilalchughtai, 6mo

readers of this post may also enjoy the following simpler post: how to run without all the pesky agonizing pain

bilalchughtai, 6mo*, Quick Take

When evaluating whether to invest time in making things more efficient, I often see people compare the one-off cost of making the thing more efficient to the expected future time saved when doing the thing. I think there is often an important third variable to track, one that often swings such decisions from not worth the effort to definitely worth the effort: the expected increase in usage of the thing due to the reduced friction of using it. In practice, I often find this final consideration dominates.
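A rough sketch of this three-variable comparison in Python. The function name, its parameters, and all the numbers are hypothetical illustrations rather than anything from the take itself, and it assumes every quantity (including the value of an extra use) can be expressed in minutes.

```python
def net_benefit_minutes(one_off_cost, saved_per_use, baseline_uses,
                        extra_uses=0, value_per_extra_use=0):
    """Net benefit (in minutes) of making a task more efficient.

    All parameters are illustrative assumptions:
    - one_off_cost: minutes spent making the thing more efficient
    - saved_per_use: minutes saved each time the thing is done
    - baseline_uses: expected future uses at the current level of friction
    - extra_uses: additional uses induced by the reduced friction
    - value_per_extra_use: net value of each additional use, in minutes-equivalent
    """
    time_saved = saved_per_use * baseline_uses
    induced_value = extra_uses * value_per_extra_use
    return time_saved + induced_value - one_off_cost


# Two-variable comparison: one-off cost vs. time saved on existing usage only.
naive = net_benefit_minutes(one_off_cost=120, saved_per_use=1, baseline_uses=60)

# Three-variable comparison: also count the usage induced by lower friction.
full = net_benefit_minutes(one_off_cost=120, saved_per_use=1, baseline_uses=60,
                           extra_uses=100, value_per_extra_use=2)

print(naive)  # -60: looks not worth the effort
print(full)   # 140: worth the effort once induced usage is counted
```

With these made-up numbers the induced-usage term is what flips the sign, which is the pattern the take describes.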

Recent examples from my life:

Reducing the number of button presses needed for common workflows on my laptop means I both can navigate

... (read more)

Replying to An opinionated guide to building a good to-do system

bilalchughtai, 6mo

thanks for the comment!

i keep hoping ai agents get good enough

i am also keen to think through how ai can make my workflows better! i agree that they are not quite there yet for entirely automating parts of task management.

no, I just want them to go away upon completion

seems good, seems eventually possible

The core is I want the system/ tool to be invisible ... Touch tasks as close to once as possible ... Why are you making a "today" list?

i agree that many of the best systems require literally zero cognitive overhead. but i'm skeptical that optimal task management should ever literally be zero effort.

touching each task literally only once (to... (read more)

Replying to An opinionated guide to building a good to-do system

bilalchughtai, 6mo

pretty often! i schedule a lot and time box a bit. the main reason i schedule is so that i get a reminder at the right time of day to do the task. sometimes that time is more just a proxy for some event like "when i'm at the office" or "when i get home". it would be better to specify that precisely but i haven't seen good support for it anywhere yet.

Replying to An opinionated guide to building a good to-do system

bilalchughtai, 6mo

Jotting down "deal with X" on the todoist mobile app and then later figuring out exactly how to deal with it on my laptop is a pretty common workflow of mine. I find it pretty frictionless. I also use the mobile app for other things (e.g. glancing at my daily to-do list while on the go).

Here are some screenshots of the app:

An opinionated guide to building a good to-do system

bilalchughtai

6mo

My to-do system is by far the most important system I have for keeping my life on track. It acts as a second brain, remembering things for me so I don't have to.^[1] Without it, I would be completely lost,^[2] and nowhere near as organized or conscientious. To-do systems done well can massively improve your productivity. They reduce cognitive load when thinking about things you need to do, making you less likely to forget tasks, more likely to choose the right next task and more efficient in doing so, and less distracted when executing on tasks. As such, I think more people should invest time into building themselves a great to-do system.^[3] This post collates... (read 2254 more words →)

Consider applying to PhDs soon!

Last November, I wrote a blog post titled You should consider applying to PhDs (soon!), where I argued it is probably a good use of time for junior AI safety researchers (e.g. people who have recently participated in an upskilling or research program like ARENA or MATS) to apply to PhDs in the current cycle, even if they are on the fence about whether they want to do a PhD.

My core arguments were that academic timelines are very slow (i.e. if you apply this year you would not start until Fall 2026), applications are generally cheap and high information value, and that applying strictly increases your future optionality. I... (read more)

karpathy reviews sleep trackers: https://karpathy.bearblog.dev/finding-the-best-sleep-tracker/

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex, Marius Hobbhahn

Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive.

Abstract:

AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves

... (read 319 more words →)
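To make the idea of a linear probe on activations concrete, here is a minimal, self-contained Python sketch. It is not the paper's method or data: the "activations" are synthetic Gaussian vectors, the probe is a simple difference-of-means direction with a midpoint threshold (one of several ways to fit a linear probe), and every name and number below is a hypothetical illustration.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(honest, deceptive):
    """Fit a linear probe: direction = mean(deceptive) - mean(honest).

    The threshold is the probe score of the midpoint between the two class means.
    """
    mu_h = mean_vector(honest)
    mu_d = mean_vector(deceptive)
    direction = [d - h for d, h in zip(mu_d, mu_h)]
    midpoint = [(d + h) / 2 for d, h in zip(mu_d, mu_h)]
    threshold = sum(w * m for w, m in zip(direction, midpoint))
    return direction, threshold


def probe_score(direction, activation):
    """Linear probe score: dot product of the activation with the direction."""
    return sum(w * a for w, a in zip(direction, activation))


def synthetic_activations(rng, n, dim, shift):
    """Stand-in for model activations (assumption): Gaussian noise around `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim = 16
honest_train = synthetic_activations(rng, 200, dim, -1.0)
deceptive_train = synthetic_activations(rng, 200, dim, +1.0)
direction, threshold = fit_mean_difference_probe(honest_train, deceptive_train)

# Evaluate on held-out synthetic examples: scores above the threshold are flagged.
honest_test = synthetic_activations(rng, 100, dim, -1.0)
deceptive_test = synthetic_activations(rng, 100, dim, +1.0)
correct = sum(probe_score(direction, x) <= threshold for x in honest_test)
correct += sum(probe_score(direction, x) > threshold for x in deceptive_test)
accuracy = correct / (len(honest_test) + len(deceptive_test))
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The synthetic classes here are far easier to separate than real activations would be, so the accuracy says nothing about how such probes perform on an actual model; the sketch only shows the shape of the technique.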

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey, bilalchughtai

TL;DR: This paper brings together ~30 mechanistic interpretability researchers from 18 different research orgs to review current progress and the main open problems of the field.

This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences. The perspectives presented here do not necessarily reflect the views of any individual author or the institutions with which they are affiliated.

Abstract

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals.

Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about... (read more)

A LW feature that I would find helpful is an easy-to-access list of all links cited by a given post.

Activation space interpretability may be doomed

bilalchughtai, Lucius Bushnaq

TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: It seems likely to find features of the activations (features that help explain the statistical structure of activation spaces) rather than features of the model (the features the model’s own computations make use of).

Written at Apollo Research

Introduction

Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem.

Let’s walk through this claim.

What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In... (read 2273 more words →)

Reasons for and against working on technical AI safety at a frontier AI lab

bilalchughtai

I am about to start working on a frontier lab safety team. This post presents a varied set of perspectives that I collected and thought through before accepting my offer. Thanks to the many people I spoke to about this.

For

You're close to the action. As AI continues to heat up, being closer to the action seems increasingly important. Being at a frontier lab allows you to better understand how frontier AI development actually happens and make better predictions about how it might play out in future. You can build a gears level model of what goes into the design and deployment of current and future frontier systems, and the bureaucratic and political processes behind this,... (read 3428 more words →)

You might want to stop using the Honey extension. Here are some shady things they do, beyond the usual:

- Steal affiliate marketing revenue from influencers (who they also often sponsor), by replacing the genuine affiliate referral cookie with their affiliate referral cookie.
- Deceive customers by deliberately withholding the best coupon codes, while claiming they have found the best coupon codes on the internet; partner businesses control which coupon codes Honey shows consumers.

Book Summary: Zero to One

bilalchughtai

Summary. Zero to One is a collection of notes on startups by Peter Thiel (co-founder of PayPal and Palantir) that grew from a course taught by Thiel at Stanford in 2012. Its core thesis is that iterative progress is insufficient for meaningful progress. Thiel argues that the world can only become better if it changes dramatically, which requires new technology that does not yet exist to be invented. He argues that the right way to do this is not to copy existing things, nor to iterate gradually on existing ideas, but to find fundamentally new company-shaped ideas, and leverage those to change the world. The book discusses recent historical examples of going... (read 2112 more words →)

Remap your caps lock key

bilalchughtai

When was the last time you (intentionally) used your caps lock key?

No, seriously.

Here is a typical US-layout QWERTY (Mac) keyboard. Notice:

- Caps lock is conveniently located only one key away from A, which is where your left pinky should rest on the home row by default.
- Caps lock is absolutely massive.
- How far various other keys you might want to use often are from the home row.

Remap your caps lock key.

I have mine mapped to escape.

Modifier keys such as control or command are also good options (you could then map control/command to escape).

How do I do this, you ask?

- On Mac, System Settings > Keyboard > Keyboard Shortcuts > Modifier Keys.
- On Windows, Microsoft PowerToys' Keyboard Manager is one solution.
- If you use Linux, I trust you can manage on your own.

Thanks to Rudolf for introducing me to this idea.

Should you invest in a Lifetime ISA? (UK)

The Lifetime Individual Savings Account (LISA) is a government saving scheme in the UK intended primarily to help individuals between the ages of 18 and 50 buy their first home (among a few other things). You can hold your money either as cash or in stocks and shares.

The unique selling point of the scheme is that the government will add a 25% bonus on all savings up to £4000 per year. However, this comes with several restrictions. The account is intended to only be used for the following purposes:
1) to buy your first home, worth £450k or less
2) if you are aged 60 or older
3)... (read more)
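As a rough illustration of the bonus arithmetic described above (a 25% bonus on savings up to £4000 per year), here is a minimal Python sketch. The function name is hypothetical, and it assumes that contributions above the annual cap simply earn no bonus; it ignores the usage restrictions entirely.

```python
def lisa_bonus_gbp(contribution_gbp, annual_cap_gbp=4000, bonus_rate=0.25):
    """Government bonus for one year's contribution, per the figures quoted above.

    Assumption for illustration: only the portion of the contribution up to
    the annual cap earns the bonus; anything above it earns nothing.
    """
    eligible = min(max(contribution_gbp, 0), annual_cap_gbp)
    return eligible * bonus_rate


print(lisa_bonus_gbp(1000))  # 250.0
print(lisa_bonus_gbp(4000))  # 1000.0, the maximum bonus in a year
print(lisa_bonus_gbp(6000))  # 1000.0, capped
```

So under these figures the most the bonus can add in any one year is £1000.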
