John Wentworth explains natural latents – a key mathematical concept in his approach to natural abstraction. Natural latents capture the "shared information" between different parts of a system in a provably optimal way. This post lays out the formal definitions and key theorems.

Jeremy Gillen
This post deserves to be remembered as a LessWrong classic.

  1. It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen).
  2. It uses a new diagrammatic method of manipulating sets of independence relations.
  3. It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged.

There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other.

  • Ontology identification involves taking a goal defined in an old ontology[1] and accurately translating it into a new ontology.
  • High-level models and low-level models need to interact in a bounded agent. I.e. learning a high-level fact should influence your knowledge about low-level facts and vice versa.
  • Value identification is the problem of translating values from a human to an AI. This is much like ontology identification, with the added difficulty that we don't get as much detailed access or control over the human world model.
  • Interpretability is about finding recognisable concepts and algorithms in trained neural networks.

In general, we can solve these problems using shared variables and shared sub-structures that are present in both models.

  • We can stitch together very different world models along shared variables. E.g. if you have two models of molecular dynamics, one faster and simpler than the other. You want to simulate in the fast one, then switch to the slow one when particular interactions happen. To transfer the state from one to the other you identify variables present in both models (probably atom locations, velocities, some others), then just copy these values to the other model. Under-specified variables must be inferred from priors.
  • If you want to transfer a new concept from WM1 to a less knowledgeable WM2, you can do so by identifying the lower-level concepts that both WMs share, then constructing an "expla
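As a toy illustration of the stitching recipe in the molecular-dynamics bullet above, here is a minimal Python sketch. Everything in it (the `transfer_state` function, the `PRIOR_SAMPLERS` dict, the specific variable names) is a hypothetical example of mine, not anything from the post or comment: shared variables are copied across directly, and variables the source model never represents are sampled from a prior.

```python
# Minimal sketch: stitching two world models along shared variables.
import random

# State of the fast, coarse model: only positions and velocities.
fast_state = {"positions": [(0.0, 0.0, 0.0), (1.2, 0.3, -0.5)],
              "velocities": [(0.1, 0.0, 0.0), (0.0, -0.2, 0.0)]}

# The slow, detailed model also tracks per-atom charges,
# which the fast model never represents.
SLOW_MODEL_VARIABLES = ["positions", "velocities", "charges"]

# Prior samplers for variables the fast model leaves unspecified (made up).
PRIOR_SAMPLERS = {"charges": lambda n: [random.gauss(0.0, 0.1) for _ in range(n)]}

def transfer_state(source_state, target_variables, prior_samplers, n_atoms):
    """Copy shared variables; sample under-specified ones from priors."""
    target_state = {}
    for var in target_variables:
        if var in source_state:
            target_state[var] = source_state[var]             # shared: copy directly
        else:
            target_state[var] = prior_samplers[var](n_atoms)  # missing: infer from prior
    return target_state

slow_state = transfer_state(fast_state, SLOW_MODEL_VARIABLES, PRIOR_SAMPLERS, n_atoms=2)
print(slow_state["charges"])  # filled in from the prior, not copied
```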


Understanding deep learning isn’t a leaderboard sport - handle with care.

Saliency maps, neuron dissection, sparse autoencoders - each surged on hype, then stalled[1] when follow‑up work showed the insight was mostly noise, easily spoofed, or valid only in cherry‑picked settings. That risks being negative progress: we spend cycles debunking ghosts instead of building cumulative understanding.

The root mismatch is methodological. Mainstream ML capabilities research enjoys a scientific luxury almost no other field gets: public, quantitative benchmarks that tie effort to ground truth. ImageNet accuracy, MMLU, SWE‑bench - one number silently kills bad ideas. With that safety net, you can iterate fast on weak statistics and still converge on something useful. Mechanistic interpretability has no scoreboard for “the network’s internals now make sense.” Implicitly inheriting benchmark‑reliant habits from mainstream ML therefore swaps a ruthless filter for a fog of self‑deception.

How easy is it to fool ourselves? Recall the “Could a neuroscientist understand a microprocessor?” study: standard neuroscience toolkits - ablation tests, tuning curves, dimensionality reduction - were applied to a 6502 chip whose ground truth is fully known. The analyses produced plausible‑looking stories that entirely missed how the processor works. Interpretability faces the same trap: shapely clusters or sharp heat‑maps can look profound until a stronger test dissolves them.

What methodological standard should replace the leaderboard? Reasonable researchers will disagree[2]. Borrowing from mature natural sciences like physics or neuroscience seems like a sensible default, but a proper discussion is beyond this note. The narrow claim is simpler:

Because no external benchmark will catch your mistakes, you must design your own guardrails. Methodology for understanding deep learning is an open problem, not a hand‑me‑down from capabilities work.

So, before shipping the next clever probe, pause and ask: Where could I be fooling myself, and what concrete test would reveal it? If you don’t have a clear answer, you may be sprinting without the safety net this methodology assumes - and that’s precisely when caution matters most.

  1. ^

    This is perhaps a bit harsh - I think SAEs for instance still might hold some promise, and neuron-based analysis still has its place, but I think it's fair to say the hype got quite ahead of itself.

  2. ^

    Exactly how much methodological caution is warranted here will obviously be a point of contention. Everyone thinks the people going faster than them are reckless and the people going slower are needlessly worried. My point here is just to think actively about the question - don't just blindly inherit standards from ML.

plex

@Daniel Kokotajlo I think AI 2027 strongly underestimates current research speed-ups from AI. It expects the research speed-up is currently ~1.13x. I expect the true number is more likely around 2x, potentially higher.

Points of evidence:

  1. I've talked to someone at a leading lab who concluded that AI getting good enough to seriously aid research engineering is the obvious interpretation of the transition to a faster doubling time on the METR benchmark. I claim advance prediction credit for new datapoints not returning to 7 months, and instead holding out at 4 months. They also expect more phase transitions to faster doubling times; I agree and stake some epistemic credit on this (unsure when exactly, but >50% on this year moving to a faster exponential).
  2. I've spoken to a skilled researcher originally from physics who claims dramatically higher current research throughput. Often 2x-10x, and many projects that she'd just not take on if she had to do everything manually.
  3. The leader of an 80-person engineering company, which employs the two best devs I've worked with, recently told me that for well-specified tasks the latest models are now better than their top devs. He said engineering is no longer a bottleneck.
  4. Regularly hanging around on channels with devs who comment on the latest models, and getting the vibes of how much it seems to be speeding people up.

If correct, this propagates through the model to much shorter timelines.
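To see how much the doubling time from point 1 matters once it propagates through, here is a toy back-of-the-envelope sketch. This is not AI 2027's actual model; the 1-hour starting horizon and the ~1-month target are made-up round numbers for illustration only.

```python
# Toy back-of-the-envelope: how the METR doubling time changes the
# extrapolated arrival of a "1-month" task horizon.
# Assumptions (illustration only): start at a 1-hour horizon,
# target ~167 work-hours (~1 month), compare 7- vs 4-month doubling.
import math

start_horizon_hours = 1.0
target_horizon_hours = 167.0

for doubling_months in (7, 4):
    doublings_needed = math.log2(target_horizon_hours / start_horizon_hours)
    years = doublings_needed * doubling_months / 12
    print(f"{doubling_months}-month doubling: ~{years:.1f} years to a 1-month horizon")
# 7-month doubling: ~4.3 years; 4-month doubling: ~2.5 years.
```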

Please do an epistemic spot check on these numbers by talking to representative people in ways that would turn up evidence about current speed-ups.[1]

Edit: Eli said he'd be enthusiastic for someone else to get fresh data; I'm going to take a shot at this.

(also @Scott Alexander @Thomas Larsen @elifland @romeo @Jonas V)

  1. ^

    You might so far have mostly been talking to the very best researchers who are probably getting (or at least claiming, obvious reason for tinted glasses here) smaller speed-ups?


Rediscovering some math.

[I actually wrote this in my personal notes years ago. Seemed like a good fit for quick takes.]

I just rediscovered something in math, and the way it came out to me felt really funny.

I was thinking about startup incubators, and thinking about how it can be worth it to make a bet on a company that you think has only a one in ten chance of success, especially if you can incubate, y'know, ten such companies.

And of course, you're not guaranteed success if you incubate ten companies, in the same way that you can flip a coin twice and have it come up tails both times. The expected value is one, but the probability of at least one success is not one.

So what is it? More specifically, if you consider ten such 1-in-10 events, do you think you're more or less likely to have at least one of them succeed? It's not intuitively obvious which way that should go.

Well, if they're independent events, then the probability of all of them failing is 0.9^10, or $(1 - \frac{1}{10})^{10} \approx 0.35$.

And therefore the probability of at least one succeeding is $1 - 0.35 = 0.65$. More likely than not! That's great. But not hugely more likely than not.

(As a side note, how many events do you need before you're more likely than not to have one success? It turns out the answer is 7. At seven 1-in-10 events, the probability that at least one succeeds is 0.52, and at 6 events, it's 0.47.)
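(If you want to check that for yourself, here is a minimal sketch of the same calculation:)

```python
# With n independent 1-in-10 events, P(at least one success) = 1 - 0.9**n.
for n in range(1, 11):
    print(n, round(1 - 0.9 ** n, 2))
# n=6 gives 0.47, n=7 gives 0.52 -- seven events is the break-even point.
```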

So then I thought, it's kind of weird that that's not intuitive. Let's see if I can make it intuitive by stretching the quantities way up and down — that's a strategy that often works. Let's say I have a 1-in-a-million event instead, and I do it a million times. Then what is the probability that I'll have had at least one success? Is it basically 0 or basically 1?

...surprisingly, my intuition still wasn't sure! I would think, it can't be too close to 0, because we've rolled these dice so many times that surely they came up as a success once! But that intuition doesn't work, because we've exactly calibrated the dice so that the number of rolls is the same as the unlikelihood of success. So it feels like the probability also can't be too close to 1.

So then I just actually typed this into a calculator. It's the same equation as before, but with a million instead of ten. I added more and more zeros, and then what I saw was that the number just converges to somewhere in the middle.
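Here is that calculator experiment as a minimal sketch, using nothing but the formula from before with bigger and bigger n:

```python
# P(at least one success) = 1 - (1 - 1/n)**n, for a 1-in-n event tried n times.
for n in (10, 1_000, 1_000_000, 1_000_000_000):
    print(n, 1 - (1 - 1 / n) ** n)
# 0.651..., 0.632..., 0.632..., 0.632... -- it settles in the middle,
# not at 0 and not at 1.
```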

If it was the 1300s then this would have felt like some kind of discovery. But by this point, I had realized what I was doing, and felt pretty silly. Let's drop the "$1 -$", and look at this limit;

$$\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n$$

If this rings any bells, then it may be because you've seen this limit before;

$$\lim_{n\to\infty}\left(1+\frac{1}{n}\right)^n = e$$

or perhaps as

$$e^x = \lim_{n\to\infty}\left(1+\frac{x}{n}\right)^n$$

The probability I was looking for was $1 - \frac{1}{e}$, or about 0.632.

I think it's really cool that my intuition somehow knew to be confused here! And to me this path of discovery was way more intuitive than just seeing the standard definition, or wondering about functions that are their own derivatives. I also think it's cool that this path made $e$ pop out on its own, since I almost always think of e in the context of an exponential function, rather than as a constant. It also makes me wonder if 1/e is more fundamental than $e$. (Similar to how $\tau$ is more fundamental than $\pi$.)


Has anyone here had therapy to help handle thoughts of AI doom? How did it go? What challenges did you face explaining it or being taken seriously, and what kind of therapy worked, if any? 

I went to a therapist for 2 sessions and received nothing but blank looks when I tried to explain what I was trying to process. I think it was very unfamiliar ground for them and they didn't know what to do with me. I'd like to try again, but if anyone here has guidance on what worked for them, I'd be interested.

I've also started basic meditation, which continues to be a little helpful.


Quick take titles should end in a period.

Quick takes (previously known as short forms) are often viewed via preview on the front page. This preview removes formatting and newlines for space reasons. So, if your title doesn't end in a period, and especially if capitalization doesn't clearly denote a sentence boundary (like in this case, where the first sentence starts with "I"), then it might be confusing.


Popular Comments

I think it's a good direction to move in. But I usually don't think of it as "trying to become a wizard" or some kind of self-improvement. When I do something, it's because I'm interested in the thing. Like making a video game because I had an idea for it, or reading an economics textbook because I was curious about economic questions. The challenge is maintaining a steady flow of such projects; I've found that in "steady state" I do about one per year, which isn't a lot. So maybe ambition would actually help? Idk.
Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) = 2.3.

Now if these tasks were log-uniformly distributed from 1 second to 16 hours, we would still be able to compare these results to the rest of our dataset, but it's worse than that: all fully_private tasks are too difficult for models before GPT-4, so removing SWAA requires restricting the x axis to the 2 years since GPT-4. This is 1/3 of our x axis range, so it makes error bars on the trendline slope 3 times larger. Combined with the previous effect the error bars become 6.9 times larger than the main paper! So the analysis in this post, although it uses higher-quality data, basically throws away all the statistical power of the paper.

I wish I had 170 high-quality private tasks but there was simply not enough time to construct them. Likewise, I wish we had baselines for all tasks, but sadly we will probably move to a different safety case framework before we get them. Though SWAA tasks have their own problems (eg many of them were written by Claude), they were developed in-house and are private, and it's unlikely that GPT-2 and GPT-3 would have a horizon like 10x different on a different benchmark. So for this check we really should not exclude them.

privacy_level       num_tasks
fully_private       32
public_problem      22
easy_to_memorize    18
public_solution     17
semi_private        2

When we include SWAA, the story is not too different from the original paper: doubling roughly 8 months. Note how large the error bars are. With only fully_private tasks and SWAA and RE-Bench, the extrapolated date of 1-month AI looks similar to the main paper, but the error bars are much larger; this is entirely driven by the small number of tasks (green bar). For comparison, the main paper extrapolation.

When I exclude SWAA and RE-Bench too, the script refuses to even run, because sometimes the time horizon slope is negative, but when I suppress those errors and take the 2024-2025 trend we get 80% CI of early 2026 to mid-2047! This is consistent with the main result but pretty uninformative. You can see that with only fully_private tasks (ignore the tasks under 1 minute; these are SWAA), it's hard to even tell whether longer tasks are more difficult in the <1 hour range (tiles plot). As such we should be suspicious of whether the time horizon metric works at all with so few tasks.
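For concreteness, here is the error-bar arithmetic from the comment above as a minimal sketch; the sqrt(170/32) task-count factor and the factor of 3 from the shortened x-axis are taken directly from the comment, and the rest is just multiplication.

```python
# Rough error-bar inflation from restricting the task set, per the comment above.
import math

task_count_factor = math.sqrt(170 / 32)  # fewer tasks -> error bars ~sqrt(170/32) wider
x_axis_factor = 3                        # x-axis cut to ~1/3 -> slope error bars ~3x wider
print(round(task_count_factor, 1))                   # ~2.3
print(round(task_count_factor * x_axis_factor, 1))   # ~6.9x wider than the main paper
```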
I'm curious if your team has any thoughts on my post Some Thoughts on Metaphilosophy, which was in large part inspired by the Debate paper, and also seems relevant to "Good human input" here. Specifically, I'm worried about this kind of system driving the simulated humans out of distribution, either gradually or suddenly, accidentally or intentionally. And distribution shift could cause problems either with the simulation (presumably similar to or based on LLMs instead of low-level neuron-by-neuron simulation), or with the human(s) themselves. In my post, I talked about how philosophy seems to be a general way for humans to handle OOD inputs, but tends to be very slow and may be hard for ML to learn (or needs extra care to implement correctly). I wonder if you agree with this line of thought, or have some other ideas/plans to deal with this problem. Aside from the narrow focus on "good human input" in this particular system, I'm worried about social/technological change being accelerated by AI faster than humans can handle it (due to similar OOD / slowness of philosophy concerns), and wonder if you have any thoughts on this more general issue.

Recent Discussion

I'm not sure what the exact process was, tbh my guess is that they were estimated mostly independently but likely sanity checked with the survey to some extent in mind. It seems like they line up about right, given the 2022 vs. 2023 difference, the intuition regarding underadjusting for labor->progress, and giving weight to our own views as well rather than just the survey, given that we've thought more about this than survey takers (while of course they have the advantage of currently doing frontier AI research).

I'd make less of an adjustment if we ask... (read more)

elifland
Yup, seems good
plex
Okay, switched. I'm curious about why you didn't set the baseline to "no AI help", especially if you expect pre-2024 AI to be mostly useless, as that seems like a cleaner comparison than asking people to remember how good old AIs were?
Cole Wyeth
You seem to be referring to comments from the CEO that more than 25% of code at Google is written by AI (and reviewed by humans). I’m not sure how reliable this number is, and it remains to be seen whether this is sustainable. It also doesn’t seem like a vast productivity boost (though it would be pretty significant, probably more than I expect, so would update me). 

We are having another rationalist Shabbat event at Rainbow Star House this Friday. The plan going forward will be to do one most Fridays. Email or DM me for the address if you haven’t been before.

We have pita, dips, and dessert planned, but are still looking for someone to bring a big pot of food this week-- if you’re able to bring a vegan chili or similar, please let me know.

Doors open at 5:30, ritual and food at 6:30.

What is this event?

At rationalist Shabbat each week, we light candles, sing Landsailor, eat together, and discuss topics of interest and relevance to the rationalist crowd. If you have suggestions for topics, would like to help contribute food, or otherwise assist with organizing, let us know.

This is a kid-friendly event -- we have young kids, so we have space and toys for them to play and hang out while the adults are chatting.
 

Update: this week we'll be singing Ballad of Smallpox Gone in honor of Smallpox Eradication Day yesterday. Also: please do RSVP if you haven't yet.

JohnofCharleston
Brought my car to the office, hopefully there by 6. 
Archimedes
Limiting China's computing power via export controls on hardware like GPUs might be accelerating global progress in AI capabilities. When Chinese labs are compute-starved, their research will differentially focus on efficiency gains compared to counterfactual universes where they are less limited. So far, they've been publishing their research, and their tricks can quickly be incorporated by anyone else. US players can leverage their compute power, focusing on experiments and scaling while effectively delegating research topics that China is motivated to handle. Google and OpenAI benefit far more from DeepSeek than they do from Meta.
ryan_greenblatt
One of the main drivers, perhaps the main driver[1], of algorithmic progress is compute for experiments. It seems unlikely that the effect you note could compensate for the reduced pace of capabilities progress.

  1. ^

    Both labor and compute have been scaled up over the last several years at big AI companies. My understanding is the scaling in compute was more important for algorithmic progress as it is hard to parallelize labor, the marginal employee is somewhat worse, the number of employees has been growing slower than compute, and the returns to compute vs faster serial labor seem similar at current margins. That's not to say employees don't matter, I'd guess Meta is substantially held back by worse employees (and maybe worse management).

Both labor and compute have been scaled up over the last several years at big AI companies. My understanding is the scaling in compute was more important for algorithmic progress

That may be the case, but I suppose that in the last several years, compute has been scaled up more than labor. (Labor cost is entirely recurring, while compute cost is a one-time cost plus a recurring electricity cost, and progress in compute hardware, from smaller integrated circuits, means that compute cost is decreasing over time.) Then obviously that doesn't necessari... (read more)


John: So there’s this thing about interp, where most of it seems to not be handling one of the standard fundamental difficulties of representation, and we want to articulate that in a way which will make sense to interp researchers (as opposed to philosophers). I guess to start… Steve, wanna give a standard canonical example of the misrepresentation problem?

Steve: Ok so I guess the “standard” story as I interpret it goes something like this:

  • In order to respond to dualists who thought the mind was inherently non-physical, materialist philosophers wanted a “naturalistic” account of mental representation - where “mental representation” basically means the same thing as “content” or “semantics”. We tend to use the term intentionality for mental representation. This is a technical term that’s not the same as
...

Counterpoint: https://www.lesswrong.com/s/gEvTvhr8hNRrdHC62

Joseph Bloom
I think this is a valuable read for people who work in interp, but I want to add a few ideas:

  • Distinguishing Misrepresentation from Mismeasurement: Interpretability researchers use techniques that find vectors which we say correspond to the representations of the model, but the methods we use to find those may be imperfect. For example, if your cat SAE feature also lights up on racoons, then maybe this is a true property of the model's cat detector (that it also lights up on racoons) or maybe this is an artefact of the SAE loss function. Maybe the true cat detector doesn't get fooled by racoons, but your SAE latent is biased in some way. See this paper that I supervised for more concrete observations.
  • What are the canonical units? It may be that there is a real sense in which the model has a cat detector, but maybe at the layer at which you tried to detect it, the cat detector is imperfect. If the model doesn't function as if it has an imperfect cat detector, then maybe downstream of the cat-detector is some circuitry for catching/correcting specific errors. This means that the local cat detector you've found, which might have misrepresentation issues, isn't in itself sufficient to argue that the model as a whole has those issues. Selection pressures apply to the network as a whole and not necessarily always to the components. The fact that we see so much modularity is probably not random (John's written about this) but if I'm not mistaken, we don't have strong reasons to believe that the thing that looks like a cat detector must be the model's one true cat detector.

I'd be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (e.g. there is a noisy signal but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying intermediate representations in such models could be interesting.

It is important to remember that although languages and brains evolved alongside each other, these are separate systems.

The human brain has evolved to be a fast learner of whatever common language it is exposed to. And languages themselves have evolved to be as accessible as possible to new speakers.

One may say that language summarizes the shared experience of its speakers. Let me play with an analogy to arithmetic:

There is an infinite set of numbers that have several kinds of properties (negative or positive, rational or irrational) and several kinds of relations between them (summation, product, etc.). To describe them all, you don't need an infinitely large database. You can describe the entire model with several axioms and theorems. Furthermore, you can build upon it, inventing new concepts...


For months, I had the feeling: something is wrong. Some core part of myself had gone missing.

I had words and ideas cached, which pointed back to the missing part.

There was the story of Benjamin Jesty, a dairy farmer who vaccinated his family against smallpox in 1774 - 20 years before the vaccination technique was popularized, and the same year King Louis XV of France died of the disease.

There was another old post which declared “I don’t care that much about giant yachts. I want a cure for aging. I want weekend trips to the moon. I want flying cars and an indestructible body and tiny genetically-engineered dragons.”.

There was a cached instinct to look at certain kinds of social incentive gradient, toward managing more people or growing an organization or playing...

What's your favorite times you've used CAD/CNC or 3D printing? Or what's your most likely place to make use of it?

amitlevy49
I assume the idea is that bupropion is good at giving you the natural drive to do the kind of projects he describes?
amitlevy49
The framing of science and engineering as isomorphic to wizard power immediately reminds me of the anime Dr. Stone, if you haven't watched it I think you may enjoy it, at least as a piece of media making the same type of point you are making.
Gurkenglas
Yeah, the underling part was a joke :D
This is a linkpost for https://arxiv.org/abs/2505.03989

This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.

Executive summary 

AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence). 

TL;DR Debate + exploration guarantees + solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts. 

This post sets out: 

  • What debate can be used to achieve.
  • What gaps remain.
  • What research is needed to solve them. 

These gaps form the basis for one of the research agendas of UK AISI’s new alignment team: we aim to dramatically scale up ASI-relevant research on debate. We’ll...

I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we'll have a short doc soon (next week or so) which is somewhat related, along the lines of "assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently differ... (read more)

Marie_DB
I'm definitely also worried about collusion between the debaters to deceive the judge! That's what we try to address with the exploration guarantees in the sketch. The thinking is: If a debater is, say, deliberately not pointing out a flaw in an argument, then there's an alternative strategy that would get the debater higher reward on the episode (i.e. pointing out the flaw). So if we can verify that there wouldn't be significant gains from further exploration (ie trying out more alternative strategies), that's some evidence against this kind of collusion. But of course, we're only gesturing at some potential ways you might get exploration guarantees - we don't know yet if any of them will work. I'm also worried about collusion between the debaters and the judge, and we don't address this much in the sketch, though I think it could in principle be dealt with in the same way. I'm also imagining that the judge model would be much less capable (it only needs to be human-level in a narrower domain), which might mean it's incapable of exploration hacking.
Marie_DB
Interesting post! Could you say more about what you mean by "driving the simulated humans out of distribution"? Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or humans might be worse than on the training distribution)"?  The solution in the sketch is to keep the question distribution during deployment similar + doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?
Marie_DB
Thanks for the comment, all sensible scepticisms IMO! I agree that the low-stakes context part of the sketch (key claim 4) is the weakest part, and especially we don't emphasise enough the defeater of "the deployment set-up just won't be upheld" (because of persuasion, as you mention, but also e.g. the agent messing with the online learning process in the offline datacentre). We spent less time on it because we ultimately want to (try to) expand to high-stakes contexts, which will look pretty different, so this was more of a stop-gap rough picture so we could focus on getting the rest right. That said, I'm maybe more optimistic than you that there'll be a relevant period where the above issues can be sufficiently dealt with via control and debate is pretty important for catching subtler research sabotage. On debaters converging to honesty rather than subtle manipulation: I'm also pretty unsure if this will work and keen to see how it plays out empirically once we get LLMs that are a bit better at debate. I do think recursive debate makes it more likely that honesty is a winning strategy (relative to human-style debates) because debaters can lose on a single inconsistency or manipulative argumentative strategy, rather than being able to bury it among lots of claims (see also my reply to Charlie Steiner below).

Each brick = one uncertainty

TL;DR

Maybe instead of writing task lists, reframe macro objectives in terms of nested questions until you reach 'root' testable experiments.
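As a toy illustration of the nesting idea, here is a minimal sketch; the `QuestionNode` class and the example questions are my own hypothetical stand-ins, not the Thought-Tree tool itself.

```python
# Toy sketch: a macro objective broken into nested questions,
# bottoming out in "root" questions that are directly testable.
from dataclasses import dataclass, field

@dataclass
class QuestionNode:
    question: str
    testable: bool = False          # True if a concrete experiment can settle it
    children: list["QuestionNode"] = field(default_factory=list)

goal = QuestionNode("Can my method X improve benchmark Y?", children=[
    QuestionNode("Does X run end-to-end on a small subset of Y?", testable=True),
    QuestionNode("Which component of X drives any gains?", children=[
        QuestionNode("Does ablating component A change the score?", testable=True),
        QuestionNode("Does ablating component B change the score?", testable=True),
    ]),
])

def leaves_to_test(node):
    """Walk the tree and collect the testable 'root' questions."""
    if node.testable:
        yield node.question
    for child in node.children:
        yield from leaves_to_test(child)

for q in leaves_to_test(goal):
    print("-", q)
```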

Putting this into practice

I built a tool for myself, ‘Thought-Tree’ here, to try and systematise what I wrote in this post. Maybe it works out for you as well?

Essays I am Thinking About, and that Inspired this Post

Related essays by me

What I am planning to read

...