All of Thomas Kwa's Comments + Replies

I'd guess that a cheaper, wall-mounted version of CleanAirKits/Airfanta would be a better product. It's just a box with fans and slots for filters, the installation labor is significantly lower, you get better aesthetics, and not everyone has 52 inch ceiling fans at a standardized mounting length already so the market is potentially much larger with a standalone device.

The problem with the ceiling fan is that it's not designed for static pressure, so its effectiveness at moving air through the filter will depend on contingent factors like the blade area ra... (read more)

The uplift equation:

What is required for AI to provide net speedup to a software engineering project, when humans write higher quality code than AIs? It depends on how it's used.

Cursor regime

In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write:

Where

  •  is the time for the human to write the code, either from scratch or after rejecting an AI suggestion
  •  is the time for the AI to generate the code in tokens per secon
... (read more)
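The equation itself doesn't survive in this preview, but here is a hedged sketch of one form the Cursor-regime condition might take, with $p$ the probability the human accepts the AI's suggestion, $t_{\text{AI}}$ the generation time, $t_{\text{read}}$ the review time, and $t_{\text{human}}$ the time to write the code by hand (my placeholder notation, not necessarily the original's):

$$p\,(t_{\text{AI}} + t_{\text{read}}) + (1-p)\,(t_{\text{AI}} + t_{\text{read}} + t_{\text{human}}) < t_{\text{human}}$$

That is, the expected cost of generating and reviewing a suggestion, plus rewriting it on rejection, must come in below the cost of just writing the code yourself.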
1np_x
This calculus changes when you can work on many things at once (similar to @faul_sname's comment, but here the point is that you can work on many projects at once, even if they can't each be parallelized well).
5faul_sname
Unless the work can be parallelized.

Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the paper? From the abstract of that paper:

Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.

This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better executio... (read more)

Yeah, I completely agree this is a good research direction! My only caveat is I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this, just explaining my position here). The inevitable problem with interpretability benchmarks (which to be clear, your paper appears to make a serious effort to address) is that you either:

  1. Train the model in a realistic way - but then you don’t know if the model really learned the algorithm you expected it to
  2. Train the model to force it to learn a particular algorithm -
... (read more)

Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) ≈ 2.3.
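The scaling here is the usual behavior of standard errors under roughly independent task-level noise:

$$\mathrm{SE} \propto \frac{1}{\sqrt{n}} \quad\Rightarrow\quad \frac{\mathrm{SE}_{n=32}}{\mathrm{SE}_{n=170}} = \sqrt{\frac{170}{32}} \approx 2.3$$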

Now if the... (read more)

I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems to be slightly smaller (could be noise), and also Claude 3.7 does worse than o1. This means the doubling rate for pass@8 may be slightly slower than for pass@1. However, if horizon length increase since 2023 were only due to RL, the improvement from pass@8 would be barely half as much as the improvement in pass@1. It's faster, which could be due to some combination of the following:

  • o1 has a better base model than GPT-4
  • HCAST is an example of "emergent capabilitie
... (read more)

Agree, I'm pretty confused about this discrepancy. I can't rule out that it's just the "RL can enable emergent capabilities" point.

1Cole Wyeth
I am not confused, the results of this paper are expected on my model. 

The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.

Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It's not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon. 

  • Releas
... (read more)
7Thane Ruthenis
Fair, also see my un-update edit. Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I'd previously complained, I don't think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.

o3 and o4-mini solve more than zero of the >1hr tasks that Claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it's not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.

1Cole Wyeth
Okay, that’s a lot more convincing. Was it available publicly? I seem to have missed it. Surprisingly high success probability on a 16 hour task - does this just mean it made some progress on the 16 hour task? Pokémon fell today too, we might be cooked. 

Other metrics also point to drone-dominated and C&C dominated war. E.g. towed artillery is too vulnerable to counterbattery fire, and modern mobile artillery like CAESAR must use "shoot and scoot" tactics-- it can fire 6 shells within two minutes of stopping and vacate before its last shell lands. But now drones attack them while moving too.

Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:

  • In my everyday work, the gap between well-defined task ability and working with the METR codebase is growing
  • 4 month doubling time is faster than the rate of progress in most other realistic or unrealistic domains
  • Recent models really like to reward hack, suggesting that RL can cause some behaviors not relevant to realistic tasks

This trend will break at some point, e.g. when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when.

GDM paper: Evaluating the Goal-Directedness of Large Language Models

Tom Everitt, Rohin Shah, and others from GDM attempt to measure "whether LLMs use their capabilities towards their given goal". Unlike previous work, their measure is not just rescaled task performance-- rather, an AI is goal-directed if it uses its capabilities effectively. A model that is not goal-directed when attempting a task will have capabilities but not properly use them. Thus, we can measure goal-directedness by comparing a model's actual performance to how it should perform if it... (read more)

2Daniel Tan
Interesting paper. Quick thoughts:
  • I agree the benchmark seems saturated. It's interesting that the authors frame it the other way - Section 4.1 focuses on how models are not maximally goal-directed.
  • It's unclear to me how they calculate the goal-directedness for 'information gathering', since that appears to consist only of 1 subtask.

Time horizon of o3 is ~1.5 hours vs Claude 3.7's 54 minutes, and it's statistically significant that it's above the long-term trend. It's been less than 2 months since the release of Claude 3.7. If time horizon continues doubling every 3.5 months as it has over the last year, we only have another 12 months until time horizon hits 16 hours and we are unable to measure it with HCAST.
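The arithmetic behind that 12-month figure:

$$\log_2\!\left(\frac{16\ \text{hr}}{1.5\ \text{hr}}\right) \approx 3.4\ \text{doublings} \times 3.5\ \text{months per doubling} \approx 12\ \text{months}.$$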

My guess is that future model time horizon will double every 3-4 months for well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.

1snewman
What's your basis for "well-defined tasks" vs. "realistic tasks" to have very different doubling times going forward? Is the idea that the recent acceleration seems to be specifically due to RL, and RL will be applicable to well-defined tasks but not realistic tasks? This seems like an extremely important question, so if you have any further thoughts / intuitions / data to share, I'd be very interested.

Benchmark Readiness Level

Safety-relevant properties should be ranked on a "Benchmark Readiness Level" (BRL) scale, inspired by NASA's Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:

| BRL | Definition | Example |
|-----|------------|---------|
| 1 | Theoretical relevance to x-risk defined | Adversarial competence |
| 2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; Misaligned goals |
| 3 | Behavior (or all parts) observed i | |
... (read more)

Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the excep... (read more)

1wonder
Would the takeover of small countries also cover humans using just an advanced AI to take over? (Or would a takeover by humans using advanced AI happen faster?)
Buck

A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Quick list of reasons for me:

  • I'm averse to attending mass protests myself because they make it harder to think clearly and I usually don't agree with everything any given movement stands for.
  • Under my worldview, an unconditional pause is a much harder ask than is required to save most worlds if p(doom) is 14% (the number stated on the website). It seems highly impractical to implement compared to more common regulatory frameworks and is also super unaesthetic because I am generally pro-progress.
  • The economic and political landscape around AI is complicated e
... (read more)

I basically agree with this. The reason the paper didn't include this kind of reasoning (only a paragraph about how AGI will have infinite horizon length) is we felt that making a forecast based on a superexponential trend would be too much speculation for an academic paper. (There is really no way to make one without heavy reliance on priors; does it speed up by 10% per doubling or 20%?) It wasn't necessary, given that the 2027 and 2029-2030 dates for 1-month AI derived from extrapolation already roughly bracketed our uncertainty.
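To illustrate the prior-sensitivity point, here is a toy sketch (my own illustrative parameterization, not anything from the paper) of how much a superexponential forecast depends on the assumed speedup per doubling:

```python
import math

def months_until_horizon(target_hours, current_hours=1.5,
                         first_doubling_months=7.0, speedup_per_doubling=0.10):
    """Toy superexponential extrapolation: each successive doubling of the time
    horizon takes (1 - speedup_per_doubling) times as long as the one before it.
    All parameter values here are illustrative, not the paper's."""
    doublings_needed = math.log2(target_hours / current_hours)
    months, step = 0.0, first_doubling_months
    for _ in range(math.ceil(doublings_needed)):
        months += step
        step *= 1 - speedup_per_doubling
    return months

# Time until a ~1-month (~167 work-hour) horizon under two speedup assumptions:
print(round(months_until_horizon(167, speedup_per_doubling=0.10)))  # ~37 months
print(round(months_until_horizon(167, speedup_per_doubling=0.20)))  # ~28 months
```

Even this small change in the per-doubling speedup shifts the forecast by the better part of a year, which is the kind of prior-dependence described above.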

External validity is a huge concern, so we don't claim anything as ambitious as average knowledge worker tasks. In one sentence, my opinion is that our task suite is fairly representative of well-defined, low-context, measurable software tasks that can be done without a GUI. More speculatively, horizons on this are probably within a large (~10x) constant factor of horizons on most other software tasks. We have a lot more discussion of this in the paper, especially in heading 7.2.1 "Systematic differences between our tasks and real tasks". The HCAST paper ... (read more)

Humans don't need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof it isn't a hard constraint. It might need an architecture change but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.

Regarding 1 and 2, I basically agree that SWAA doesn't provide much independent signal. The reason we made SWAA was that models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. 3 is definitely a concern and we're currently collecting data on open-source PRs to get a more representative sample of long tasks.

That bit at the end about "time horizon of our average baseliner" is a little confusing to me, but I understand it to mean "if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can't reliably perform tasks that take longer than an hour". Which is a pretty interesting point.

That's basically correct. To give a little more context for why we don't really believe this number, during data collection we were not really trying to measure the human success rate, just get successful human runs and measure their time.... (read more)

All models since at least GPT-3 have had this steep exponential decay [1], and the whole logistic curve has kept shifting to the right. The 80% success rate horizon has basically the same 7-month doubling time as the 50% horizon so it's not just an artifact of picking 50% as a threshold.

Claude 3.7 isn't doing better on >2 hour tasks than o1, so it might be that the curve is compressing, but this might also just be noise or imperfect elicitation.

Regarding the idea that autoregressive models would plateau at hours or days, it's plausible, and one point of... (read more)

One possible interpretation here is going back to the inner-monologue interpretations as being multi-step processes with an error rate per step where only complete success is useful, which is just an exponential; as the number of steps increase from 1 to n, you get a sigmoid from ceiling performance to floor performance at chance. So you can tell the same story about these more extended tasks, which after all, are just the same sort of thing - just more so. We also see this sort of sigmoid in searching with a fixed model, in settings like AlphaZero in Hex... (read more)
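A minimal numeric sketch of this story (the per-step error rates are arbitrary, purely for illustration):

```python
import numpy as np

# Toy version of the per-step error model: a task of n steps succeeds only if
# every step succeeds, each independently with probability 1 - eps. Success
# probability then decays exponentially in n, which traces out a sigmoid from
# ceiling to floor when n is plotted on a log scale.
def p_success(n_steps, eps):
    return (1 - eps) ** n_steps

for eps in (0.01, 0.001):
    n_50 = np.log(0.5) / np.log(1 - eps)  # task length at which success hits 50%
    print(f"per-step error {eps}: 50% success at ~{n_50:.0f} steps")
```

Under this model, halving the per-step error rate roughly doubles the 50%-success task length, so a smooth improvement in per-step reliability shows up as an exponential trend in horizon length.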

It's expensive to construct and baseline novel tasks for this (we spent well over $100k on human baselines) so what we are able to measure in the future depends on whether we can harvest realistic tasks that naturally have human data. You could do a rough analysis on math contest problems, say assigning GSM8K and AIME questions lengths based on a guess of how long expert humans take, but the external validity concerns are worse than for software. For one thing, AIME has much harder topics than GSM8K (we tried to make SWAA not be artificially easier or harder than HCAST); for another, neither are particularly close to the average few minutes of a research mathematician's job.

The trend probably sped up in 2024. If the future trend follows the 2024-2025 trend, we get 50% reliability at 167 hours in 2027.

Author here. My best guess is that by around the 1-month point, AIs will be automating large parts of both AI capabilities and empirical alignment research. Inferring anything more depends on many other beliefs.

Currently no one knows how hard the alignment problem is or what exactly good alignment research means-- it is the furthest-looking, least well-defined and least tractable of the subfields of AI existential safety. This means we don't know the equivalent task length of the alignment problem. Even more importantly, we only measured the AIs at softwar... (read more)

2Garrett Baker
You probably mention this somewhere, but I'll ask here, are you currently researching whether these results hold for those other domains? I'm personally more interested about math than law.

AIs (and humans) don't have 100% reliability at anything, so the graph tracks when AIs get a 50% success rate on our dataset, over all tasks and attempts. We also measure AI horizons at 80% success rate in the paper, and those are about 5x shorter. It's hard to measure much higher than 80% with our limited task suite, but if we could we would measure 95% and 99% as well.

2mattmacdermott
I think the commenter is asking something a bit different - about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1 hour long tasks I can read it as a 50% success rate on average across all of the tasks any knowledge worker faces? Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they’re selected to be measurable, etc

So the citation is an unreleased paper! That unreleased paper may make a splash, since (assuming this 7-month-doubling trend is not merely 1-2 years old) it strongly implies we really will find good solutions for turning LLMs agentic fairly soon.

The 7-month doubling trend we measured actually goes back to GPT-2 in 2019. Since 2024, the trend has been faster, doubling roughly every 3-4 months depending on how you measure, but we only have six 2024-25 models so error bars are wide and it's really unclear which trend will be predictive of the future.

Important results from the paper as I see it:

  • RL agents can learn to cheat in the wild, and CoT pressure doesn't prevent it.
  • The slope of cheating increase and final cheating rate are both slightly lower when applying CoT pressure, but all of the cheating will go uncaught by the monitor.
  • GPT-4o can zero-shot monitor o3-level agents with 95% accuracy when it has access to CoT, but only 60% when it can only monitor actions.
  • If there is no CoT pressure, RL agents telegraph their cheating, such that cheating precision and recall actually go up over time for a monitor
... (read more)

I think we have two separate claims here:

  1. Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
  2. Do technologies often get displaced by technologies with a different lineage?

I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress fr... (read more)

5Thane Ruthenis
Indeed, and I'm glad we've converged on (2). But... ... On second thoughts, how did we get there? The initial disagreement was how plausible it was for incremental changes to the LLM architecture to transform it into a qualitatively different type of architecture. It's not about continuity-in-performance, it's about continuity-in-design-space. Whether finding an AGI-complete architecture would lead to a discontinuous advancement in capabilities, to FOOM/RSI/sharp left turn, is a completely different topic from how smoothly we should expect AI architectures' designs to change. And on that topic, (a) I'm not very interested in reference-class comparisons as opposed to direct gears-level modeling of this specific problem, (b) this is a bottomless rabbit hole/long-standing disagreement which I'm not interested in going into at this time. That's an interesting general pattern, if it checks out. Any guesses why that might be the case? My instinctive guess is the new-paradigm approaches tend to start out promising-in-theory, but initially very bad, people then tinker with prototypes, and the technology becomes commercially viable the moment it's at least marginally better than the previous-paradigm SOTA. Which is why there's an apparent performance-continuity despite a lineage/paradigm-discontinuity.

Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren't used.

Agree, this is one big limitation of the paper I'm working on at METR. The first two ideas you listed are things I would very much like to measure, and the third is something I would like to measure but is much harder than any current benchmark, given that university takes humans years rather than hours. If we measure it right, we could tell whether generalization is steadily improving or plateauing.

Though the fully-connected -> transformer transition wasn't infinitely many small steps, it definitely wasn't a single step. We had to invent various sub-innovations like skip connections separately, progressing from RNNs to LSTMs to GPT/BERT-style transformers to today's transformer++. The most you could claim as a single step is LSTM -> transformer.

Also if you graph perplexity over time, there's basically no discontinuity from introducing transformers, just a possible change in slope that might be an artifact of switching from the purple to green measurement method.... (read more)

7Thane Ruthenis
This argument still seems to postdict that cars were invented by tinkering with carriages and horse-breeding, spacecraft was invented by tinkering with planes, refrigerators were invented by tinkering with cold cellars, et cetera. If you take the snapshot of the best technology that does X at some time T, and trace its lineage, sure, you'll often see the procession of iterative improvements on some concepts and techniques. But that line won't necessarily pass through the best-at-X technologies at times from 0 to T - 1. The best personal transportation method were horses, then cars. Cars were invented by iterating on preceding technologies and putting them together; but horses weren't involved. Similar for the best technology at lifting a human being into the sky, the best technology for keeping food cold, etc. I expect that's the default way significant technological advances happen. They don't come from tinkering with the current-best-at-X tech. They come from putting together a bunch of insights from different or non-mainstream tech trees, and leveraging them for X in a novel way. And this is what I expect for AGI. It won't come from tinkering with LLMs, it'll come from a continuous-in-retrospect, surprising-in-advance contribution from some currently-disfavored line(s) of research. (Edit: I think what I would retract, though, is the point about there not being a continuous manifold of possible technological artefacts. I think something like "the space of ideas the human mind is capable of conceiving" is essentially it.)

A continuous manifold of possible technologies is not required for continuous progress. All that is needed is for there to be many possible sources of improvements that can accumulate, and for these improvements to be small once low-hanging fruit is exhausted.

Case in point: the nanogpt speedrun, where the training time of a small LLM was reduced by 15x using 21 distinct innovations which touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, ... (read more)

6johnswentworth
I think you should address Thane's concrete example: That seems to me a pretty damn solid knock-down counterargument. There were no continuous language model scaling laws before the transformer architecture, and not for lack of people trying to make language nets.
7p.b.
Because these benchmarks are all in the LLM paradigm: Single input, single output from a single distribution. Or they are multi-step problems on rails. Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren't used.
One could let models play new board/computer games against average humans: Video/image input, action output.
One could let models offer and complete tasks autonomously on freelancer platforms.
One could enrol models in remote universities and see whether they autonomously reach graduation.
It's not difficult to come up with hard benchmarks for current models (these are not close to AGI complete). I think people don't do this because they know that current models would be hopeless at benchmarks that actually aim for their shortcomings (agency, knowledge integration + integration of sensory information, continuous learning, reliability, ...)

I think eating the Sun is our destiny, both in that I expect it to happen and that I would be pretty sad if we didn't; I just hope it will be done ethically. This might seem like a strong statement, but bear with me.

Our civilization has undergone many shifts in values as higher tech levels have indicated the sheer impracticality of living a certain way, and I feel okay about most of these. You won't see many people nowadays who avoid being photographed because photos steal a piece of their soul. The prohibition on women working outside the home, common in m... (read more)

Will we ever have Poké Balls in real life? How fast could they be at storing and retrieving animals? Requirements:

  • Made of atoms, no teleportation or fantasy physics.
  • Small enough to be easily thrown, say under 5 inches diameter
  • Must be able to disassemble and reconstruct an animal as large as an elephant in a reasonable amount of time, say 5 minutes, and store its pattern digitally
  • Must reconstruct the animal to enough fidelity that its memories are intact and it's physically identical for most purposes, though maybe not quite to the cellular level
  • No external
... (read more)
5Gurkenglas
All we need to create is a Ditto. A blob of nanotech wouldn't need 5 seconds to take the shape of the surface of an elephant and start mimicking its behavior; is it good enough to optionally do the infilling later if it's convenient?

and yet, the richest person is still only responsible for 0.1%* of the economic output of the united states.

Musk only owns 0.1% of the economic output of the US but he is responsible for more than this, including large contributions to

  • Politics
  • Space
    • SpaceX is nearly 90% of global upmass
    • Dragon is the sole American spacecraft that can launch humans to ISS
    • Starlink probably enables far more economic activity than its revenue
    • Quality and quantity of US spy satellites (Starshield has ~tripled NRO satellite mass)
  • Startup culture through the many startups from ex-Spac
... (read more)
4Joseph Miller
The Duke of Wellington said that Napoleon's presence on a battlefield “was worth forty thousand men”. This would be about 4% of France's military size in 1812.
2leogao
i'm happy to grant that the 0.1% is just a fermi estimate and there's a +/- one OOM error bar around it. my point still basically stands even if it's 1%. i think there are also many factors in the other direction that just make it really hard to say whether 0.1% is an under or overestimate. for example, market capitalization is generally an overestimate of value when there are very large holders. tesla is also a bit of a meme stock so it's most likely trading above fundamental value. my guess is most things sold to the public sector probably produce less economic value per $ than something sold to the private sector, so profit overestimates value produced the sign on net economic value of his political advocacy seems very unclear to me. the answer depends strongly on some political beliefs that i don't feel like arguing out right now. it slightly complicates my analogy for elon to be both the richest person in the us and also possibly the most influential (or one of). in my comment i am mostly referring to economic-elon. you are possibly making some arguments about influentialness in general. the problem is that influentialness is harder to estimate. also, if we're talking about influentialness in general, we don't get to use the 0.1% ownership of economic output as a lower bound of influentialness. owning x% of economic output doesn't automatically give you x% of influentialness. (i think the majority of other extremely rich people are not nearly as influential as elon per $)

Agree that AI takeoff could likely be faster than our OODA loop.

There are four key differences between this and the current AI situation that I think makes this perspective pretty outdated:

  • AIs are made out of ML, so we have very fine-grained control over how we train them and modify them for deployment, unlike animals which have unpredictable biological drives and long feedback loops.
  • By now, AIs are obviously developing generalized capabilities. Rather than arguments over whether AIs will ever be superintelligent, the bulk of the discourse is over whether they will supercharge economic growth or cause massive job loss
... (read more)
2bgaesop
re: 1) I don't think we do have fine-grained control over the outcome of the training of LLMs and other ML systems, which is what really matters. See recent emergent self-preservation behavior.
re: 2) I'm saying that I think those arguments are distractions from the much more important one of x-risk. But sure, this metaphor doesn't address economic impact aside from "I think we could gain a lot from cooperating with them to hunt fish"
re: 3) I'm not sure I see the relevance. The unnamed audience member saying "I say we keep giving them nootropics" is meant to represent AI researchers who aren't actively involving themselves in the x-risk debate continuing to make progress on AI capabilities while the arguers talk to each other
re: 4) It sounds like you're comparing something like a log graph of human capability to a linear graph of AI capability. That is, I don't think that AI will take tens of thousands of years to develop the way human civilization did. My 50% confidence interval on when the Singularity will happen is 2026-2031, and my 95% confidence only extends to maybe 2100. I expect there to be more progress in AI development in 2025-2026 than in 1980-2020
7Noosphere89
Re continuous takeoff, you could argue that human takeoff was continuous, or only mildly discontinuous, just very fast, and at any rate it could well be discontinuous relative to your OODA loop/the variables you tracked, so unfortunately I think the continuity of the takeoff is less relevant than people thought (it does matter for alignment, but not for governance of AI): https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions#DHvagFyKb9hiwJKRC https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions#tXKnoooh6h8Cj8Tpx

"Random goals" is a crux. Complicated goals that we can't control well enough to prevent takeover are not necessarily uniformly random goals from whatever space you have in mind.

"Don't believe there is any chance" is very strong. If there is a viable way to bet on this I would be willing to bet at even odds that conditional on AI takeover, a few humans survive 100 years past AI takeover.

2Mikhail Samin
Do you think if an AI with random goals that doesn’t get acausally paid to preserve us takes over, then there’s a meaningful chance there will be some humans around in 100 years? What does it look like?

I'm working on the METR autonomy length graph mentioned here, and want to caveat these preliminary results. Basically, we think the effective horizon length of models is a bit shorter than 2 hours, although we do think there is an exponential increase that, if it continues, could mean month-long horizons within 3 years.

  • Our task suite is of well-defined tasks. We have preliminary data showing that messier tasks like the average SWE intellectual labor are harder for both models and low-context humans.
  • This graph is of 50% horizon time (the human time-to-comple
... (read more)
  • It's not clear whether agents will think in neuralese, maybe end-to-end RL in English is good enough for the next few years and CoT messages won't drift enough to allow steganography
  • Once agents think in either token gibberish or plain vectors maybe self-monitoring will still work fine. After all agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
  • I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.

I'm not happy about this but it seems basically priced in, so not much update on p(doom).

We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn't happen, it will be good news (although not very good news because web research seems like it doesn't require the full range of agency).

Likewise, if we observe models' monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.

Interesting times.

How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish high-dimensional vector language makes things harder, even if not impossible?

Thanks for the update! Let me attempt to convey why I think this post would have been better with fewer distinct points:

In retrospect, maybe I should've gone into explaining the basics of entropy and enthalpy in my reply, eg:

If you replied with this, I would have said something like "then what's wrong with the designs for diamond mechanosynthesis tooltips, which don't resemble enzymes and have been computationally simulated as you mentioned in point 9?" then we would have gone back and forth a few times until either (a) you make some complicated argument I... (read more)

0bhauth
That conflicts with eg: Anyway, I already answered that in 9. diamond.
1Daniel Tan
I don’t know! Seems hard It’s hard because you need to disentangle ‘interpretation power of method’ from ‘whether the model has anything that can be interpreted’, without any ground truth signal in the latter. Basically you need to be very confident that the interp method is good in order to make this claim. One way you might be able to demonstrate this, is if you trained / designed toy models that you knew had some underlying interpretable structure, and showed that your interpretation methods work there. But it seems hard to construct the toy models in a realistic way while also ensuring it has the structure you want - if we could do this we wouldn’t even need interpretability. Edit: Another method might be to show that models get more and more “uninterpretable” as you train them on more data. Ie define some metric of interpretability, like “ratio of monosemantic to polysemantic MLP neurons”, and measure this over the course of training history. This exact instantiation of the metric is probably bad but something like this could work

This doesn't seem wrong to me, so I'm now confused again about what the correct analysis is. It would come out the same way if we assume rationalists are selected on g, right?

Is a Gaussian prior correct though? I feel like it might be double-counting evidence somehow.

TLDR:

  • What OP calls "streetlighting", I call an efficient way to prioritize problems by tractability. This is only a problem insofar as we cannot also prioritize by relevance.
  • I think problematic streetlighting is largely due to incentives, not because people are not smart / technically skilled enough. Therefore solutions should fix incentives rather than just recruiting smarter people.

First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress... (read more)

Under log returns to money, personal savings still matter a lot for selfish preferences. Suppose the material comfort component of someone's utility is 0 utils at a consumption of $1/day. Then a moderately wealthy person consuming $1000/day today will be at 7 utils. The owner of a galaxy, at maybe $10^30 / day, will be at 69 utils, but doubling their resources will still add the same 0.69 utils it would for today's moderately wealthy person. So my guess is they will still try pretty hard at acquiring more resources, similarly to people in developed economies today who balk at their income being halved and see it as a pretty extreme sacrifice.
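Spelling out the implied utility function, which appears to be natural-log consumption (my reading of the numbers, since $\ln 1000 \approx 6.9$):

$$u(c) = \ln\!\frac{c}{\$1/\text{day}}, \qquad u(\$1000/\text{day}) \approx 6.9, \qquad u(\$10^{30}/\text{day}) \approx 69.1, \qquad u(2c) - u(c) = \ln 2 \approx 0.69.$$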

9Benjamin_Todd
True, though I think many people have the intuition that returns diminish faster than log (at least given current tech). For example, most people think increasing their income from $10k to $20k would do more for their material wellbeing than increasing it from $1bn to $2bn. I think the key issue is whether new tech makes it easier to buy huge amounts of utility, or that people want to satisfy other preferences beyond material wellbeing (which may have log or even close to linear returns).

I agree. You only multiply the SAT z-score by 0.8 if you're selecting people on high SAT score and estimating the IQ of that subpopulation, making a correction for regressional Goodhart. Rationalists are more likely selected for high g which causes both SAT and IQ, so the z-score should be around 2.42, which means the estimate should be (100 + 2.42 * 15 - 6) = 130.3. From the link, the exact values should depend on the correlations between g, IQ, and SAT score, but it seems unlikely that the correction factor is as low as 0.8.
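Spelling out the two estimates with the numbers in this thread (the 2.42 z-score and the $-6$ adjustment above):

$$\text{with the } 0.8 \text{ correction: } 100 + 0.8 \times 2.42 \times 15 - 6 \approx 123; \qquad \text{without it: } 100 + 2.42 \times 15 - 6 \approx 130.3.$$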

I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?

6John Hughes
Circuit breaking changes the exponent a bit if you compare it to the slope of Llama3 (Figure 3 of the paper). So, there is some evidence that circuit breaking does more than just shifting the intercept when exposed to best-of-N attacks. This contrasts with the adversarial training on attacks in the MSJ paper, which doesn't change the slope (we might see something similar if we do adversarial training with BoN attacks). Also, I expect that using input-output classifiers will change the slope significantly. Understanding how these slopes change with different defenses is the work we plan to do next!

It's likely possible to engineer away mutations just by checking. ECC memory already has an error rate nine orders of magnitude better than human DNA, and with better error correction you could probably get the error rate low enough that less than one error happens in the expected number of nanobots that will ever exist. ECC is not the kind of checking whose checking process can be disabled, because the memory module always processes raw bits into error-corrected bits; this fails unless the data matches its checksum, and a mutation that still matches the checksum can be made astronomically unlikely.
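A minimal sketch of the kind of check described, assuming a standard cryptographic hash stands in for the ECC/checksum machinery (the function and data names are my own, hypothetical):

```python
import hashlib

# Instructions are only decoded if they match a stored digest, so a mutation
# either still matches the digest (probability ~2^-256 for a random corruption)
# or renders the whole instruction block unusable rather than subtly wrong.
def load_instructions(blob: bytes, expected_digest: bytes) -> bytes:
    if hashlib.sha256(blob).digest() != expected_digest:
        raise ValueError("corrupted instructions: refusing to execute")
    return blob

genome = b"build arm; build sensor; replicate"
digest = hashlib.sha256(genome).digest()
print(load_instructions(genome, digest))   # intact copy loads fine
# load_instructions(genome[:-1] + b"f", digest)  # mutated copy raises ValueError
```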

3Knight Lee
You're very right! I didn't really think of that. I had the intuition that mutation is very hard to avoid since cancer is very hard to avoid, but maybe it's not that really accurate. Thinking a bit more, it does seem unlikely that a mutation can disable the checking process itself, if the checking process is well designed with checksums. One idea is that the meaning of each byte (or "base pair" in our DNA analogy), changes depending on the checksum of previous bytes. This way, if one byte is mutated, the meaning of every next byte changes (e.g. "hello" becomes "ifmmp"), rendering the entire string of instructions useless. The checking process itself cannot break in any way to compensate for this. It has to break in such a way that it won't update its checksum for this one new byte, but still will update its checksum for all other bytes, which is very unlikely. If it simply disables checksums, all bytes become illegible (like encryption). I use the word "byte" very abstractly—it could be any unit of information. And yes, error correction code could further improve things by allowing a few mutations to get corrected without making the nanobot self destruct. It's still possible the hierarchical idea in my post has advantages over checksums. It theoretically only slows down self replication when a nanobot retrieves its instructions the first time, not when a nanobot uses its instructions. Maybe a compromise is that there is only one level of master nanobots, and they are allowed to replicate the master copy given that they use checksums. But they still use these master copies to install simple copies in other nanobots which do not need checksums. I admit, maybe a slight difference in self replication efficiency doesn't matter. Exponential growth might be so fast that over-engineering the self replication speed is a waste of time. Choosing a simpler system that can be engineered and set up sooner might be wiser. I agree that the hierarchical idea (and any master c

I was expecting some math. Maybe something about the expected amount of work you can get out of an AI before it coups you, if you assume the number of actions required to coup is n, the trusted monitor has false positive rate p, etc?
