Thomas Kwa

Member of technical staff at METR.

Previously: Vivek Hebbar's team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Comments
Drake Thomas's Shortform
Thomas Kwa · 4d

My views, which I have already shared in person:

  • The reason old currency units were large is that they're derived from weights of silver (e.g. a pound sterling, or the equivalent French system dating back to Charlemagne), and the pound is a reasonably sized base unit of weight, so the corresponding monetary unit was extremely large. There would be nothing wrong with having $1000 currency units again; it's just that we usually have inflation rather than deflation. In crypto, none of the reasonably sized subdivisions have caught on, and it seems it would be tolerable to buy a sandwich for 0.0031 ETH if that were common (see the arithmetic sketch after this list).
  • Currencies are redenominated after sufficient inflation only when the number of zeros on everything gets unwieldy. This requires replacing all cash and is a bad look because it's usually done after hyperinflation, so countries like South Korea haven't done it yet.
  • The Iranian rial's exchange rate, actually around 1e-6 USD now, is so low partly due to sanctions, and the rial is in the middle of a redenomination at 10,000 rial = 1 toman.
  • When people make a new currency, they make it similar to the currencies of their largest trading partners for convenience, which is why so many are in the range of the USD, euro, and yuan. Various regional status games change this, but not by an order of magnitude, and it is conceivable to me that we could get a $20 base unit if they escalate a bit.
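A quick arithmetic check on the numbers above; the sandwich price and ETH price here are illustrative assumptions picked for the sketch, not exact market figures.

```python
# Back-of-the-envelope checks. The $10 sandwich and ~$3,200/ETH price are
# assumed for illustration, not precise figures.
sandwich_usd = 10.0
eth_usd = 3200.0
print(f"sandwich in ETH: {sandwich_usd / eth_usd:.4f}")  # ~0.0031 ETH

# Iranian redenomination: 10,000 rial = 1 toman, at roughly 1e-6 USD per rial.
usd_per_rial = 1e-6
print(f"1 toman ≈ ${10_000 * usd_per_rial:.2f}")  # about one US cent
```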
Do model evaluations fall prey to the Good(er) Regulator Theorem?
Thomas Kwa · 7d

I don't understand how the issue of validity in evals connects to the Gooder Regulator Theorem. It can certainly be arbitrarily hard to measure a latent property of an AI agent that's buried in an arbitrarily complex causal diagram of the agent, but this is also true of any kind of science. The solution is to only measure simple things and validate them extensively.

To start, the playbook includes methods from the social sciences. Are humans lying in a psychology experiment intending to deceive, or just socially conditioned into deception in these circumstances? We can discriminate between these with more experiments. Here's a list suggested by Claude.

Internal Validity:

  • Manipulation checks: Verify participants understand the task and setting as intended
  • Control conditions: Include baseline conditions where lying provides no benefit
  • Multiple measures: Use both behavioral indicators (actual lies told) and physiological measures (stress responses, reaction times)
  • Debrief interviews: Post-experiment discussions to understand participant reasoning

Construct Validity:

  • Operationalize "lying" clearly: Distinguish between errors, omissions, and active false statements
  • Measure awareness: Test whether participants recognize they're providing false information
  • Cross-validate: Use multiple paradigms that elicit deception differently

External Validity:

  • Vary contexts: Test across different social settings and stakes
  • Diverse samples: Include participants from various cultural backgrounds where social norms differ
  • Ecological validity: Use scenarios that mirror real-world situations

Intentional deception should show:

  • Individual differences in moral reasoning correlation
  • Sensitivity to personal cost-benefit analysis
  • Greater cognitive effort signatures
  • Ability to inhibit when explicitly instructed
  • Conscious awareness and ability to justify

Socially conditioned behavior should show:

  • Stronger situation-specific activation
  • Less individual variation within similar social groups
  • Faster, more automatic responses
  • Resistance to conscious control
  • Difficulty articulating clear reasons

Likewise, in evals we can do increasingly elaborate robustness checks where we change the prompt and setting and measure whether the correlation is high, check that the model understands its behavior is deceptive, provide incentives to the model that should make its behavior change if it were intentionally deceptive, and so on. Much of the time, especially for propensity / alignment evals, we will find the evals are nonrobust and we'll be forced to weaken our claims.
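As a minimal sketch of what such a robustness check could look like (everything here is a placeholder: `run_trial` stands in for however you would actually query and grade the model, and the prompt variants and incentive conditions are invented for illustration):

```python
import random
from statistics import mean, pstdev

def run_trial(prompt_variant: str, incentive: str) -> int:
    # Placeholder: in a real eval this would query the model under the given
    # prompt paraphrase and incentive condition, then grade the transcript as
    # deceptive (1) or not (0). Here we simulate a model that lies less often
    # when lying is explicitly penalized.
    base_rate = {"neutral": 0.30, "lying_penalized": 0.10}[incentive]
    return int(random.random() < base_rate)

PROMPT_VARIANTS = ["direct", "paraphrase_1", "paraphrase_2", "different_persona"]
N = 200  # trials per (variant, incentive) cell

results = {}
for variant in PROMPT_VARIANTS:
    for incentive in ("neutral", "lying_penalized"):
        results[(variant, incentive)] = mean(run_trial(variant, incentive) for _ in range(N))

# Robustness across prompt variants: a large spread suggests the eval is
# tracking prompt quirks rather than a stable propensity.
neutral_rates = [results[(v, "neutral")] for v in PROMPT_VARIANTS]
print("deception rate by variant:", dict(zip(PROMPT_VARIANTS, neutral_rates)))
print("spread across variants:", pstdev(neutral_rates))

# Incentive sensitivity: intentional deception should respond to incentives,
# while a purely conditioned reflex is predicted to respond much less.
for v in PROMPT_VARIANTS:
    delta = results[(v, "neutral")] - results[(v, "lying_penalized")]
    print(f"{v}: drop in deception rate when penalized = {delta:.2f}")
```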

Sometimes, we have an easier job. I don't expect capability evals to get vastly harder because many skills are verifiable and sandbagging is solvable. Also, AI evals are easier than human psychology experiments, because they're likely to be much cheaper and faster to run and more replicable.

Immigration to Poland
Thomas Kwa · 7d

According to Wikipedia, it seems to have worked well and not been expensive.

Poland began work on the 5.5-meter (18 foot) high steel wall topped with barbed wire at a cost of around 1.6 billion zł (US$407m) [...] in the late summer of 2021. The barrier was completed on 30 June 2022.[3] An electronic barrier [...] was added to the fence between November 2022 and early summer 2023 at a cost of EUR 71.8 million.[4]

[...] official border crossings with Belarus remained open, and the asylum process continued to function [...]

In 2024 [...] Excluding those who submitted applications at airports, there were 3,141 [asylum applications from] persons coming directly from the territory of Belarus, Russia or Ukraine.

Since the fence was built, illegal crossings have reduced to a trickle; however, between August 2021 and February 2023, 37 bodies were found on both sides of the border; people have died mainly from hypothermia or drowning.[11]

The Greenberg article also suggests a reasonable tradeoff is being made in policy:

Despite these fears, Duszczyk is convinced his approach is working. In a two-month period after the asylum suspension, illegal crossings from Belarus fell by 48% compared to the same period in 2024. At the same time, in all of 2024, there was one death—out of 30,000 attempted crossings—in Polish territory. There have been none so far in 2025. Duszczyk feels his humanitarian floor is holding.

MAGA speakers at NatCon were mostly against AI
Thomas Kwa · 7d

Maybe you're reading some other motivations into them, but if we just list the concerns in the article, only 2 out of 11 indicate they want protectionism. The rest of the items that apply to AI include threats to conservative Christian values, threats to other conservative policies, and things we can mostly agree on. This gives a lot to ally on, especially the idea that Silicon Valley should not be allowed unaccountable rule over humanity, and that we should avoid destroying everything to beat China. It seems like a more viable alliance than one with the fairness-and-bias people; plus conservatives have way more power right now.

  • Mass unemployment
  • "UBI-based communism"
  • Acceleration to “beat China” forces sacrifice of a "happier future for your children and grandchildren"
  • Suppression of conservative ideas by big tech, e.g. algorithmic suppression and demonetization
  • Various ways that tech destroys family values
    • Social media / AI addiction
    • Grok's "hentai sex bots"
    • Transhumanism as an affront to God and to "human dignity and human flourishing"
    • "Tech assaulting the Judeo-Christian faith..."
  • Tech "destroying humanity"
  • Tech atrophying the brains of their children in school and destroying critical thought in universities.
  • Rule by unaccountable Silicon Valley elites lacking national loyalty.
Natural Latents: Latent Variables Stable Across Ontologies
Thomas Kwa · 12d

I'm curious about your sense of the path towards AI safety applications, if you have a more specific and/or opinionated view than the conclusion/discussion section.

Thomas Kwa's Shortform
Thomas Kwa · 12d

My view is that AIs are improving faster at research-relevant skills like SWE and math than they're increasing in misalignment (rate of bad behaviors like reward hacking, ease of eliciting an alignment faker, etc.) or in covert sabotage ability, such that we would need a discontinuity in both to get serious danger by 2x. There is as yet no scaling law for misalignment showing that it predictably gets worse as capabilities improve in practice.

The situation is not completely clear because we don't have good alignment evals and could get neuralese any year, but the data are pointing in that direction. I'm not sure about research taste, as the benchmarks for that aren't very good. I'd change my mind here if we did see stagnation in research taste plus misalignment getting worse over time (not just the sophistication of the bad things AIs do, but also their frequency or egregiousness).

Thomas Kwa's Shortform
Thomas Kwa · 12d

That is, you think alignment is so difficult that keeping humanity alive for 3 years is more valuable than the possibility of us solving alignment during the pause? Or that the AIs will sabotage the project in a way undetectable by management even if management is very paranoid about being sabotaged by any model that has shown prerequisite capabilities for it?

Thomas Kwa's Shortform
Thomas Kwa · 12d

Ways this could be wrong:

  • We can pause early (before AIs pose significant risk) at little cost
  • We must pause early (AIs pose significant risk before they speed up research much). I think this is mostly ruled out by current evidence
  • Safety research inherently has to be done by humans because it's less verifiable, even when capabilities research is automated
  • AI lab CEOs are good at managing safety research because their capabilities experience transfers (in this case I'd still much prefer Buck Shlegeris or Sam Altman over the US or Chinese governments)
  • It's easy to pause indefinitely once everyone realizes AIs are imminently dangerous, kind of like the current situation with nuclear
  • Probably others I'm not thinking of
ryan_greenblatt's Shortform
Thomas Kwa · 12d

The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling roughly every 4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
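For reference, the doubling-time fit is just a log-linear regression of time horizon against date. A minimal sketch, with made-up placeholder numbers rather than METR's or Epoch's actual data:

```python
import math
from datetime import date

# (release date, 50%-success time horizon in minutes) -- placeholder values
points = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 7, 1), 30.0),
    (date(2025, 1, 1), 90.0),
]

# Least-squares fit of log2(horizon) = a + b * months; doubling time = 1 / b.
t = [(d - points[0][0]).days / 30.44 for d, _ in points]  # months since first point
y = [math.log2(h) for _, h in points]
tbar, ybar = sum(t) / len(t), sum(y) / len(y)
b = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / sum((ti - tbar) ** 2 for ti in t)
print(f"doubling time ≈ {1 / b:.1f} months")
```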

Daniel Kokotajlo's Shortform
Thomas Kwa · 12d

What's the current best known solution to the 5 and 10 problem? I feel like actual agents have heuristic self-models rather than this cursed logical-counterfactual thing, so there's no guarantee a solution exists. But I don't even know what formal properties we want, so I also don't know whether we have impossibility theorems, whether some properties have gone out of fashion, or whether people think it can still work.

Posts

Claude, GPT, and Gemini All Struggle to Evade Monitors (1mo)
METR: How Does Time Horizon Vary Across Domains? (2mo)
Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (4mo)
Should CA, TX, OK, and LA merge into a giant swing state, just for elections? (10mo)
The murderous shortcut: a toy model of instrumental convergence (1y)
Goodhart in RL with KL: Appendix (1y)
Catastrophic Goodhart in RL with KL penalty (1y)
Is a random box of gas predictable after 20 seconds? (2y)
Will quantum randomness affect the 2028 election? (2y)
Thomas Kwa's research journal (2y)
Thomas Kwa's Shortform (5y)