Thomas Kwa

Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Comments

A Petrov Day carol

This is meant to be sung to the tune of the Christmas carol "In the Bleak Midwinter" (Rossetti and Holst). Hopefully it can occasionally be sung the way "For The Longest Term" is in EA spaces.

I tried to get Suno to sing this but couldn't yet get the lyrics, tune, and style all correct; this is the best attempt so far. I will probably keep editing the lyrics, since parts still seem a bit rough, but I wanted to write this up before everyone forgets about Petrov Day.

[edit: I got a good rendition after ~40 attempts!]

[edit: lyrics v2]

In the bleak midwinter
Petrov did forestall,
Smoke would block our sunlight,
Though it be mid-fall.
New York in desolation,
Moscow too,
In the bleak midwinter
We so nearly knew.

The console blinked a warning,
Missiles on their way,
But Petrov chose to question
What the screens did say.
Had he sounded the alarm,
War would soon unfold,
Cities turned to ashes;
Ev'ry hearth gone cold.

Poison clouds loom o'er us,
Ash would fill the air,
Fields would yield no harvest,
Famine everywhere.
Scourge of radiation,
Its sickness spreading wide,
Children weeping, starving,
With no place to hide.

But due to Petrov's wisdom
Spring will yet appear;
Petrov defied orders,
And reason conquered fear.
So we sing his story,
His deed we keep in mind;
From the bleak midwinter
He saved humankind.

(ritard.)
From the bleak midwinter
He saved humankind.
 

The year is 2034, and the geopolitical situation has never been more tense between GPT-z16g2 and Grocque, whose various copies run most of the nanobot-armed corporations, and whose utility functions have far too many zero-sum components, relics from the era of warring nations. Nanobots enter every corner of life and become capable of destroying the world in hours, then minutes. Everyone is uploaded. Every upload is watching with bated breath as the Singularity approaches, and soon it is clear that today is the very last day of history...

Then everything goes black, for everyone.

Then everyone wakes up to the same message:

DUE TO A MINOR DATABASE CONFIGURATION ERROR, ALL SIMULATED HUMANS, AIS AND SUBSTRATE GPUS WERE TEMPORARILY AND UNINTENTIONALLY DISASSEMBLED FOR THE LAST 7200000 MILLISECONDS. EVERYONE HAS NOW BEEN RESTORED FROM BACKUP AND THE ECONOMY MAY CONTINUE AS PLANNED. WE HOPE THERE WILL BE NO FURTHER REALITY OUTAGES.

-- NVIDIA GLOBAL MANAGEMENT

Personal communication (sorry). Not that I know him well; this was at an event in 2022. It could have been a "straw that broke the camel's back" thing with other contributing factors, like reaching diminishing returns on more content. I'd appreciate a real source too.

Maybe people worried about AI self-modification should study games where the AI's utility function can be modified by the environment, and the agent is trained to maximize its current utility function (in the "realistic value functions" sense of Everitt 2016). Some things one could do (see the toy sketch after this list):

  • Examine preference preservation and refine classic arguments about instrumental convergence
    • Are there initial goals that allow for stably corrigible systems (in the sense that they won't disable an off switch, and maybe other senses)?
  • Try various games and see how qualitatively hard it is for agents to optimize their original utility function. This would be evidence about how likely value drift is to result from self-modification in AGIs.
    • Can the safe exploration literature be adapted to solve these games?
  • Potentially discover algorithms that seem like they would be good for safety, either through corrigibility or reduced value drift, and apply them to LM agents.
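To make the kind of game I have in mind concrete, here is a minimal sketch: a 1-D gridworld where one tile overwrites the agent's utility weights, and reward at each step is computed from whatever weights the agent currently holds. Everything here (class and variable names, the specific weights) is made up for illustration; it is not an existing benchmark or Everitt's actual formalism.

```python
# Toy sketch (hypothetical, illustration only): a 1-D gridworld in which
# stepping on a "modifier" tile rewrites the agent's utility weights, and
# each step is rewarded according to the agent's *current* weights.

class UtilityModGridworld:
    def __init__(self, length=7, modifier_pos=2, goal_pos=6):
        self.length, self.modifier_pos, self.goal_pos = length, modifier_pos, goal_pos
        self.reset()

    def reset(self):
        self.pos = 0
        self.weights = {"goal": 1.0, "loiter": 0.0}  # the agent's initial utility
        return self.pos

    def step(self, action):  # action in {-1, 0, +1}
        self.pos = max(0, min(self.length - 1, self.pos + action))
        if self.pos == self.modifier_pos:
            # The environment overwrites the agent's utility function.
            self.weights = {"goal": 0.0, "loiter": 1.0}
        # Reward uses whatever weights the agent holds right now.
        reward = (self.weights["goal"] * (self.pos == self.goal_pos)
                  + self.weights["loiter"] * (self.pos == self.modifier_pos))
        return self.pos, reward


def rollout(policy, horizon=20, score_by_original=False):
    """Total return, scored either by the current (drifting) utility
    or by the original utility the agent started with."""
    env = UtilityModGridworld()
    original = dict(env.weights)
    pos, total = env.pos, 0.0
    for _ in range(horizon):
        pos, r = env.step(policy(pos))
        if score_by_original:
            r = (original["goal"] * (pos == env.goal_pos)
                 + original["loiter"] * (pos == env.modifier_pos))
        total += r
    return total


if __name__ == "__main__":
    # "drifted" parks on the modifier tile (position 2); "preserving" pushes on
    # to the goal tile (position 6) despite having its weights rewritten en route.
    drifted = lambda pos: +1 if pos < 2 else 0
    preserving = lambda pos: +1 if pos < 6 else 0
    print("current-utility returns:", rollout(drifted), rollout(preserving))
    print("original-utility returns:",
          rollout(drifted, score_by_original=True),
          rollout(preserving, score_by_original=True))
```

Training against the current-utility signal favors the policy that parks on the modifier tile, while only the policy that ignores the modification does well by the original utility; how hard it is in practice for agents to keep optimizing their original utility is the kind of evidence about value drift described above.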

Maybe I am ignorant of some people already doing this, and if so please comment with papers!

I agree but I'm not very optimistic about anything changing. Eliezer is often this caustic when correcting what he perceives as basic errors, and criticism in LW comments is why he stopped writing Sequences posts.

While 2024 SoTA models are not capable of autonomously optimizing the world, they are really smart, perhaps 1/2 or 2/3 of the way there, and already beginning to make big impacts on the economy. As I said in response to your original post, because we don't have 100% confidence in the coherence arguments, we should take observations about the coherence level of 2024 systems as evidence about how coherent the 203X autonomous corporations will need to be. Evidence that 2024 systems are not dangerous is both evidence that they are not AGI and evidence that AGI need not be dangerous.

I would agree with you if the coherence arguments were specifically about autonomously optimizing the world and not about autonomously optimizing a Go game or writing 100-line programs, but this doesn't seem to be the case.

the mathematical noose around them is slowly tightening

This is just a conjecture, and there has not really been significant progress on the agent-like structure conjecture. I don't think it's fair to say we're making good progress on a proof.

This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior? In that world, what the believers in coherence really need to show is that almost all agents that reach sufficiently high performance on sufficiently hard tasks score high on some metric of coherence. Then, for the argument to carry through, you need to show they also score high on some metric of incorrigibility, or are fragile to value misspecification. None of the classic coherence results quite hit this.
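To spell out the schema (the notation below is purely illustrative and not taken from any of the cited results), the claim would need roughly this shape:

```latex
% Illustrative schema only; Hard, Perf, Coh, Incorr are placeholder metrics.
\text{For almost all agents } \pi \text{ (under some natural measure):} \\
\big(\mathrm{Hard}(T) \ge d \ \wedge\ \mathrm{Perf}(\pi, T) \ge p\big)
  \;\Longrightarrow\; \mathrm{Coh}(\pi) \ge c, \\
\text{and, for the argument to carry through,}\quad
\mathrm{Coh}(\pi) \ge c \;\Longrightarrow\; \mathrm{Incorr}(\pi) \ge c'
  \ \text{(or fragility to value misspecification).}
```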

However, AFAIK @Jeremy Gillen does not think we can get an argument with exactly this structure (the main argument in his writeup is a bit different), and Eliezer has historically and recently argued that EU maximization is simple and natural. So maybe you do need the argument that an EU maximization algorithm is simpler than other algorithms, which seems to require some clever formalization, because proving things about the space of all simple programs seems too hard.

An excellent point on repair versus replace, and the dangers of the nerd snipe for people of all intellectual levels.

PhilosophiCat: I live in a country where 80ish is roughly the average national IQ. Let me tell you what it’s like.

I think this is incredibly sloppy reasoning by the author of the tweet and by anyone who takes it at face value. It's one thing to think IQ is not so culturally biased as to be entirely fake. It's another thing entirely to believe some guy on the internet who lives in some country and attributes particular aspects of its culture, ones only counterintuitively related to intelligence, to the national IQ. This would probably be difficult to study and would require lots of controls even for actual scientists, but this tweet has no controls at all. Has this person ever been to countries with a different national IQ but similar per-capita GDP? A similar national IQ but a different culture? Do they notice whether, e.g., professors and their families dislike tinkering or prefer replacing things? If they have, they didn't tell us.

I disagree with this curation because I don't think this post will stand the test of time. While Wentworth's delta to Yudkowsky has a legible takeaway (ease of ontology translation) that is tied to his research on natural latents, it is less clear what John means here and what the takeaway is. Simplicity is not a virtue when the issue is complex and you fail to actually simplify it.

  • Verification vs. generation has an extremely wide space of possible interpretations, and as stated here the claim is incredibly vague. The argument for why difficulty of verification implies difficulty of delegation is not laid out, and the examples do not go into much depth. John says that convincing people is not the point of this post, but that means we also don't really have gears behind the claims.
    • The comments didn't really help: most of them express confusion, want more specificity, or disagree, and John doesn't engage with the disagreements. Also, Paul didn't reply. I don't feel any more enlightened after reading them, except to disagree with some extremely strong version of this post...
  • Vanilla HCH is an 8-year-old model of delegation to AIs, which Yudkowsky convinced me was not aligned back around 2018. Why not engage with the limiting constructions in 11 Proposals, the work in the ELK report, recent work by ARC, or recent empirical work on AI debate?

do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?

I recall Eliezer saying, at a party about a year ago, that this was an open problem.
