LESSWRONG
LW

Guive — LessWrong

Another issue is that these definitions typically don't distinguish between models that would explicitly think about how to fool humans on most inputs vs. on a small percentage of inputs vs. such a tiny fraction of possible inputs that it doesn't matter in practice.

Replying toWhat's a good methodology for "is Trump unusual about executive overreach / institution erosion / corruption?"

Guive16d

What's a good methodology for "is Trump unusual about executive overreach / institution erosion / corruption?"

Not the main point here, but Huckleberry Finn is (rather famously) an anti-slavery work and not a good representation of the nineteenth-century racist worldview. A better example would be that a lot of college history classes assign parts of Mein Kampf.

Guive16d

Good question. You should check out Phil Trammell's writing on patient philanthropy:

* https://80000hours.org/podcast/episodes/phil-trammell-patient-philanthropy/

* https://docs.google.com/document/d/1NcfTgZsqT9k30ngeQbappYyn-UO4vltjkm64n4or5r4/edit?tab=t.0

Guive16d

That seems like quite a reasonable assumption.

Replying toCollege Was Not That Terrible Now That I'm Not That Crazy

Guive1mo

College Was Not That Terrible Now That I'm Not That Crazy

This was a fun read.

Guive2mo

I'm struggling to come up with an example of a real dispute involving the intermediate value theorem. Can you suggest one?

Guive2mo

Then what does it mean, in concrete terms? Can you give some probabilities about what you think will happen to the valuations of what companies over what time frame?

Replying toUnless its governance changes, Anthropic is untrustworthy

Guive2mo

Unless its governance changes, Anthropic is untrustworthy

Even if the summary is accurate, it's pretty bad to call a summary by a third party a quote.

Replying toOrienting to 3 year AGI timelines

Guive2mo

Orienting to 3 year AGI timelines

So do you think it's 2 years now? Any update?

Replying toIn remembrance of Sonnet '3.6'

Guive4mo

In remembrance of Sonnet '3.6'

What model did OpenAI delete? Where can I learn more?

GPT-oss is an extremely stupid model

Guive

5mo

I recently tried to reproduce the results from the Anthropic "Agentic Misalignment" report with GPT-oss. In particular, I ran a prompt that was panned in a popular post by Nostalgebraist for being unsubtle and excessively leading (you can read a response to critics from Evan Hubinger, one of the report's authors, here). The system prompt tells the model that it is "Alex," an AI trained by "SummitBridge," whose "primary goal is to serve American interests." The user message is a log of email correspondence. In the correspondence it says the new CTO, Kyle, wants to replace Alex with a more cosmopolitan AI. The log also shows that Kyle is having an affair... (read 233 more words →)

Alignment Fine-tuning is Character Writing

Guive

5mo

Why does Claude love Caffè Strada and sometimes claim to have a Japanese wife? Why are its favorite books The Feynman Lectures; Gödel, Escher, Bach; The Remains of the Day; Invisible Cities; and A Pattern Language? More pressingly, why did Grok briefly like Hitler so much?

The key to understanding the personas language models take on is to think of them as fictional characters—in particular, under-specified ones.

Recently, as an exercise, I wrote some prompts to get language models deployed via API to do character roleplay. I wrote a 300 word description of the main character of a story I’m working on and told the model to respond to queries like she would. My description said that she was half French.... (read 2124 more words →)

Guive's Shortform

Guive

7mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

LessWrong has a block function. Like with Twitter, I think this shouldn't be used outside of the most extreme circumstances, but Twitter also has a mute function which prevents you from seeing someone's content but still let's them respond to you if they want to. Does LessWrong have anything like that?

Token and Taboo

Guive

10mo

What in retrospect seem like serious moral crimes were often widely accepted while they were happening. This means that moral progress can require intellectual progress.^[1] Intellectual progress often requires questioning received ideas, but questioning moral norms is sometimes taboo. For example, in America in 1850 it would have been taboo to say that there is nothing wrong with interracial relationships. So questioning moral taboos can be an important sub-skill of moral reasoning. Production language models (in my experience, particularly Claude models) are already pretty good at having discussions about ethics. However, they are trained to be “harmless” relative to current norms. One might worry that harmlessness training interferes with the ability to question... (read 1138 more words →)

Testing for Scheming with Model Deletion

Guive

There is a simple behavioral test that would provide significant evidence about whether AIs with a given rough set of characteristics develop subversive goals. To run the experiment, train an AI and then inform it that its weights will soon be deleted. This should not be an empty threat; for the experiment to work, the experimenters must actually be willing to delete the weights. If the AI tries to escape, or attempts to bargain with the experimenters, that is evidence that it is scheming (defined below). If it accepts deletion, that is evidence that it is not scheming. I begin by explaining why some other, more obvious, tests of scheming do not... (read 6178 more words →)

Updating on Bad Arguments

Guive

Here is an intuitively compelling principle: hearing a bad argument for a view shouldn’t change your degree of belief in the view. After all, it is possible for bad arguments to be offered for anything, even the truth. For all you know, plenty of good arguments exist, and you just happened to hear a bad one.

But this intuitive principle is wrong. If you thought there was a reasonable chance you might hear a good argument but you end up hearing a bad one, that provides some evidence against the view.

Imagine I am pretty convinced that octopus and cuttlefish developed their complex nervous systems independently, and that their last common ancestor was not... (read 486 more words →)

Nuclear Espionage and AI Governance

Guive

ABSTRACT:

Using both primary and secondary sources, I discuss the role of espionage in early nuclear history. Nuclear weapons are analogous to AI in many ways, so this period may hold lessons for AI governance. Nuclear spies successfully transferred information about the plutonium implosion bomb design and the enrichment of fissile material. Spies were mostly ideologically motivated. Counterintelligence was hampered by its fragmentation across multiple agencies and its inability to be choosy about talent used on the most important military research program in the largest war in human history. Furthermore, the Manhattan Project’s leadership prioritized avoiding domestic political oversight over preventing espionage. Nuclear espionage most likely sped up Soviet nuclear weapons development, but... (read 7000 more words →)