This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?

The way I imagine this working is that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent the agent from properly modelling other agents that aren't trained on this, but it's not obvious to me that this would happen, or that it would be bad enough to outweigh the positives.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Siebe5d70

This makes me wonder whether "evil personas" can be entirely eliminated from distilled models by including positive/aligned intent labels/traces throughout the whole distillation dataset.

The Failed Strategy of Artificial Intelligence Doomers

Siebe1mo11

Seems to me the name AI safety is still widely used, no? It covers much more than just alignment strategies, since it also includes things like control and governance.

The Failed Strategy of Artificial Intelligence Doomers

Siebe1mo1-4

> The AI Doomers are only one of several factions that oppose AI and seek to cripple it via weaponized regulation.

Bad faith.

> There are also factions concerned about “misinformation” and “algorithmic bias,” which in practice means they think chatbots must be censored to prevent them from saying anything politically inconvenient.

Bad faith.

> AI Doomer coalition abandoned the name “AI safety” and rebranded itself to “AI alignment.”

Seems wrong.

Ten people on the inside

Siebe1mo80

What about whistle-blowing and anonymous leaking? Seems like it would go well together with concrete evidence of risk.

Training on Documents About Reward Hacking Induces Reward Hacking

Siebe2mo54

This is very interesting, and I had a recent thought that's very similar:

This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?

The way I imagine this working is that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent the agent from properly modelling other agents that aren't trained on this, but it's not obvious to me that this would happen, or that it would be bad enough to outweigh the positives.

I imagine that the ratio of descriptions of desirable behavior to descriptions of undesirable behavior would matter. Perhaps an ideal approach would both (massively) increase the number of descriptions of desirable behavior and filter out the descriptions of unwanted behavior?

Siebe's Shortform

Siebe2mo42

Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward

Siebe's Shortform

Siebe2mo30

I think it might make sense to do it as a research project first? Though you would need to be able to train a model from scratch.

meemi's Shortform

Siebe2mo4224

I think you should publicly commit to:

full transparency about any funding from for-profit organisations, including nonprofits affiliated with for-profit ones
no access to the benchmarks for any company
no NDAs around this stuff

If you currently have any of these in place for the computer use benchmark in development, you should seriously try to get out of those contractual obligations.

Ideally, you commit to these in a legally binding way, which would make them non-negotiable in any negotiation and make you more credible to outsiders.

Jonathan Claybrough's Shortform

Siebe2mo1412

> I don't think that all media produced by AI risk concerned people needs to mention that AI risk is a big deal - that just seems annoying and preachy. I see Epoch's impact story as informing people of where AI is likely to go and what's likely to happen, and this works fine even if they don't explicitly discuss AI risk

I don't think that every podcast episode should mention AI risk, but it would be pretty weird in my eyes to never mention it. Listeners would understandably infer that "these well-informed people apparently don't really worry much, maybe I shouldn't worry much either". I think rationalists easily underestimate how much other people's beliefs depend on what the people around them and their authority figures believe.

I think they have a strong platform to discuss risks occasionally. It also simply feels like part of "where AI is likely to go and what's likely to happen".