niplav

I operate by Crocker's rules.

Website.

Sequences

Acausal Trade

Comments

niplav · 70

A very related experiment is described in Yudkowsky 2017, and I think one doesn't even need LLMs for this—I started playing with an extremely simple RL agent trained on my laptop, but then got distracted by other stuff before achieving any relevant results. This method of training an agent to be "suspicious" of excessively high rewards would also pair well with model expansion: train the reward-hacking-suspicion circuitry fairly early, so the model can't sandbag it, and lay traps for reward hacking again and again during the gradual expansion process.
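To gesture at what I mean by "suspicious of excessively high rewards": a minimal sketch, where the agent keeps running statistics of the rewards it has seen and flags anything far above the historical mean as probable reward hacking. All names and thresholds here are my own invention for illustration, not from Yudkowsky 2017 or any real setup, and a real version would of course learn this circuitry rather than hard-code a z-score rule.

```python
class SuspiciousAgent:
    """Toy sketch: flag rewards that are anomalously high relative to history."""

    def __init__(self, z_threshold=3.0, min_history=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
        self.z_threshold = z_threshold
        self.min_history = min_history

    def observe(self, reward):
        """Return True if the reward looks suspicious, then update statistics."""
        suspicious = False
        if self.n >= self.min_history:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and (reward - self.mean) / std > self.z_threshold:
                suspicious = True
        # Welford's online update of mean and sum of squared deviations
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)
        return suspicious


agent = SuspiciousAgent()
for r in [1.0, 1.1] * 10:  # 20 ordinary rewards
    agent.observe(r)
print(agent.observe(1.05))   # ordinary reward → False
print(agent.observe(100.0))  # anomalously high reward → True
```

The trap-laying version would then, e.g., penalize the agent whenever it takes the too-good-to-be-true reward instead of flagging it.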

niplav · 90

Thank you for running the competition! It made me use & appreciate Squiggle more, and I expect that a bunch of my estimation workflows in the future will involve generating and then tweaking an AI-generated Squiggle model.

niplav* · 20

My best guess is that the intended reading is "90% of the code at Anthropic", not in the world at large—if I remember the context correctly, that was the option that made the most sense. (I was confused about this at first, and the original context is not clear on whether the claim is about the world at large or about Anthropic specifically.)

niplav · 20

Link in the first line of the post probably should also be https://www.nationalsecurity.ai/.

niplav · 83

Looked unlikely to me, given that the person most publicly associated with MIRI is openly & loudly advocating for funding this kind of work. But maybe the association isn't as strong as I think.

niplav · 53

Great post, thank you. Ideas (to also mitigate extremely engaging/addictive outputs in long conversations):

  • Don't look at the output of the large model directly; instead give it to a smaller model and let the smaller model rephrase it.
    • I don't think there's useful software for this yet, though building it might not be that hard? Could be a browser extension. A to-do for me, I guess.
  • Don't use character.ai and similar sites. Allegedly, users spend on average two hours a day talking on there (though I find that number hard to believe). If I had to guess, they're fine-tuning models to be engaging to talk to, maybe even doing RL based on conversation length. (If they're not doing it yet, a competitor might, or they might in the future.)
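The first idea above is just a wrapper around two model calls, so the interface is easy to sketch. Everything below is hypothetical—`large_model` and `small_model` are stubs standing in for whatever inference calls one would actually use (a frontier API and a small local model, say); the only point is that the large model's raw text never reaches the user.

```python
def large_model(prompt):
    """Stub for the capable-but-potentially-too-engaging large model."""
    return "An extremely compelling, stylistically seductive answer to: " + prompt


def small_model(text):
    """Stub for a weaker paraphraser; a real version would call a small
    local model. Here we just crudely flatten the style to show the interface."""
    return "Paraphrase: " + text.lower()


def safe_answer(prompt):
    """Never show the large model's raw output—only the small model's paraphrase."""
    raw = large_model(prompt)
    return small_model(raw)


print(safe_answer("why is the sky blue?"))
```

A browser extension version would intercept the page's displayed completion and swap in the paraphrase before rendering.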

niplav · 20

"One-shotting is possible" is a live hypothesis that I got from various reports from meditation traditions.

I do retract "I learned nothing from this post", the "How does one-shotting happen" section is interesting, and I'd like it to be more prominent. Thanks for poking, I hope I'll find the time to respond to your other comment too.

niplav* · 519

Please don't post 25k words of unformatted LLM (?) output.
