niplav

I operate by Crocker's rules.

Website.

Sequences

Acausal Trade

Comments

niplav · 70

A very related experiment is described in Yudkowsky 2017, and I think one doesn't even need LLMs for this—I started playing with an extremely simple RL agent trained on my laptop, but then got distracted by other stuff before achieving any relevant results. This method of training an agent to be "suspicious" of excessively high rewards would also pair well with model expansion: train the reward-hacking-suspicion circuitry fairly early, so the model can't sandbag it, and lay traps for reward hacking again and again during the gradual expansion process.
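To gesture at what I mean by "suspicious of excessively high rewards": a minimal sketch, where the agent keeps running statistics of the rewards it has seen and flags anything far above the historical mean as probable reward hacking. All names and thresholds here are my own invention for illustration, not from Yudkowsky 2017 or any real setup, and a real version would of course learn this circuitry rather than hard-code a z-score rule.

```python
class SuspiciousAgent:
    """Toy sketch: flag rewards that are anomalously high relative to history."""

    def __init__(self, z_threshold=3.0, min_history=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
        self.z_threshold = z_threshold
        self.min_history = min_history

    def observe(self, reward):
        """Return True if the reward looks suspicious, then update statistics."""
        suspicious = False
        if self.n >= self.min_history:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and (reward - self.mean) / std > self.z_threshold:
                suspicious = True
        # Welford's online update of mean and sum of squared deviations
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)
        return suspicious


agent = SuspiciousAgent()
for r in [1.0, 1.1] * 10:  # 20 ordinary rewards
    agent.observe(r)
print(agent.observe(1.05))   # ordinary reward → False
print(agent.observe(100.0))  # anomalously high reward → True
```

The trap-laying version would then, e.g., penalize the agent whenever it takes the too-good-to-be-true reward instead of flagging it.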

niplav · 90

Thank you for running the competition! It made me use & appreciate Squiggle more, and I expect that a bunch of my estimation workflows in the future will involve generating and then tweaking an AI-generated Squiggle model.

niplav* · 20

My best guess is that the intended reading is "90% of the code at Anthropic", not in the world at large—if I remember the context correctly, that was the option that made the most sense. (I was confused about this at first, and the original context is not clear on whether the claim is about the world at large or about Anthropic specifically.)

niplav · 20

Link in the first line of the post probably should also be https://www.nationalsecurity.ai/.

niplav · 83

Looked unlikely to me, given that the person most publicly associated with MIRI is openly & loudly advocating for funding this kind of work. But maybe the association isn't as strong as I think.

niplav · 53

Great post, thank you. Ideas (to also mitigate extremely engaging/addictive outputs in long conversations):

  • Don't look at the output of the large model directly; instead give it to a smaller model and let the smaller model rephrase it.
    • I don't think there's useful software for this yet, though building it might not be that hard? Could be a browser extension. A to-do for me, I guess.
  • Don't use character.ai and similar sites. Allegedly, users spend on average two hours a day talking on there (though I find that number hard to believe). If I had to guess, they're fine-tuning models to be engaging to talk to, maybe even doing RL based on conversation length. (If they're not doing it yet, a competitor might, or they might in the future.)
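The first idea above is just a wrapper around two model calls, so the interface is easy to sketch. Everything below is hypothetical—`large_model` and `small_model` are stubs standing in for whatever inference calls one would actually use (a frontier API and a small local model, say); the only point is that the large model's raw text never reaches the user.

```python
def large_model(prompt):
    """Stub for the capable-but-potentially-too-engaging large model."""
    return "An extremely compelling, stylistically seductive answer to: " + prompt


def small_model(text):
    """Stub for a weaker paraphraser; a real version would call a small
    local model. Here we just crudely flatten the style to show the interface."""
    return "Paraphrase: " + text.lower()


def safe_answer(prompt):
    """Never show the large model's raw output—only the small model's paraphrase."""
    raw = large_model(prompt)
    return small_model(raw)


print(safe_answer("why is the sky blue?"))
```

A browser extension version would intercept the page's displayed completion and swap in the paraphrase before rendering.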

niplav · 20

"One-shotting is possible" is a live hypothesis that I got from various reports from meditation traditions.

I do retract "I learned nothing from this post", the "How does one-shotting happen" section is interesting, and I'd like it to be more prominent. Thanks for poking, I hope I'll find the time to respond to your other comment too.

niplav* · 519

Please don't post 25k words of unformatted LLM (?) output.
