Former community director EA Netherlands. Now disabled by long covid, ME/CFS. Worried about AGI & US democracy
This makes me wonder if it's possible that "evil personas" can be entirely eliminated from distilled models, by including positive/aligned intent labels/traces throughout the whole distillation dataset
Seems to me the name "AI safety" is still widely used, no? It covers much more than just alignment strategies, since it also includes things like control and governance.
The AI Doomers are only one of several factions that oppose AI and seek to cripple it via weaponized regulation.
Bad faith
There are also factions concerned about “misinformation” and “algorithmic bias,” which in practice means they think chatbots must be censored to prevent them from saying anything politically inconvenient.
Bad faith
AI Doomer coalition abandoned the name “AI safety” and rebranded itself to “AI alignment.”
Seems wrong
What about whistle-blowing and anonymous leaking? Seems like it would go well together with concrete evidence of risk.
This is very interesting, and I had a recent thought that's very similar:
This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?
The way I imagine this working is that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent the agent from properly modelling other agents that aren't trained on this, but it's not obvious to me that this would happen, or that it would be so decisively bad that it outweighs the positives.
I imagine that the ratio of descriptions of desirable behavior to descriptions of undesirable behavior would matter, and perhaps an ideal approach would both (massively) increase the number of descriptions of desirable behavior and filter out the descriptions of unwanted behavior?
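To make the ratio point concrete, here is a toy sketch of the two levers (adding desirable stories, filtering undesirable ones). Everything in it is an assumption for illustration: the labels `"desirable"`, `"undesirable"` and `"other"`, the helper names `rebalance` and `desirable_ratio`, and the idea that documents come pre-labelled at all. A real pipeline would need a classifier to produce those labels, and this says nothing about whether the approach would work.

```python
from collections import Counter

def rebalance(corpus, extra_desirable, drop_undesirable=True):
    """Return a new corpus with extra desirable stories added.

    corpus: list of (text, label) pairs, where label is one of the
    hypothetical tags "desirable", "undesirable" or "other".
    extra_desirable: list of new story texts to add as "desirable".
    drop_undesirable: if True, filter out "undesirable" documents.
    """
    kept = [
        (text, label)
        for text, label in corpus
        if not (drop_undesirable and label == "undesirable")
    ]
    kept.extend((text, "desirable") for text in extra_desirable)
    return kept

def desirable_ratio(corpus):
    """Share of desirable docs among those labelled desirable or undesirable."""
    counts = Counter(label for _, label in corpus)
    labelled = counts["desirable"] + counts["undesirable"]
    return counts["desirable"] / labelled if labelled else None

# Tiny made-up corpus: 1 desirable, 3 undesirable, 2 unrelated documents.
corpus = [
    ("ASI helps its operators", "desirable"),
    ("ASI deceives its operators", "undesirable"),
    ("ASI seizes power", "undesirable"),
    ("ASI hides its goals", "undesirable"),
    ("A recipe for soup", "other"),
    ("Weather report", "other"),
]
new_stories = ["ASI asks before acting", "ASI reports its own mistake"]

before = desirable_ratio(corpus)
added_only = desirable_ratio(rebalance(corpus, new_stories, drop_undesirable=False))
added_and_filtered = desirable_ratio(rebalance(corpus, new_stories))
```

In this toy corpus, adding stories alone moves the ratio from 0.25 to 0.5, while adding and filtering together removes the undesirable descriptions entirely, which is the sense in which the two levers combine.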
Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
I think it might make sense to do it as a research project first? Though you would need to be able to train a model from scratch
I think you should publicly commit to:
If you currently have any of these with the computer use benchmark in development, you should seriously try to get out of those contractual obligations if there are any.
Ideally, you commit to these in a legally binding way, which would make it non-negotiable in any negotiation, and make you more credible to outsiders.
I don't think that all media produced by AI risk concerned people needs to mention that AI risk is a big deal - that just seems annoying and preachy. I see Epoch's impact story as informing people of where AI is likely to go and what's likely to happen, and this works fine even if they don't explicitly discuss AI risk
I don't think that every podcast episode should mention AI risk, but it would be pretty weird in my eyes to never mention it. Listeners would understandably infer that "these well-informed people apparently don't really worry much, maybe I shouldn't worry much either". I think rationalists easily underestimate how much other people's beliefs depend on what the people around them & their authority figures believe.
I think they have a strong platform for discussing risks occasionally. It also simply feels like part of "where AI is likely to go and what's likely to happen".