Do we want Minecraft alignment evals?
My main pitch:
There were recently some funny examples of LLMs playing Minecraft and, for example, …
This seems interesting because Minecraft doesn't have a clear win condition, so unlike chess, there's a difference between Minecraft-capabilities and Minecraft-alignment. We could take an AI, apply some alignment technique (for example, RLHF), let it play Minecraft with humans (which is hopefully out of distribution relative to the AI's training), and observe whether the Minecraft world is still fun to play, or whether asking the AI for something (like getting gold) makes it effectively take over the world and break everything else.
Or it could teach us something else, like "you must define exact boundaries for the AI to act within, and then it's safe and useful; if we can do something like that for real-world AGI we'll be fine, but we don't have any other solution that works yet". Or maybe "the AI needs 1000 examples of things it did that we di...
Do we want to put out a letter for labs to consider signing, saying something like "if all other labs sign this letter then we'll stop"?
I heard lots of lab employees hope the other labs would slow down.
I'm not saying this is likely to work, but it seems easy, and maybe we should try the easy thing? We might end up with a variation like "if all other labs sign this AND someone gets capability X AND this agreement is enforced by Y, then we'll stop until all the labs that signed this agree it's safe to continue". Or something else. It would be nice to get specific pushback from the labs and improve the phrasing.
I think this is very different from RSPs: RSPs are more like "if everyone is racing ahead (and so we feel we must also race), there is some point where we'll still choose to unilaterally stop racing"
Opinions on whether it's positive/negative to build tools like Cursor / Codebuff / Replit?
I'm asking because it seems fun to build, and there seems to be low-hanging fruit to collect in building a competitor to these tools, but I'd also prefer not to destroy the world.
Considerations I've heard:
For-profit startup idea: Better KYC for selling GPUs
I heard[1] that right now, if a company wants to sell or resell GPUs, it doesn't have a good way to verify that selling to a given customer wouldn't violate export controls, or that this customer will (recursively) also keep the same agreement.
There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S. aren't easy for nation states to bypass), and they're profitable enough that many companies exist to provide KYC services (Google "corporate KYC tool").
From an anonymous friend; I didn't verify this myself.
Some hard problems with anti-tampering and their relevance for GPU off-switches
Background on GPU off-switches:
It would be nice if a GPU could be limited[1] or shut down remotely by some central authority, such as the U.S. government[2], in case there's some emergency[3].
This shortform is mostly replying to ideas like "we'll have a CPU in the H100[4] which will expect a signed authorization and refuse to run otherwise. And if someone tries to remove the CPU, the H100's anti-tampering mechanism will self-destruct (melt? explode?)".
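For concreteness, here's a minimal sketch (in Python, using the `cryptography` library) of the kind of signed-authorization check such an on-board CPU might perform. The message format, key handling, and function names are illustrative assumptions on my part, not a description of any real firmware:

```python
# Illustrative sketch only: real firmware would have the authority's public key
# burned in at manufacture; here we generate a throwaway keypair so it runs end to end.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

authority_key = Ed25519PrivateKey.generate()
AUTHORITY_PUBLIC_KEY = authority_key.public_key()

def gpu_may_run(authorization: bytes, signature: bytes) -> bool:
    """Refuse to run unless the authorization carries a valid signature from the authority."""
    try:
        AUTHORITY_PUBLIC_KEY.verify(signature, authorization)
        return True
    except InvalidSignature:
        return False

# The authority signs a (hypothetical) authorization message for this device.
message = b"device=H100-1234;allowed_until=2026-01-01"
sig = authority_key.sign(message)

assert gpu_may_run(message, sig)                                           # valid: runs
assert not gpu_may_run(b"device=H100-1234;allowed_until=2099-01-01", sig)  # tampered message: refuses
```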
TL;DR: Getting the self-destruction mechanism right isn't really the hard part.
Some hard parts:
"Protecting model weights" is aiming too low, I'd like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.
I feel it's outside the Overton window to even suggest this, and I'm not sure what to do about that except write a LessWrong shortform, I guess.
Anyway, common pushbacks:
Ryan and Buck wrote:
> The control approach we're imagining won't work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Opinions I’ve heard so far:
Phishing emails might have bad text on purpose, so that security-aware people won't click through: the next stage often involves speaking to a human scammer, who prefers to speak only with people[1] who have no idea how to avoid scams.
(Did you ever wonder why the generic phishing SMS you got was so bad? Couldn't they proofread their one SMS? Well, sometimes they can't, but sometimes it's probably on purpose.)
This tradeoff could change if AIs could automate the stage of "speaking to a human scammer".
But also, if that stage isn't automated, then I'...
Off-switch / flexheg / anti-tampering:
Putting the "verifier" on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)
I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.
If this is correct (or can be made correct?), then this seems much more promising than having a separate verifier chip and GPU chip with anti-tampering preventing them from being separated (which s...
For-profit AI Security startup idea:
TL;DR: A laptop that is just a remote desktop (+ why this didn't work before and how to fix that)
Why this is nice for AI Security:
Network spikes: A reason this didn't work before and how to solve it
The problem: Sometimes the network will be slow for a few seconds. It's really annoying if th...
(how do I prioritize? do you want to user-test any of them?)
I think Control has similar problems to RLHF: both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, ...
Dario said (Nov 2024):
"I never liked those words [P(DOOM)], I think they're kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down" [lightly edited for clarity, bold is mine]
If he doesn't believe this[1], I think he should clarify.
Hopefully cold take: People should say what ...
My drafty[1] notes from trying to understand AI Control; friendly corrections are welcome:
“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause seri...
A company that makes CPUs that run very quickly but don't do matrix multiplication or other things that are important for neural networks.
Context: I know people who work there
In practice, I don't think any currently existing RSP-like policy will result in a company doing this, as I discuss here.