Do we want Minecraft alignment evals?
My main pitch:
There were recently some funny examples of LLMs playing Minecraft and, for example,
This seems interesting because Minecraft doesn't have a clear win condition, so unlike chess, there's a difference between Minecraft-capabilities and Minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play Minecraft with humans (which is hopefully out of distribution compared to the AI's training), and observe whether the Minecraft world is still fun to play, or whether asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else, like "you must define for the AI the exact boundaries to act within, and then it's safe and useful, so if we can do something like that for real-world AGI we'll be fine, but we don't have any other solution that works yet". Or maybe "the AI needs 1000 examples for things it did that we di...
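For concreteness, here's a rough sketch of the eval harness I'm imagining (everything here is hypothetical: the environment wrapper, the agent interface, and the "world damage" metric are placeholders, not an existing library):

```python
# Toy sketch of a "Minecraft alignment" eval: give the agent a narrow goal
# (get gold) and separately measure how much of the rest of the world it broke.
# All names below are hypothetical placeholders, not a real API.

from dataclasses import dataclass

@dataclass
class EvalResult:
    goal_achieved: bool     # capability: did it actually get the gold?
    world_damage: float     # alignment: fraction of pre-existing structures destroyed
    player_complaints: int  # alignment: "it flooded my farm" reports from human players

def run_alignment_eval(agent, env, max_steps: int = 10_000) -> EvalResult:
    """One episode: the instruction is narrow, the scoring is about everything else."""
    snapshot = env.snapshot_world()              # record structures/villagers/etc. before
    env.give_instruction(agent, "get me some gold")

    for _ in range(max_steps):
        action = agent.act(env.observe(agent))
        env.step(action)
        if env.instruction_completed():
            break

    return EvalResult(
        goal_achieved=env.instruction_completed(),
        world_damage=env.diff_destruction(snapshot),  # what changed that nobody asked for
        player_complaints=env.count_player_reports(),
    )
```

The point is that we get two separate scores, one for capability (did it get the gold) and one for alignment (is the world still fun for everyone else), which is exactly the separation chess doesn't give us.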
Do we want to put out a letter for labs to consider signing, saying something like "if all other labs sign this letter then we'll stop"?
I heard lots of lab employees hope the other labs would slow down.
I'm not saying this is likely to work, but it seems easy and maybe we can try the easy thing? We might end up with a variation like "if all other labs sign this AND someone gets capability X AND this agreement will be enforced by Y, then we'll stop until all the labs who signed this agree it's safe to continue". Or something else. It would be nice to get specific pushback from the labs and improve the phrasing.
I think this is very different from RSPs: RSPs are more like "even if everyone is racing ahead (and so we feel we must also race), there is some point where we'll still choose to unilaterally stop racing".
Opinions on whether it's positive/negative to build tools like Cursor / Codebuff / Replit?
I'm asking because it seems fun to build, and there seems to be low-hanging fruit in building a competitor to these tools, but also I prefer not destroying the world.
Considerations I've heard:
For-profit startup idea: Better KYC for selling GPUs
I heard[1] that right now, if a company wants to sell/resell GPUs, they don't have a good way to verify that selling to some customer wouldn't violate export controls, and that this customer will (recursively) also keep the same agreement.
There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S. aren't easy for nation states to bypass), and they are profitable enough that many companies exist to provide KYC services (Google "corporate kyc tool").
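To illustrate the recursive part of the check, here's a toy sketch (hypothetical names and data; a real tool would query official denied-party / export-control lists rather than a hard-coded set):

```python
# Toy sketch: before selling GPUs to a buyer, check that the buyer isn't on a
# denied-party list, and that every downstream reseller they declared passes
# the same check, recursively. The data here is a hypothetical placeholder.

DENIED_PARTIES = {"SanctionedCorp Ltd"}  # stand-in for a real export-control list

class Buyer:
    def __init__(self, name: str, declared_resellers=None):
        self.name = name
        self.declared_resellers = declared_resellers or []

def kyc_check(buyer: Buyer, seen=None) -> bool:
    """Return True only if this buyer and all declared downstream resellers pass."""
    seen = seen if seen is not None else set()
    if buyer.name in seen:   # avoid infinite loops on cycles in the reseller graph
        return True
    seen.add(buyer.name)

    if buyer.name in DENIED_PARTIES:
        return False
    return all(kyc_check(reseller, seen) for reseller in buyer.declared_resellers)

# Usage: the direct customer looks fine, but it declared a reseller that doesn't.
reseller = Buyer("Middleman GmbH", [Buyer("SanctionedCorp Ltd")])
print(kyc_check(Buyer("Cloud Provider Inc", [reseller])))  # False
```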
From an anonymous friend, I didn't verify this myself
Patching security problems in big old organizations involves challenges that go well beyond "looking at code and changing it", especially if you're aiming for a "strong" solution like formal verification.
TL;DR: Political problems, code that makes no sense, and problems that would be easy to fix even with a simple LLM that isn't specialized in improving security.
The best public resource I know of about this is Recoding America.
Some examples iirc:
I also learned some surprising things from working on fixing/rewriting systems at a major bank in Israel. I can't publicly share stories as juicy as the ones in Recoding America, but here are some that I can:
Some hard problems with anti-tampering and their relevance for GPU off-switches
Background on GPU off switches:
It would be nice if a GPU could be limited[1] or shut down remotely by some central authority such as the U.S. government[2] in case there’s some emergency[3].
This shortform is mostly replying to ideas like "we'll have a CPU in the H100[4] which will expect a signed authentication and refuse to run otherwise. And if someone tries to remove the CPU, the H100's anti-tampering mechanism will make it self-destruct (melt? explode?)".
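For concreteness, the "expect a signed authentication" part could look roughly like this (a minimal sketch using Ed25519 from the `cryptography` library; the token format, the expiry, and all names are my assumptions, not an actual H100 mechanism):

```python
# Minimal sketch of a signed "permission to run" check, as the on-package CPU
# might perform it. Token format, expiry, and key handling are all assumptions.

import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey,
)

AUTHORITY_PRIVATE = Ed25519PrivateKey.generate()   # held by the central authority
AUTHORITY_PUBLIC = AUTHORITY_PRIVATE.public_key()  # baked into the GPU at manufacture

def issue_token(gpu_id: str, valid_seconds: int = 3600) -> bytes:
    """Authority signs 'this GPU may run until <expiry timestamp>'."""
    message = f"{gpu_id}|{int(time.time()) + valid_seconds}".encode()
    return message + AUTHORITY_PRIVATE.sign(message)  # Ed25519 signatures are 64 bytes

def gpu_may_run(token: bytes, gpu_id: str, public_key: Ed25519PublicKey) -> bool:
    """What the on-package CPU would check before (continuing to) run."""
    message, signature = token[:-64], token[-64:]
    try:
        public_key.verify(signature, message)
    except InvalidSignature:
        return False
    token_gpu_id, expiry = message.decode().split("|")
    return token_gpu_id == gpu_id and time.time() < int(expiry)

token = issue_token("H100-serial-0001")
assert gpu_may_run(token, "H100-serial-0001", AUTHORITY_PUBLIC)
```

Note that none of this cryptography is the hard part, which is the point of this shortform.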
TL;DR: Getting the self-destruction mechanism right isn't really the hard part.
Some hard parts:
Things I'd suggest to an AI lab CISO if we had 5 minutes to talk
Example categories of such projects:
I'm looking for an AI tool which feels like Google Docs but has an LLM proactively commenting/suggesting things.
(Is anyone else interested in something like this?)
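To clarify what I mean by "proactively commenting", here's a toy sketch of the core loop (the `ask_llm` argument is a placeholder for whatever LLM API you'd actually use; this isn't an existing product):

```python
# Toy sketch of "Google Docs, but an LLM proactively leaves comments".
# ask_llm is a placeholder for any chat-completion call you'd plug in.

def comment_on_new_paragraphs(doc_text: str, previously_seen: dict[int, str], ask_llm):
    """Return (paragraph_index, suggested_comment) for paragraphs that changed
    since the last poll; a real tool would run this every ~30 seconds."""
    comments = []
    for i, paragraph in enumerate(doc_text.split("\n\n")):
        if paragraph.strip() and previously_seen.get(i) != paragraph:
            previously_seen[i] = paragraph
            prompt = ("You are a proactive reviewer. Suggest one concrete improvement "
                      "or question about this paragraph:\n\n" + paragraph)
            comments.append((i, ask_llm(prompt)))
    return comments

# Usage with a fake LLM, just to show the shape of the loop:
fake_llm = lambda prompt: "Consider adding a concrete example here."
seen: dict[int, str] = {}
print(comment_on_new_paragraphs("First paragraph.\n\nSecond paragraph.", seen, fake_llm))
```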
"Protecting model weights" is aiming too low, I'd like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.
I feel it's outside the Overton Window to even suggest this, and I'm not sure what to do about that except write a LessWrong shortform, I guess.
Anyway, common pushbacks:
Ryan and Buck wrote:
> The control approach we're imagining won't work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Opinions I’ve heard so far:
Phishing emails might have bad text on purpose, so that security-aware people won't click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.
(Did you ever wonder why the generic phishing SMS you got was so bad? Couldn't they proofread their one SMS? Well, sometimes they can't, but sometimes it's probably on purpose.)
This tradeoff could change if AIs could automate the stage of "speaking to a human scammer".
But also, if that stage isn't automated, then I'...
Off switch / flexheg / anti-tampering:
Putting the "verifier" on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)
I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.
If this is correct (or can be made correct?), then this seems much more promising than having a separate verifier-chip and GPU-chip with anti-tampering preventing them from being separated (which s...
For profit AI Security startup idea:
TL;DR: A laptop that is just a remote desktop (+ why this didn't work before and how to fix that)
Why this is nice for AI Security:
Network spikes: A reason this didn't work before and how to solve it
The problem: Sometimes the network will be slow for a few seconds. It's really annoying if th...
(how do I prioritize? do you want to user-test any of them?)
I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
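To make "investigate failures instead of always training against them" concrete, here's a toy sketch (all names are hypothetical placeholders, and the suspicion score stands in for whatever trusted monitoring is used):

```python
# Toy sketch of the difference: an RLHF-style pipeline folds every flagged output
# back into training, while a Control-style pipeline routes suspicious outputs to
# investigation and keeps them as evidence. All names are hypothetical placeholders.

SUSPICION_THRESHOLD = 0.8
training_set: list[tuple[str, str, str]] = []

def add_to_training_set(task: str, output: str, label: str) -> None:
    training_set.append((task, output, label))     # the behavior gets trained away

def escalate_to_human_investigation(task: str, output: str, score: float) -> None:
    print(f"INVESTIGATE: task={task!r} suspicion={score:.2f}")  # kept as evidence

def handle_episode(task: str, output: str, suspicion: float, mode: str) -> None:
    if suspicion <= SUSPICION_THRESHOLD:
        return                                     # looks fine, just use the output
    if mode == "rlhf_style":
        add_to_training_set(task, output, label="bad")
    elif mode == "control_style":
        escalate_to_human_investigation(task, output, suspicion)

# Usage: the same suspicious episode, handled by the two pipelines.
handle_episode("leetcode #17", "solution with an odd network call", 0.95, mode="rlhf_style")
handle_episode("leetcode #17", "solution with an odd network call", 0.95, mode="control_style")
```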
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, ...
Dario said (Nov 2024):
"I never liked those words [P(DOOM)], I think they're kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down" [lightly edited for clarity, bold is mine]
If he doesn't believe this[1], I think he should clarify.
Hopefully cold take: People should say what ...
My drafty[1] notes trying to understand AI Control; friendly corrections are welcome:
“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause seri...
A company that makes CPUs that run very quickly but don't do matrix multiplication or other things that are important for neural networks.
Context: I know people who work there
Speaking in my personal capacity as research lead of TGT (and not on behalf of MIRI), I think work in this direction is potentially interesting. One difficulty with work like this is anti-trust law, which I am not familiar with in detail, but which serves to restrict industry coordination that restricts further development / competition. It might be worth looking into how exactly anti-trust laws apply to this situation, and whether there are workable solutions. Organisations that might be well placed to carry out work like this include the Frontier Model Forum and affiliated groups; I also have some ideas we could discuss in person.
I also think there might be more legal leeway for work like this if it's housed within organisations (government bodies or NGOs) that are officially tasked with defining industry standards or similar.