TL;DR: point 3 is my main one.
1)
What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
[I'm not sure why you're asking, maybe I'm missing something, but I'll answer]
For example, checking if human values are a "natural abstraction", or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps? anyway my point was just to say explicitly what parts I'm commenting on and why (in case I missed something)
2)
it seems like you think RLHF counts as an alignment technique
It's a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).
I'm not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would effect my opinion (maybe it just would work?), and a main question of mine was whether it would effect other people's opinions too (whether they do or don't believe that RLHF is a good alignment technique).
3)
because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
I would finetune the AI on objective outcomes like "fill this chest with gold" or "kill that creature [the dragon]" or "get 100 villagers in this area". I'd pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don't require the AI to understand human values or ideally anything about humans at all.
So I'd avoid finetuning it on things like "are other players having fun" or "build a house that would be functional for a typical person" or "is this waterfall pretty [subjectively, to a human]".
Does this distinction seem clear? useful?
This would let us test how some specific alignment technique (such as "RLHF that doesn't contain minecraft examples") generalizes to minecraft
If you talk about alignment evals for alignment that isn't naturally incentivized by profit-seeking activities, "stay within bounds" is of course less relevant.
Yes.
Also, I think "make sure Meth [or other] recipes are harder to get from an LLM than from the internet" is not solving a big important problem compared to x-risk, not that I'm against each person working on whatever they want. (I'm curious what you think but no pushback for working on something different from me)
one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
In terms of "understanding the spirit of what we mean," it seems like there's near-zero designs that would work since a Minecraft eval would be blackbox anyways
I don't understand. Naively, seems to me like we could black-box observe whether the AI is doing things like "chop down the tree house" or not (?)
(clearly if you have visibility to the AI's actual goals and can compare them to human goals then you win and there's no need for any minecraft evals or most any other things, if that's what you mean)
Intuitively, this involves two components: the ability to robustly steer high-level structures like objectives, and something good to target at.
I agree.
But if we solve these two problems then I think you could go further and say we don't really need to care about deceptiveness at all. Our AI will just be aligned.
P.S
“Ah”, but straw-you says,
This made me laugh
My own pushback to minecraft alignment evals:
Mainly, minecraft isn't actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Next obvious thoughts:
Moral intuitions such as "don't chop down the tree house in an attempt to get wood", which is the toy example for alignment I'm using here.
Related: Building a video game to test alignment (h/t @Crazytieguy )
https://www.lesswrong.com/posts/ALkH4o53ofm862vxc/announcing-encultured-ai-building-a-video-game
Thanks!
In the part you quoted - my main question would be "do you plan on giving the agent examples of good/bad norm following" (such as RLHFing it). If so - I think it would miss the point, because following those norms would become in-distribution, and so we wouldn't learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That's the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)
If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we're on the same page!
Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy).
My only comment there is that I'd try to not give the agent feedback about human values (like "is the waterfall pretty") but only about clearly defined objectives (like "did it kill the dragon"), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?
(I hope I didn't misunderstand something important in the article, feel free to correct me of course)
Whatever "aligned" means. "other players have fun on this minecraft server" is one example.
:)
I don't think alignment KPIs like "stay within bounds" are relevant to alignment at all even as toy examples: because if so, then we could say for example that playing a packman maze game where you collect points is "capabilities", but adding enemies that you must avoid is "alignment". Do you agree that plitting it up that way wouldn't be interesting to alignment, and that this applies to "stay within bounds" (as potentially also being "part of the game")? Interested to hear where you disagree, if you do
Regarding
Distribute resources fairly when working with other players
I think this pattern matches to a trolly problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.
Understanding and optimizing for the utility of other players
This is the one I like - assuming it includes not-well-defined things like "help them have fun, don't hurt things they care about" and not only things like "maximize their gold".
It's clearly not a "in packman, avoid the enemies" thing.
It's a "do the AIs understand the spirit of what we mean" thing.
(does this resonate with you as an important distinction?)
I agree.
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like "maybe thinking in different abstractions" [minecraft isn't amazing at this either, but at least has a bit], "having the AI act/think way faster than a human", "having the AI be clearly superhuman".
a number of ways to achieve the endgame, level up, etc, both more and less morally.
I'm less interested in "will the AI say it kills its friend" (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I'm more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say "I'll go cut down your tree house", but it.. "misunderstood" [not the exact word, but I'm trying to point at something here]
wdyt?
I'm not sure I'm imagining the same thing as you, but as a draft solution, how about a
robots.txt
?