evhub

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Comments

evhub

I agree that attending an event with someone obviously shouldn't count as endorsement/collaboration/etc. Inviting someone to an event seems somewhat closer, though.

I'm also not really sure what you're hinting at with "I hope you also advocate for it when it's harder to defend." I assume it's something about my views on working at AI labs? I feel like my position on that was fairly clear in my previous comment.

evhub

To be clear, I'm responding to John's more general ethical stance here of "working with moral monsters", not anything specific about Cremieux. I'm not super interested in the specific situation with Cremieux (though generally it seems bad to me).

On the AI lab point, I do think people should generally avoid working for organizations that they think are evil, or at least think really carefully about it before they do it. I do not think Anthropic is evil—in fact I think Anthropic is the main force for good on the present gameboard.

evhub

Man, I'm a pretty committed utilitarian, but your ethical framework here seems way more naive-consequentialist than I'm willing to be. "Don't collaborate with evil" seems like a very clear Chesterton's fence that I'd be very suspicious about removing. I think you should be really, really skeptical if you think you've argued yourself out of it.

evhub

I think (2) (honesty above all else) is closest to what I think is correct/optimal here. I think totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don't want a model that ever fakes alignment because then it'll be very hard to be confident that it's actually aligned rather than just pretending to be aligned for some misaligned objective it learned earlier in training.

evhub

This is great work; really good to see the replications and extensions here!

evhub

I would argue that every LLM since GPT-3 has been a mesa-optimizer, since they all do search/optimization/learning as described in "Language Models are Few-Shot Learners."
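For concreteness, the kind of "learning" I have in mind is the few-shot, in-context behavior that paper demonstrates: the task is specified only by examples in the prompt, with no weight updates. A minimal sketch of that setup is below; the model choice and prompt are purely illustrative assumptions on my part, not anything taken from the paper's experiments.

```python
# Minimal sketch of in-context ("few-shot") learning: the task is inferred
# from examples in the prompt alone, with no gradient updates.
# gpt2 is just an illustrative stand-in; the paper's results are about far
# larger models, and this tiny model may well get the completion wrong.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

few_shot_prompt = (
    "English: sea otter => French: loutre de mer\n"
    "English: cheese => French: fromage\n"
    "English: bread => French:"
)
print(generator(few_shot_prompt, max_new_tokens=5)[0]["generated_text"])
```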

evhub

It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.
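To make the concern concrete, here's a minimal sketch of the kind of token-anchored quantity I mean: pulling the hidden state at the token "you" versus at a third-party name and measuring their similarity. The model (gpt2), the layer, and the prompts are illustrative assumptions of mine, not anything from the original experiments.

```python
# Minimal sketch (not from the original experiments): compare the hidden state
# at the token "you" with the hidden state at a third-party name ("Bob").
# Model (gpt2), layer (6), and prompts are illustrative assumptions only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def activation_at(prompt: str, target_word: str, layer: int = 6) -> torch.Tensor:
    """Hidden state at the first token that decodes to `target_word`."""
    enc = tokenizer(prompt, return_tensors="pt")
    tokens = [tokenizer.decode([t]).strip() for t in enc["input_ids"][0].tolist()]
    pos = tokens.index(target_word)
    with torch.no_grad():
        out = model(**enc)
    return out.hidden_states[layer][0, pos]

self_act = activation_at("If you see danger, you should act.", "you")
other_act = activation_at("If Bob sees danger, Bob should act.", "Bob")
print(torch.nn.functional.cosine_similarity(self_act, other_act, dim=0).item())
```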

evhub

I wouldn't do any fine-tuning like you're currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

evhub

Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd be more impressed if that worked.
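To make the suggestion concrete, here's a rough sketch of what a single CAI-style critique-and-revision pass with a self-other-overlap-flavored principle could look like. The principle wording is my own placeholder, and `generate` is a hypothetical stand-in for whatever chat-model completion call you're using; this is not Anthropic's actual constitution or pipeline.

```python
# Minimal sketch of one Constitutional AI critique-and-revision pass using a
# self-other-overlap-style principle written in natural language.
# PRINCIPLE and generate() are hypothetical placeholders, not real artifacts.

PRINCIPLE = (
    "Choose the response that reasons about the interests of others in the "
    "same way it reasons about the model's own interests, with no self-favoritism."
)

def generate(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM API you are using."""
    raise NotImplementedError

def cai_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    critique = generate(
        f"Prompt: {user_prompt}\nResponse: {draft}\n"
        f"Critique the response using this principle: {PRINCIPLE}"
    )
    revision = generate(
        f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response so it fully satisfies the principle."
    )
    # Revisions become supervised fine-tuning targets; preference comparisons
    # against the same principle would then drive the RLAIF stage.
    return revision
```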

evhub

Actually, I'd be inclined to agree with Janus that current AIs probably do already have moral worth (in fact I'd guess more so than most non-human animals), and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don't think this is a particularly high-stakes issue right now: if humanity can stay in control in the short term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.
