evhub

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wikitag Contributions

Comments

Sorted by
evhub*Ω453

It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.

evhubΩ220

I wouldn't do any fine-tuning like you're currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

evhubΩ8163

Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd more impressed if that worked.

evhub*53

Actually, I'd be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I'd guess more so than most non-human animalsand furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don't think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.

evhub110

redesigned

What did it used to look like?

evhubΩ1322-3

Some random thoughts on CEV:

  1. To get the obvious disclaimer out of the way: I don't actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we're considering questions like this, that means we've reached superintelligence, and we'll either trust the AIs to be better than us at thinking about these sorts of questions, or we'll be screwed regardless of what we do.[1]
  2. Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
    1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don't want to incentivize any of that sort of hacking either.
    2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There's a bunch of random chance here that imo shouldn't be morally relevant.
    3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
  3. So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
    1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.

  1. I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though). ↩︎

evhubΩ7104

A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

evhubΩ6110

We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.

evhubΩ5100

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

evhubΩ44-4

I wish the post more strongly emphasized that regulation was a key part of the picture

I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.

Load More