mattmacdermott

Comments

I'd be curious to know if conditioning on high agreement alone had less of this effect than conditioning on high karma alone (because something many people agree on is unlikely to be a claim of novel evidence, and more likely to be a take).

Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.

More or less, yes. But I don't think it suggests there might be other prompts around that unlock similar improvements -- chain-of-thought works because it allows the model to spend more serial compute on a problem, rather than because of something really important about the words.
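As a toy illustration of the serial-compute point (a sketch of my own, not a measurement from any paper): an autoregressive model runs one fixed-depth forward pass per generated token, so emitting k reasoning tokens before the answer buys k extra sequential computation steps, regardless of which particular words those tokens are.

```python
# Toy sketch, not real benchmark code: count the sequential forward passes an
# autoregressive model performs when it answers directly vs. with a chain of thought.

def serial_forward_passes(reasoning_tokens: int, answer_tokens: int = 1) -> int:
    """Each generated token costs one sequential (non-parallelisable) forward pass."""
    return reasoning_tokens + answer_tokens

print(serial_forward_passes(reasoning_tokens=0))   # direct answer: 1 serial pass
print(serial_forward_passes(reasoning_tokens=50))  # 50-token chain of thought: 51 serial passes
```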

Agree that pauses are a clearer line. But even if a pause and tool-limit are both temporary, we should expect the full pause to have to last longer.

One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the coordination problem isn't as difficult, because you might just need to get the smallish pool of leading actors to coordinate for a while, rather than everyone to coordinate indefinitely.

I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximate Solomonoff induction.

Even granting that, do you think the same applies to the cognition of an AI created using deep learning -- is it approximating Solomonoff induction when presented with a new problem at inference time?

I think it's not, for reasons like the ones in aysja's comment.
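For readers who want the Solomonoff claim made concrete, here is a minimal, purely illustrative sketch of Solomonoff-style prediction over a tiny hand-picked hypothesis class (real Solomonoff induction mixes over all programs and is uncomputable; the hypotheses and description lengths below are made up for illustration):

```python
# Illustrative only: Solomonoff-style prediction over a tiny, hand-picked hypothesis class.
# Each hypothesis is weighted by 2^-(description length); hypotheses inconsistent with
# the observed prefix are discarded, and the survivors vote on the next bit.

from fractions import Fraction

# (description length in bits, rule mapping index -> predicted bit) -- lengths are made up
HYPOTHESES = [
    (2, lambda i: 0),                  # "all zeros": short program
    (3, lambda i: i % 2),              # "alternate 0,1"
    (7, lambda i: 0 if i < 4 else 1),  # "four zeros, then ones": longer program
]

def predict_next_is_one(observed: list[int]) -> Fraction:
    """Posterior probability that the next bit is 1 under the 2^-length prior."""
    total = Fraction(0)
    ones = Fraction(0)
    for length, rule in HYPOTHESES:
        if all(rule(i) == bit for i, bit in enumerate(observed)):
            weight = Fraction(1, 2 ** length)
            total += weight
            if rule(len(observed)) == 1:
                ones += weight
    return ones / total if total else Fraction(1, 2)

# After seeing 0,0,0,0 the shorter "all zeros" program dominates the mixture,
# so the predicted probability of a 1 is small: 1/33.
print(predict_next_is_one([0, 0, 0, 0]))
```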

Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.

I think it's downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we've made progress eliminating hypotheses from this list.

Fair enough, yeah -- this seems like a very reasonable angle of attack.

It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.

As written, aren't Hypothesis 1 (written goal specification), Hypothesis 2 (developer-intended goals), and Hypothesis 3 (unintended version of written goals and/or human intentions) all compatible with either kind of AI?

Hypothesis 4 (reward/reinforcement) does assume a consequentialist, and so does Hypothesis 5 (proxies and/or instrumentally convergent goals) as written, although it seems like 'proxy virtues' could maybe be a thing too?

(Unrelatedly, it's not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I'm missing something.)

Maybe I shouldn't have used "Goals" as the term of art for this post, but rather "Traits?" or "Principles?" Or "Virtues."

I probably wouldn't prefer any of those to "goals". I might use "Motivations", but I also think it's ok to use "goals" in this broader way and "consequentialist goals" when you want to make the distinction.
