Adam Scholl

For two, a person who has done evil and a person who is evil are quite different things. Sadly, I think a person's character is not always aligned with any particular behavior of theirs.

I do think many of the historical people now most widely considered evil were similarly not awful in full generality, or even across most contexts. For example, Eichmann, the ops lead for the Holocaust, was apparently a good husband and father, and generally took care not to violate local norms in his life or work. Yet personally I feel quite comfortable describing him as evil, despite "evil" being a fuzzy folk term of the sort which tends to imperfectly/lossily describe any given referent.

I share the sense that "flaky breakthroughs" are common, but also... I mean, it clearly is possible for people to learn and improve, right? Including by learning things about themselves which lastingly affect their behavior.

Personally, I've had many such updates which have had lasting effects—e.g., noticing when reading the Sequences that I'd been accidentally conflating "trying as hard as I can" with "appearing to others to be trying as hard as one might reasonably be expected to" in some cases, and trying thereafter to correct for that.

I do think it's worth tracking the flaky breakthrough issue—which seems to me most common with breakthroughs primarily about emotional processing, or the experience of quite-new-feeling sorts of mental state, or something like that?—but it also seems worth tracking that people can in fact sometimes improve!

I think the word "technical" is a red herring here. If someone tells me a flood is coming, I don't much care how much they know about hydrodynamics, even if in principle this knowledge might allow me to model the threat with more confidence. Rather, I care about things like e.g. how sure they are about the direction from which the flood is coming, about the topography of our surroundings, etc. Personally, I expect I'd be much more inclined to make large/confident updates on the basis of information at levels of abstraction like these than at levels like hydrodynamics or particle physics, however much more "technical," or related-in-principle in some abstract reductionist sense, the latter may be.

I do think there are also many arguments beyond this simple one which clearly justify additional (and more confident) concern. But I try to assess such arguments based on how compelling they are, where "technical precision" is one factor which might influence this, but hardly the only one; e.g., another is whether the argument even involves the relevant level of abstraction, or bears on the question at hand.

I think the simple argument "building minds vastly smarter than our own seems dangerous" is in fact pretty compelling, and seems relatively easy to realize beforehand, as e.g. Turing and many others did. Personally, no technical facts about current ML systems shift my overall estimate of our likelihood of survival, in either direction, more than this simple argument does.

And I see little reason why they should—technical details of current AI systems strike me as about as relevant to predicting whether future, vastly more intelligent systems will care about us as technical details of neuronal firing in beetles are to predicting whether a given modern government will care about us. Certainly modern governments wouldn't exist if neurons hadn't evolved, and I expect one could in fact probably gather some information relevant to predicting them by studying beetle neurons; maybe even a lot, in principle. It just seems a rather inefficient approach, given how distant the object of study is from the relevant question.

I interpreted Habryka's comment as making two points, one of which strikes me as true and important (that it seems hard/unlikely for this approach to allow for pivoting adequately, should that be needed), and the other of which was a misunderstanding (that they don't literally say they hope to pivot if needed).

We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.

Sam, I'm confused where this degree of confidence is coming from? I found this post helpful for understanding Anthropic's strategy, but there wasn't much argument given about why one should expect the strategy to work, much less to "almost entirely" mitigate the risk!

To me, this seems wildly overconfident given the quality of the available evidence—which, as Aysja notes, involves auditing techniques like e.g. simply asking the models themselves to rate their evil-ness on a scale from 1-10... I can kind of understand evidence like this informing your background intuitions and choice of research bets and so forth, but why think it justifies this much confidence that you'll catch/fix misalignment?

Yeah, I buy that he cares about misuse. But I wouldn't quite use the word "believe," personally, about his acting as though alignment is easy—I think if he had actual models or arguments suggesting that, he probably would have mentioned them by now.

No, I agree it's worth arguing the object level. I just disagree that Dario seems to be "reasonably earnestly trying to do good things," and I think this object-level consideration seems relevant (e.g., insofar as you take Anthropic's safety strategy to rely on the good judgement of their staff).

Dario/Anthropic-leadership are at least reasonably earnestly trying to do good things within their worldview

I think as stated this is probably true of the large majority of people, including e.g. the large majority of the most historically harmful people. "Worldviews" sometimes reflect underlying beliefs that lead people to choose actions, but they can of course also be formed post-hoc, to justify whatever choices they wished to make.

In some cases, one can gain evidence about which sort of "worldview" a person has, e.g. by checking it for coherency. But this isn't really possible to do with Dario's views on alignment, since to my knowledge, excepting the Concrete Problems paper, he has never actually written anything about the alignment problem.[1] Given this, I think it's reasonable to guess that he does not have a coherent set of views which he's neglected to mention, so much as the more human-typical "set of post-hoc justifications."

(In contrast, he discusses misuse regularly—and ~invariably changes the subject from alignment to misuse in interviews—in a way which does strike me as reflecting some non-trivial cognition).

  1. ^

    Counterexamples welcome! I've searched a good bit and could not find anything, but it's possible I missed something.

I spent some time learning about neural coding once, and while interesting, it sure didn't help me e.g. better predict my girlfriend; I think neuroscience is, in general, fairly unhelpful for understanding psychology. For similar reasons, I'm default-skeptical of claims that work at the level of abstraction of ML is likely to help with figuring out whether powerful systems trained via ML are trying to screw us, or with preventing that.
