We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.
Sam, I'm confused about where this degree of confidence is coming from. I found this post helpful for understanding Anthropic's strategy, but there wasn't much argument given for why one should expect the strategy to work, much less to "almost entirely" mitigate the risk!
To me, this seems wildly overconfident given the quality of the available evidence, which, as Aysja notes, involves auditing techniques like simply asking the models themselves to rate their evil-ness on a scale from 1-10... I can kind of understand evidence like this informing your background intuitions and choice of research bets and so forth, but why think it justifies this much confidence that you'll catch/fix misalignment?
Yeah, I buy that he cares about misuse. But personally, I wouldn't quite use the word "believe" for his acting as though alignment is easy; I think if he had actual models or arguments suggesting that, he probably would have mentioned them by now.
Dario/Anthropic-leadership are at least reasonably earnestly trying to do good things within their worldview
I think as stated this is probably true of the large majority of people, including e.g. the large majority of the most historically harmful people. "Worldviews" sometimes reflect underlying beliefs that lead people to choose actions, but they can of course also be formed post-hoc, to justify whatever choices they wished to make.
In some cases, one can gain evidence about which sort of "worldview" a person has, e.g. by checking it for coherence. But this isn't really possible to do with Dario's views on alignment, since to my knowledge, excepting the Concrete Problems paper, he has never written anything about the alignment problem.[1] Given this, I think it's reasonable to guess that he does not have a coherent set of views which he's neglected to mention, so much as the more human-typical "set of post-hoc justifications."
(In contrast, he discusses misuse regularly—and ~invariably changes the subject from alignment to misuse in interviews—in a way which does strike me as reflecting some non-trivial cognition).
Counterexamples welcome! I've searched a good bit and could not find anything, but it's possible I missed something.
I spent some time learning about neural coding once, and while interesting, it sure didn't help me e.g. better predict my girlfriend; I think neuroscience is in general fairly unhelpful for understanding psychology. For similar reasons, I'm default-skeptical of claims that work at the level of abstraction of ML is likely to help with figuring out whether powerful systems trained via ML are trying to screw us, or with preventing that.
I haven't perceived the degree of focus as intense, and if I had, I might be tempted to level similar criticism. But I think current people/companies do clearly matter some, and so warrant some focus. For example:
When do you think would be a good time to lock in regulation? I personally doubt RSP-style regulation would even help, but the notion that now is too soon, or risks locking in early sketches, strikes me as in some tension with e.g. Anthropic trying to automate AI research ASAP, Dario expecting ASL-4 systems between 2025 (the current year!) and 2028, etc.
Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
It seems to me that other possibilities exist, besides "has model with numbers" or "confused." For example, that there are relevant ethical considerations here which are hard to crisply, quantitatively operationalize!
One such consideration which feels especially salient to me is the heuristic that before doing things, one should ideally try to imagine how people would react upon learning what one did. In this case the action in question involves creating new minds vastly smarter than any person, which pose a double-digit risk of killing everyone on Earth, so my guess is that the reaction would entail things like literal worldwide riots. If so, this strikes me as the sort of consideration one should generally weight more highly than one's idiosyncratic utilitarian BOTEC.
The only safety techniques that count are the ones that actually get deployed in time.
True, but note this doesn't necessarily imply trying to maximize your impact in the mean timelines world! Alignment plans vary hugely in potential usefulness, so I think it can pretty easily be the case that your highest EV bet would only pay off in a minority of possible futures.
I interpreted Habryka's comment as making two points, one of which strikes me as true and important (that it seems hard/unlikely for this approach to allow for pivoting adequately, should that be needed), and the other of which was a misunderstanding (that they don't literally say they hope to pivot if needed).