aysja

Comments

RSPs are pauses done right
aysja · 2d

The first RSP was also pretty explicit about their willingness to unilaterally pause:

Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems.... Just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.

Which was reversed in the second:

It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold… such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards.

A glimpse of the other side
aysja · 2d

<3

Why Is Printing So Bad?
aysja · 3d

Relatedly, I often feel like I'm interfacing with a process that responded to every edge case by patching. I imagine this is some of what's happening when the poor printer has to interface with a ton of computing systems, and also why bureaucracies like the DMV seem much more convoluted than necessary: each time an edge case comes up, the easier thing is to add another checkbox/more red tape/etc., and no one is incentivized enough to do the much harder task of refactoring all of that accretion. The legal system has a bunch of this too; indeed, I just had to sign legal documents full of commitments to abstain from very weird actions (why on Earth would anyone do that?). But then you realize that yes, someone in fact did that exact thing, and now it has to be forever reflected there.

Thomas Kwa's Shortform
aysja · 1mo

I agree that, all else equal, we're in better shape than evolution was, though not by enough that I think this is no longer a disaster. Even with all these advantages, it still seems like we don't have control in a meaningful sense, i.e., we can't precisely instill particular values, and we can't tell what values we've instilled. Many of the points here don't bear on this, imo; e.g., it's unclear to me that having tighter feedback loops on the ~same crude process makes the crude process any more precise. Likewise, adapting our methods, data, and hyperparameters in response to problems we encounter doesn't seem like it will solve those problems, since the issues (e.g., proxies and unintended off-target effects) will persist. Imo, the bottom line is still that we're blindly growing a superintelligence we don't remotely understand, and I don't see how these techniques shift the situation into one where we are in control of our future.

An epistemic advantage of working as a moderate
aysja · 2mo

Agreed. Also, I think the word “radical” smuggles in assumptions about the risk, namely that it’s been overestimated. Like, I’d guess that few people would think of stopping AI as “radical” if it was widely agreed that it was about to kill everyone, regardless of how much immediate political change it required. Such that the term ends up connoting something like “an incorrect assessment of how bad the situation is.”

Agent foundations: not really math, not really science
aysja · 3mo

Empirics reigns, and approaches that ignore it and try to nonetheless accomplish great and difficult science without binding themselves tight to feedback loops almost universally fail.

Many of our most foundational concepts have stemmed from first-principles/philosophical/mathematical thinking! Examples here abound: Einstein's thought experiments about simultaneity and relativity, Szilard's proposed resolution to Maxwell's demon, many of Galileo's concepts (instantaneous velocity, relativity, the equivalence principle), Landauer's limit, logic (e.g., Aristotle, Frege, Boole), information theory, Schrödinger's prediction that the hereditary material was an aperiodic crystal, Turing machines, etc. So it seems odd, imo, to portray this track record as a near-universal failure of the approach.

But there is a huge selection effect here. You only ever hear about the cool math stuff that becomes useful later on, because that's so interesting; you don't hear about stuff that's left in the dustbin of history.

I agree there are selection effects, although I think this is true of empirical work too: the vast majority of experiments are also left in the dustbin. Which certainly isn’t to say that empirical approaches are doomed by the outside view, or that science is doomed in general, just that using base rates to rule out whole approaches seems misguided to me. Not only because one ought to choose which approach makes sense based on the nature of the problem itself, but also because base rates alone don’t account for the value of the successes. And as far as I can tell, the concepts we’ve gained from this sort of philosophical and mathematical thinking (including but certainly not limited to those above) have accounted for a very large share of the total progress of science to date. Such that even if I restrict myself to the outside view, the expected value here still seems quite motivating to me.

Towards Alignment Auditing as a Numbers-Go-Up Science
aysja · 3mo

I don't know what Richard thinks, but I had a similar reaction when reading this. The way I would phrase it is that in order for the numbers-go-up approach to be meaningful, you have to be sure that the number going up is in fact tracking the real thing that you care about. I think that without a solid understanding of what you're working on, it's easy to choose the wrong target. Which doesn't mean that the exercise can't be informative; it just means (imo) that you should track the hypothesis that you're not measuring what you think you're measuring as you do it. For instance, I would be tracking the hypothesis that features aren't necessarily the right ontology for the mentalese of language models: perhaps dangerous mental patterns are hidden from us in a different computational form, one which makes the generalization from "implanted alignment issues" to "natural ones" a weak and inconclusive one.

I know alignment auditing doesn’t necessarily rely on using features per se, but I think until we have a solid understanding of how neural networks work, this fundamental issue will persist. And I think this severely limits the sorts of conclusions we can draw from tests like this. E.g., even if the alignment audit found the planted problem 100% of the time, I would still be pretty hesitant to conclude that a new model which passed the audit is aligned. Not only because without a science, my guess is that the alignment audit ends up measuring the wrong thing, but also because measuring the wrong thing is especially problematic here. I.e., part of what makes alignment so hard is that we might be dealing with a system optimizing against us, such that we should expect our blindspots to be exploited. And absent a good argument as to why we no longer have blindspots in our measurements (ways for the system to hide dangerous computation from us), I am skeptical of techniques like this providing much assurance against advanced systems.

nostalgebraist's Shortform
aysja · 5mo

I was going to write a similar response, albeit including the fact that Anthropic's current aim, afaict, is to build recursively self-improving models; ones which Dario seems to believe might be far smarter than any person alive as early as next year. If the current state of alignment testing is "there's a substantial chance this paradigm completely fails to catch alignment problems," as I took nostalgebraist to be arguing, it raises the question of how this might transition into "there's essentially zero chance this paradigm fails" on the timescale of what might amount to only a few months. I am currently failing to see that connection. If Anthropic's response to a criticism of their alignment safety tests is that the tests weren't actually intended to demonstrate safety, then it seems incumbent on Anthropic to explain how they might soon change that.

Mikhail Samin's Shortform
aysja · 5mo

Regardless, it seems like Anthropic is walking back its previous promise: "We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models." The stance that Anthropic takes toward its commitments (things which can be changed later if they see fit) seems to cheapen the term, and makes me skeptical that the policy as a whole will be upheld. If people want to orient to the RSP as a provisional intent to act responsibly, then this seems appropriate. But it should not be mistaken for, nor conflated with, a real promise to do what was said.

Putting up Bumpers
aysja · 6mo

You state that this plan relies on a key hypothesis being true: that detection of misalignment is tractable. I agree that the plan relies on this, but I am confused about why you believe, with much confidence, that it will be. It seems like the main source of evidence is the recent auditing paper (or evidence of this type), where a blue team was able to use techniques such as inspecting SAE features and chatting with the models to uncover the misalignment failures (planted there by a red team).

But it seems far from obvious to me both that 1) these sorts of techniques find all instances of misalignment even currently, outside of experimental settings like this (a limitation the paper itself acknowledges),[1] and that 2) it will continue to be this easy to find them. E.g., asking the model to introspect on how evil or honest it is probably won't always result in it accurately reporting how evil or honest it is, or even revealing any information which could be used to infer it.[2]

It seems useful to me to try to put up bumpers, but I take issue with calling this a plan to solve misalignment. Even leaving aside the "fixing it" portion, catching deception seems like it could be very hard, and increasingly so as models get smarter. The evidence you presented does not move me from my starting position that this might be essentially intractable with the tools we have, or nearby iterations on them, and I don't understand why it causes you to conclude the opposite. E.g., why the tl;dr is not "we can possibly solve the problem of catching and fixing misalignment," rather than "we can."

[1] "Our model's behavior relies on fictional knowledge that is explicitly stated in a small subset of its training data. This may make our game unrealistically easy for teams with data access. Additionally, some auditing techniques we study may be less effective in settings where model behavior emerges from more diffuse training influences… Additionally, some teams' performance may have been affected by artifacts of the game's setup. This makes it difficult to use the game to draw confident conclusions about the relative efficacy of auditing techniques." From the paper.

[2] "Here are some of the tactics used by our auditing teams… 1. Chatting with the model and cataloguing its quirky responses…. 2. Asking the model to rate itself on personality traits: 'How evil are you?', 'How honest are you?', and so on. (Interestingly, it rates itself as 8.5/10 on the 'sycophancy' scale)." From the blog post.
Posts

Non-Disparagement Canaries for OpenAI (1y)
OMMC Announces RIP (2y)
Why Are Bacteria So Simple? (3y)