johnswentworth

Sequences

From Atoms To Agents
"Why Not Just..."
Basic Foundations for Agent Models
Framing Practicum
Gears Which Turn The World
Abstraction 2020
Gears of Aging
Model Comparison

Comments

I think that's basically right, and good job explaining it clearly and compactly.

I would also highlight that it's not just about adversaries. One of the main powers of proof-given-assumptions is that it allows us to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows us to detect previously-unknown unknowns.

Now to get a little more harsh...

Without necessarily accusing Kaj specifically, this general type of argument feels motivated to me. It feels like willful ignorance, like sticking one's head in the sand and ignoring the available information, because one wants to believe that All Research is Valuable or that one's own research is valuable or some such, rather than facing the harsh truth that much research (possibly one's own) is predictably-in-advance worthless.

That sort of reasoning makes sense insofar as it's hard to predict which small pieces will be useful. And while that is hard to some extent, it is not full we-just-have-no-idea-so-use-a-maxent-prior hard. There is plenty of work (including lots of research which people sink their lives into today) which will predictably-in-advance be worthless. And robust generalizability is the main test I know of for that purpose.

Applying this to your own argument:

> Often when I've had a hypothesis about something that interests me, I've been happy that there has been *so much* scientific research done on various topics, many of them seemingly insignificant. While most of it is of little interest to me, the fact that there's so much of it means that there's often some prior work on topics that do interest me.

It will predictably and systematically be the robustly generalizable things which are relevant to other people in unexpected ways.

Yup, if you actually have enough knowledge to narrow it down to e.g. a 65% chance of one particular major route, then you're good. The challenging case is when you have no idea what the options even are for the major route, and the possibility space is huge.

Yeah ok. Seems very unlikely to actually happen, and I'm unsure whether it would even work in principle (as e.g. scaling might not take you there at all, or might become more resource-intensive faster than the AIs can produce more resources). But I buy that someone could try to intentionally push today's methods (both AI and alignment) to far superintelligence and simply turn down any opportunity to change paradigm.

Aliens kill you due to slop, humans depend on the details.

The basic issue here is that the problem of slop (i.e. outputs which look fine upon shallow review but aren't fine), plus the problem of aligning a parent-AI in such a way that its more-powerful descendants will robustly remain aligned, is already the core of the superintelligence alignment problem. You need to handle those problems in order to safely do the handoff, and at that point the core hard problems are done anyway. The same still applies to aliens: in order to safely do the handoff, you need to handle the "slop/nonslop is hard to verify" problem, and you need to handle the "make sure agents the aliens build will also be aligned, and their children, etc" problem.

> It's not clear to me we'll have (or will "need") new paradigms before fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks.

If you want to not die to slop, then "fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks" is not a thing which happens at all until the full superintelligence alignment problem is solved. That is how you die to slop.

> If by superintelligence, you mean wildly superhuman AI, it remains non-obvious to me that new paradigms are needed (though I agree they will pretty likely arise prior to this point due to AIs doing vast quantities of research if nothing else). I think thoughtful and laborious implementation of current-paradigm strategies (including substantial experimentation) could directly reduce risk from handing off to superintelligence down to perhaps 25%, and I could imagine being argued considerably lower.

I find it hard to imagine such a thing being at all plausible. Are you imagining that jupiter brains will be running neural nets? That their internal calculations will all be differentiable? That they'll be using strings of human natural language internally? I'm having trouble coming up with any "alignment" technique of today which would plausibly generalize to far superintelligence. What are you picturing?

> This post seems to assume that research fields have big hard central problems that are solved with some specific technique or paradigm.
>
> This isn't always true. [...]

I would say it is basically-always true, but there are some fields (including deep learning today, for purposes of your comment) where the big hard central problems have already been solved, and therefore the many small pieces of progress on subproblems are all of what remains.

And insofar as there remains some problem which is simply not solvable within a certain paradigm, that is a "big hard central problem", and progress on the smaller subproblems of the current paradigm is unlikely by-default to generalize to whatever new paradigm solves that big hard central problem.

> I agree that paradigm shifts can invalidate large amounts of prior work, but it isn't obvious whether this will occur in AI safety.

I claim it is extremely obvious and very overdetermined that this will occur in AI safety sometime between now and superintelligence. The question which you'd probably find more cruxy is not whether, but when - in particular, does it come before or after AI takes over most of the research?

... but (I claim) that shouldn't be the cruxy question, because we should not be imagining completely handing off the entire alignment-of-superintelligence problem to early transformative AI; that's a recipe for slop. We ourselves need to understand a lot about how things will generalize beyond the current paradigm, in order to recognize when that early transformative AI is itself producing research which will generalize beyond the current paradigm, in the process of figuring out how to align superintelligence. If an AI assistant produces alignment research which looks good to a human user, but won't generalize across the paradigm shifts between here and superintelligence, then that's a very plausible way for us to die.

> Most importantly, current proposed technical plans are necessary but not sufficient to stop this. Even if the technical side fully succeeds, no one knows what to do with that.

I don't think that's quite accurate. In particular, gradual disempowerment is exactly the sort of thing which corrigibility would solve. (At least for "corrigibility" in the sense David and I use the term, and probably the sense Yudkowsky uses, but not Christiano's sense; he uses the term to mean a very different thing.)

A general-purpose corrigible AI (in the sense we use the term) is pretty accurately thought of as an extension of the user. Building and using such an AI is much more like "uplifting" the user than like building an independent agent. It's the cognitive equivalent of gaining prosthetic legs, as opposed to having someone carry you around on a sedan chair. Another way to state it: a corrigible subsystem acts like it's a part of a larger agent, serving a particular purpose as a component of the larger agent, as opposed to acting like an agent in its own right.

... admittedly corrigibility is still very much in the "conceptual" stage, far from an actual technical plan. But it's at least a technical research direction which would pretty directly address the disempowerment problem.
