Ash Gray - LessWrong

A Quick List of Some Problems in AI Alignment As A Field

Excellent post. I have nothing really to add, only that you're not alone in this:

> Here's a (failure?) mode that I and others are already in, but might be too embarrassed to write about: taking weird career/financial risks, in order to obtain the financial security, to work on alignment full-time ^[2]. Anyone more risk-averse (good for alignment!) might just... work a normal job for years to save up, or modestly conclude they're not good enough to work in alignment altogether. If security mindset can be taught at all, this is a shit equilibrium.
> Yes, I know EA and the alignment community are both improving at noob-friendliness. I'm glad of this. I'd be more glad if I saw non-academic noob-friendly programs that pay people, with little legible evidence of their abilities, to upskill full-time. IQ or other tests are legal, certainly in a context like this. Work harder on screening for whatever's unteachable, and teaching what is.

I'm more on the "working on having more energy so I can spend more time learning even with a 9-5" side than taking risks, but same idea.

The inordinately slow spread of good AGI conversations in ML

Ash Gray2y32

I think your overall point -- More Dakka, make AGI less weird -- is right. In my experience, though, I disagree with your disagreement:

> I disagree with "the case for the risks hasn't been that clearly laid out". I think there's a giant, almost overwhelming pile of intro resources at this point, any one of which is more than sufficient, written in all manner of style, for all manner of audience.^[1]
> (I do think it's possible to create a much better intro resource than any that exist today, but 'we can do much better' is compatible with 'it's shocking that the existing material hasn't already finished the job'.)

The short version is that while there is a lot written about alignment, I haven't seen the core ideas organised into something clear enough to facilitate critically engaging with those ideas.

In my experience, there are two main issues:

1. Low discoverability of "good" introductory resources.
2. The existing (findable) resources are not very helpful if your goal is to get a clear understanding of the main argument in alignment -- that the default outcome of building AGI without explicitly making sure it's aligned is strongly negative.

For 1, I don't mean that "it's hard to find any introductory resources." I mean that it's hard to know what is worth engaging with. Because of the problems in 2, it's very time-consuming to try to get more than a surface-level understanding. This is an issue for me personally when the main purpose of this kind of exploration is to try and decide whether I want to invest more time and effort in the area.

For 2, there are many issues. The most common is that many resources are now quite old: are they still relevant? What is the state of the field now? Many are very long, include "lists of ideas" without attempting to organise them into a cohesive whole, cover only a single idea, or are too vague or informal to evaluate. The result is a general feeling of "Well, okay... But there were a lot of assumptions and handwaves in all that, and I'm not sure that none of them matter."

(If anyone is interested I can give feedback on specific articles in a reply -- omitted here for length. I've read a majority of the links in the [1] footnote.)

Some things I think would help this situation:

1. Maintain an up-to-date list of quality "intro to alignment" resources.
   - Note that this shouldn't be a catch-all for all intro resources. Being opinionated with what you include is a good thing as it helps newcomers judge relative importance.
2. Create a new introductory resource that doesn't include irrelevant distractions from the main argument.
   - What I'm talking about here are things like timelines, scenarios and likelihoods, policy arguments, questionable analogies (especially evolutionary), appeals to meta-reasoning and the like, that don't have any direct bearing on alignment itself and add distracting noncentral details that mainly serve to add potential points of disagreement.

     I think Arbital does this best, but I think it suffers from being organised as a comprehensive list of ideas as separate pages rather than a cohesive argument supported by specific direct evidence. I'm also not sure how current it is.
3. People who are highly engaged in alignment write "What I think is most important to know about alignment" (which could be specific evidence or general arguments).

Lastly, when you say

> If you're building a machine, you should have an at least somewhat lower burden of proof for more serious risks. It's your responsibility to check your own work to some degree, and not impose lots of micromorts on everyone else through negligence.^[2]
> But I don't think the latter point matters much, since the 'AGI is dangerous' argument easily meets higher burdens of proof as well.

Do you have some specific work in mind which provides this higher burden of proof?

Humans are very reliable agents

Ash Gray2y20

OK, thanks for linking that. You're probably right in the specific example of MNIST. I'm less convinced about more complicated tasks - it seems like each individual task would require a lot of engineering effort.

One thing I didn't see - is there research which looks at what happens if you give neural nets more of the input space as data? Things which are explicitly out-of-distribution, random noise, abstract shapes, or maybe other modes that you don't particularly care about performance on, and label it all as "garbage" or whatever. Essentially, providing negative as well as positive examples, given that the input spaces are usually much larger than the intended distribution.
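To make the question concrete, here is a minimal sketch of the data-side half of that idea: padding a training set with uniform-noise inputs that all carry one extra "garbage" label. The names `GARBAGE` and `add_garbage_examples` and the toy 4-pixel inputs are assumptions made up for illustration, not taken from any particular paper or library, and uniform noise is only one of the kinds of out-of-distribution input mentioned above.

```python
import random

# Hypothetical extra class index: 0-9 are the real digit classes, 10 is "garbage".
GARBAGE = 10

def add_garbage_examples(inputs, labels, n_garbage, seed=0):
    """Return copies of the dataset with n_garbage uniform-noise inputs appended.

    Every appended input is labeled GARBAGE. The original lists are not modified.
    """
    rng = random.Random(seed)
    dim = len(inputs[0])
    noise = [[rng.random() for _ in range(dim)] for _ in range(n_garbage)]
    return inputs + noise, labels + [GARBAGE] * n_garbage

# Toy stand-in for real data: two 4-pixel "images" labeled 3 and 7.
train_x = [[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
train_y = [3, 7]
aug_x, aug_y = add_garbage_examples(train_x, train_y, n_garbage=3)
```

A classifier trained on `aug_x`, `aug_y` would then need one more output than the original task. Whether that actually improves behaviour on inputs unlike both the real data and the chosen noise is the open question here.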

Humans are very reliable agents

Ash Gray2y90

>I imagine if our goal was "never misclassify an MNIST digit" we could get to 6-7 nines of "worst-case accuracy" even out of existing neural nets, at the cost of saying "I don't know" for the confusing 0.2% of digits.

Er, how? I haven't seen anyone describe a way to do this. Getting a neural network to meaningfully say "I don't know" is very much cutting-edge research as far as I'm aware.
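For reference, the most naive reading of "say I don't know" is a confidence threshold on the softmax output, sketched below. This is an illustrative baseline only: `predict_or_abstain` and the 0.99 threshold are assumptions, and nothing here shows that the softmax confidence is meaningful on confusing or adversarial inputs, which is the part that seems hard.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(logits, threshold=0.99):
    """Return the index of the top class, or None ("I don't know")
    when the top softmax probability is below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

confident = predict_or_abstain([10.0, 0.0, 0.0])  # one logit dominates
unsure = predict_or_abstain([1.0, 0.9, 0.0])      # top two classes are close
```

Here `confident` is class 0 and `unsure` is `None`. A network can still be confidently wrong, so thresholding alone does not give a worst-case guarantee.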

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

Ash Gray2y90

I think you and John are talking about two different facets of interpretability.

The first one is the question of "white-boxing:" how do the model's internal components interrelate to produce its output? On this dimension, the kind of models that you've given as examples are much more interpretable than neural networks.

What I think John is talking about, I understand as "grounding." (Cf. Symbol grounding problem) Although the decision tree (a) above is clear in that one can easily follow how the final decision comes about, the question remains -- who or what makes sure that the labels in the boxes correspond to features of the real world that we would also describe by those labels? So I think the claim is that on this dimension of interpretability, neural networks and logical/probabilistic models are more similar.

Specializing in Problems We Don't Understand

Ash Gray4y60

This is the focus of General Systems, as outlined by Weinberg. That book is very good, by the way - I highly recommend reading it. It's both very dense and very accessible.

It's always puzzled me that the rationalist community hasn't put more emphasis on general systems. It seems like it should fit in perfectly, but I haven't seen anyone mention it explicitly. General Semantics mentioned in the recent historical post is somewhat related, but not the same thing.

More on topic: One thing you don't mention is that there are fairly general problem solving techniques, which start before, and are relatively independent of, your level of specific technical knowledge. From what I've observed, most people are completely lost when approaching a new problem, because they don't even know where to start. So as well as your suggestion of focusing on learning the existence of techniques and when they apply, you can also directly focus on learning problem solving approaches.
