All of Grue_Slinky's Comments + Replies

How do CFAR's research interests/priorities compare with LW's Open Problems in Human Rationality? Based on Brienne and Anna's replies here, I suspect the answer is "they're pretty different", but I'd like to hear what accounts for this divergence.

I quite like the open questions that Wei Dai wrote there, and I expect I'd find progress on those problems to be helpful for what I'm trying to do with CFAR. If I had to outline the problem we're solving from scratch, though, I might say:

  • Figure out how to:
    • use reason (and stay focused on the important problems, and remember “virtue of the void” and “lens that sees its own flaws”, and be quick where you can) without
    • going nutso, or losing humane values, and while:
    • being able to coordinate well in teams.

Wei Dai’s open problems feel pretty relevant to

... (read more)
6Ben Pace
If Brienne wanted to give their own answer to that post, even if it was incomplete, I'd be very excited about that.

Nitpick: "transfer learning" is the standard term, no? It has a Wiki page and seems to get a more coherent batch of search results than googling "robustness to data shift".

3Rohin Shah
It goes under many names, such as transfer learning, robustness to distributional shift / data shift, and out-of-distribution generalization. Each one has (to me) slightly different connotations, e.g. transfer learning suggests that the researcher has a clear idea of the distinction between the first and second setting (and so you "transfer" from the first to the second), whereas if in RL you change which part of the state space you're in as you act, I would be more likely to call that distributional shift rather than transfer learning.
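To make the distributional-shift side of this distinction concrete, here is a minimal sketch (illustrative setup, scikit-learn assumed available): the labeling rule never changes, only the input distribution does, yet a model that is accurate in-distribution degrades badly on shifted inputs because its learned decision surface extrapolates poorly.

```python
# Toy distributional shift: same underlying rule, shifted input distribution.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_data(n, mean):
    X = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    y = (X[:, 0] > X[:, 1]).astype(int)  # the "true" rule, identical everywhere
    return X, y

X_tr, y_tr = make_data(2000, mean=0.0)  # training distribution
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("in-distribution accuracy:", clf.score(*make_data(2000, mean=0.0)))   # high
print("shifted-input accuracy:  ", clf.score(*make_data(2000, mean=10.0)))  # near chance
```

Whether one files this under "transfer learning" or "distributional shift" mostly depends, as above, on whether the second distribution was identified and targeted in advance.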

Whoops, mea culpa on that one! Deleted and changed to:

the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
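For concreteness, one common way the triviality observation gets made precise (a paraphrase of the usual argument, not necessarily the linked post's exact construction): for any policy \(\pi\), define

\[
U_{\pi}(h) \;=\;
\begin{cases}
1 & \text{if every action in the history } h \text{ agrees with } \pi,\\
0 & \text{otherwise.}
\end{cases}
\]

Then \(\pi\) maximizes \(U_{\pi}\), so literally any behavior is a "utility maximizer" for some utility function; "goal-directed", by contrast, is meant to carve out a strictly smaller and more predictive class.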

In reasoning about AGI, we're all aware of the problems with anthropomorphizing, but it occurs to me that there's also a cluster of bad reasoning that comes from an (almost?) opposite direction, where you visualize an AGI to be a mechanical automaton and draw naive conclusions based on that.

For instance, every now and then I've heard someone from this community say something like:

What if the AGI runs on the ZFC axioms (among other things), and finds a contradiction, and by the principle of explosion it goes completely haywire?

Even if ZFC is inconsisten

... (read more)
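For readers who haven't seen it spelled out, the principle of explosion invoked in the quoted worry is just the fact that a contradiction entails every proposition. A minimal Lean sketch (nothing beyond the core library assumed):

```lean
-- Ex falso quodlibet: from P together with ¬P, any Q whatsoever follows.
example (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  absurd hp hnp
```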

[copying from my comment on the EA Forum x-post]

For reference, some other lists of AI safety problems that can be tackled by non-AI people:

Luke Muehlhauser's big (but somewhat old) list: "How to study superintelligence strategy"

AI Impacts has made several lists of research problems

Wei Dai's, "Problems in AI Alignment that philosophers could potentially contribute to"

Kaj Sotala's case for the relevance of psychology/cog sci to AI safety (I would add that Ought is currently testing the feasibility of IDA/Debate by doing psy... (read more)

1evhub
Also relevant is Geoffrey Irving and Amanda Askell's "AI Safety Needs Social Scientists."

*begins drafting longer proposal*

Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there's potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it).

In any case, I appreciate the feedback, Mr. Entworth.

4johnswentworth
Oh no, not you too. It was bad enough with just Bena.

(8)

In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presu... (read more)

I think this would be an extremely useful exercise for multiple independent reasons:

  • it's directly attempting to teach skills which I do not currently know any reproducible way to teach/learn
  • it involves looking at how breakthroughs happened historically, which is an independently useful meta-strategy
  • it directly involves investigating the intuitions behind foundational ideas relevant to the theory of agency, and could easily expose alternative views/interpretations which are more useful (in some contexts) than the usual presentations

(7)

A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.

(6)

An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses”... (read more)

5Rohin Shah
Re: easy-to-measure vs. hard-to-measure axis: That seems like the most obvious axis on which AI is likely to be different from humans, and it clearly does lead to bad outcomes?

(5)

A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]

(4)

A post discussing my confusions about Goodhart and Garrabrant’s taxonomy of it. I find myself not completely satisfied with it:

1) “adversarial” seems too broad to be that useful as a category

2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad

3) Whereas “regressional” and “extrema... (read more)

(3)

“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart... (read more)

4Rohin Shah
Concerns about mesa-optimizers are mostly concerns that "capabilities" will be robust to distributional shift while "objectives" will not be robust.

(2)

[I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context of how, even if an AGI makes the right decision, we care “why” it did so... (read more)

(1)

A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).
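To make the "Dutch books" entry concrete with the standard textbook example (not specific to AGI): suppose an agent's stated beliefs are incoherent, e.g. it values a ticket paying \$1 if \(A\) at \$0.60 and a ticket paying \$1 if \(\neg A\) also at \$0.60. An adversary who sells it both tickets collects

\[
0.60 + 0.60 \;=\; 1.20 \;>\; 1.00,
\]

while paying out exactly \$1 however \(A\) resolves, for a guaranteed \$0.20 profit. The relevance here is that a sufficiently transparent agent advertises exactly which incoherences of this kind an adversary can construct bets against.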

Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful.

By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.

I have a bunch of half-baked ideas, most of which are mediocre in expectation and probably not worth investing my time and other’s attention writing up. Some of them probably are decent, but I’m not sure which ones, and the user base is probably as good as any for feedback.

So I’m just going to post them all as replies to this comment. Upvote if they seem promising, downvote if not. Comments encouraged. I reserve the “right” to maintain my inside view, but I wouldn’t make this poll if I didn’t put substantial weight on this community’s opinions.

(8)

In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presu... (read more)

2Grue_Slinky
(7) A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.
6Grue_Slinky
(6) An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses” that ML could precipitate. Which, argh, sounds like an exhausting task, but someone should do it?

(5)

A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]

6Grue_Slinky
(4) A post discussing my confusions about Goodhart and Garrabrant’s taxonomy of it. I find myself not completely satisfied with it: 1) “adversarial” seems too broad to be that useful as a category 2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad 3) Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection) But I’m also not sure how I’d reclassify it and that task seems hard. Which partially updates me in favor of the Taxonomy being good, but at the very least I feel there’s more to say about it.
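As a toy illustration of the "regressional" category discussed above (illustrative numbers only): if a proxy is the true value plus independent noise, then selecting hard on the proxy yields outcomes whose true value systematically falls short of the proxy, purely by regression to the mean.

```python
# Regressional Goodhart in one simulation: proxy = true value + noise.
import numpy as np

rng = np.random.default_rng(0)
true_value = rng.normal(size=1_000_000)
proxy = true_value + rng.normal(size=1_000_000)   # the optimizer only sees this

selected = proxy > np.quantile(proxy, 0.999)      # select hard on the proxy
print("mean proxy among selected:     ", proxy[selected].mean())       # roughly 4.8 here
print("mean true value among selected:", true_value[selected].mean())  # about half of that
```

Whether this regression-to-the-mean effect deserves the name "Goodhart" at all is precisely the question raised in point 2 of the comment above.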
2Grue_Slinky
(3) “When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart” intuition applies and when it breaks.
1Grue_Slinky
(2) [I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context of how, even if an AGI makes the right decision, we care “why” it did so (i.e. because it’s optimizing for what we want vs. optimizing for human approval for instrumental reasons). I doubt we’ll formalize this “why” anytime soon (see e.g. section 5 of this), but I think semi-formal things can be said about it upon some effort. [I thought of this independently from (1), but I think every level of the “transparency hierarchy” could have its own kind of game theory, much like the “open-source” level clearly does]
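For contrast with the "wide" version proposed above, here is a minimal sketch of ordinary open-source game theory, where agents see each other's source code but only match on its literal text (the classic "CliqueBot" idea; names and setup are illustrative, not taken from any particular paper):

```python
# Open-source prisoner's dilemma sketch: each agent is handed the other's
# source code before choosing "C" (cooperate) or "D" (defect).
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate only with programs whose source text matches my own."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"

def play(agent_a, agent_b):
    a_src, b_src = inspect.getsource(agent_a), inspect.getsource(agent_b)
    return agent_a(b_src), agent_b(a_src)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D')
```

The "wide-open-source" question is what changes when the opponent can do something far stronger than this literal source comparison, namely evaluate the reasons behind the output rather than the text that produced it.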
1Grue_Slinky
(1) A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).
2Grue_Slinky
Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful. By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.
This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure causally upstream of X?

Neat example! But for my part, I'm confused about this last sentence, even after reading the footnote:

An example of such "interesting physical struc
... (read more)
1VojtaKovarik
First off, while I feel somewhat de-confused about X-like behavior, I don't feel very confident about X-like architectures. Maybe the meaning is somewhat clear on higher levels of abstraction (e.g., if my brain goes "realize I want to describe a concept --> visualize several explanations and judge each for suitability --> pick the one that seems the best --> send a signal to start typing it down", then this would be a kind of search/optimization-thingy). But on the level of physics, I don't really know what an architecture means. So take this with a grain of salt. Maybe the term "physical structure" is misleading. The thing I was trying to point at is the distinction between being able to accurately model Y using model X, and Y actually being X. In the sense that there might be a giant look-up table (GLUT) that accurately predicts your behavior, but on no level of abstraction is it correct to say that you actually are a GLUT. Whereas modelling you as having some goals, planning, etc. might be less accurate but somewhat more, hm, true. I realize this isn't very precise, but I guess you can see what I mean. That being said, I suppose that what I meant by "optimization architecture" is, for example, a stochastic gradient descent with the emphasis on "this is the input", "this is the part of the algorithm that does the calculation", and "this is the output". An "implementation of an optimization architecture" would be...well, the atoms of your computer that perform SGD, or maybe some simple bacteria that moves in the direction where the concentration of whatever-it-likes is the highest (not that anything I know would implement precisely SGD, but still). Ad "interesting physical structure" behind the ant-colony: If by "evolution" we mean the atoms that the world is made of, as they changed over time until your ant colony emerged...then yeah, this is a physical structure causally upstream of the ant colony, and one that is responsible for the ant colony behaving the wa
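A minimal sketch of the "input / part that does the calculation / output" framing from the reply above, using plain (deterministic) gradient descent rather than SGD proper, purely for illustration:

```python
# Gradient descent on a 1-D quadratic, with the three architectural pieces
# the reply above gestures at labeled explicitly.
def gd_minimize(grad, x0, lr=0.1, steps=100):
    x = x0                      # input: starting point (plus the gradient oracle)
    for _ in range(steps):      # the part of the algorithm that does the calculation
        x = x - lr * grad(x)
    return x                    # output: (approximate) minimizer

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gd_minimize(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```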

For reference, LeCun discussed his atheoretic/experimentalist views in more depth in this FB debate with Ali Rahimi and also this lecture. But maybe we should distinguish some distinct axes of the experimentalist/theorist divide in DL:

1) Experimentalism/theorism is a more appropriate paradigm for thinking about AI safety

2) Experimentalism/theorism is a more appropriate paradigm for making progress in AI capabilities

Where the LeCun/Russell debate is about (1) and LeCun/Rahimi is about (2). And maybe this is oversimplifying things, since "theorism"... (read more)

8orthonormal
Good comment. I disagree with this bit: And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.

That all seems pretty fair.

If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.

That's why I distinguished between the hypotheses of "human utility" and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the "extrapolation" less important or that it will take care of itself, while others consider extrapolation an important part... (read more)

2abramdemski
I didn't reply to this originally, probably because I think it's all pretty reasonable. My thinking on this is pretty open. In some sense, everything is extrapolation (you don't exactly "currently" have preferences, because every process is expressed through time...). But OTOH there may be a strong argument for doing as little extrapolation as possible. Well, imitating you is not quite right. (EG, the now-classic example introduced with the CIRL framework: you want the AI to help you make coffee, not learn to drink coffee itself.) Of course maybe it is imitating you at some level in its decision-making, like, imitating your way of judging what's good. I'm thinking things like: will it disobey requests which it understands and is capable of? Will it fight you? Not to say that those things are universally wrong to do, but they could be types of alignment we're shooting for, and inconsistencies do seem to create trouble there. Presumably if we know that it might fight us, we would want to have some kind of firm statement about what kind of "better" reasoning would make it do so (e.g., it might temporarily fight us if we were severely deluded in some way, but we want pretty high standards for that).
2Davidmanheim
I've been thinking about whether you can have AGI that only aims for pareto-improvements, or a weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler's thoughts on what he has called "pareto-topia". (I haven't gotten anywhere thinking about this because I'm spending my time on other things.)

To be clear I unendorsed the idea about a minute after posting because it felt like more of a low-effort shitpost than a constructive idea for understanding the world (and I don't want to make that a norm on shortform). That said I had in mind that you're describing the thing to someone who you can't communicate with beforehand, except there's common knowledge that you're forbidden any nouns besides "cake". In practice I feel like it degenerates to putting all the meaning on adjectives to construct the nouns you'd wa... (read more)

K-complexity: The minimum description length of something (relative to some fixed description language)

Cake-complexity: The minimum description length of something, where the only noun you can use is "cake"
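As a formula, relative to a fixed description language (say a universal machine \(U\)), the first definition reads

\[
K_U(x) \;=\; \min\{\, |p| : U(p) = x \,\},
\]

and cake-complexity is the same minimization taken over the restricted language in which "cake" is the only noun available.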

[This comment is no longer endorsed by its author]
1Tetraspace
Are we allowed to I-am-Groot the word "cake" to encode several bits per word, or do we have to do something like repeat "cake" until the primes that it factors into represent a desired binary string? (edit: ah, only nouns, so I can still use whatever I want in the other parts of speech. or should I say that the naming cakes must be "cake", and that any other verbal cake may be whatever this speaking cake wants)

I often hear about deepfakes--pictures/videos that can be entirely synthesized by a deep learning model and made to look real--and how this could greatly amplify the "fake news" phenomenon and really undermine the ability of the public to actually evaluate evidence.

And this sounds like a well-founded worry, but then I was just thinking, what about Photoshop? That's existed for over a decade, and for all that time it's been possible to doctor images to look real. So why should deepfakes be any scarier?

Part of it could be that we can fak... (read more)

2Pattern
There have been a handful of news stories about things like a company being scammed out of a lot of money (the voice of the CEO was faked over the phone). This is a different issue than "the public" being fooled.

Is this open thread not going to be a monthly thing?

FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?

4Rohin Shah
I expected that it would be better for me to polish ideas before posting on the forum, and treated this as an experiment to check. I think it broadly confirmed my original view, so I'm not very likely to post top-level comments on open threads in the future, and I told the admins so. I don't know what their decision process was after that. (Possibly they expected that future open threads would be much quieter, since the two biggest comment threads here were both started by my top-level comments.)

I'm not asking about the Fermi paradox, and it's unclear to me how that's related. I'm wondering why we think general (i.e. human-level) intelligence is possible in our universe, if we're not allowed to invoke anthropic evidence. For instance, here are some possible ways one can answer my question [rot13'd to avoid spoiling people's answers]:

1. Nethr gung aba-cevzngr navzny vagryyvtrapr nyernql trgf hf "zbfg bs gur jnl gurer", naq tvira gur nccebcevngr raivebazrag, vg fubhyq or cbffvoyr va cevapvcyr sbe n fhpprffvb... (read more)

Huh, that's a good point. Whereas it seems probably inevitable that AI research would've eventually converged on something similar to the current D(R)L paradigm, we can imagine a lot of different ways AI safety could have looked like instead right now. Which makes sense, since the latter is still young and in a kind of pre-paradigmatic philosophical stage, with little unambiguous feedback to dictate how things should unfold (and it's far from clear when substantially more of this feedback will show up).

I can imagine an alternate timeline whe... (read more)

Yes, perhaps I should've been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase "distance functions are hard" is too simplistic. What I meant to say is more like

Fully-specified distance functions are hard, over and above the difficulty of formally specifying most things, and it's often hard to notice this difficulty

This is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce... (read more)

4John_Maxwell
Well, a classifier that is 100% accurate would also do the job ;) (I'm not sure a 100% accurate classifier is feasible per se, but a classifier which can be made arbitrarily accurate given enough data/compute/life-long learning experience seems potentially feasible.) Also, small perturbations aren't necessarily the only way to construct adversarial examples. Suppose I want to attack a model M1, which I have access to, and I also have a more accurate model M2. Then I could execute an automated search for cases where M1 and M2 disagree. (Maybe I use gradient descent on the input space, maximizing an objective function corresponding to the level of disagreement between M1 and M2.) Then I hire people on Mechanical Turk to look through the disagreements and flag the ones where M1 is wrong. (Since M2 is more accurate, M1 will "usually" be wrong.) This is actually one way to look at what's going on with traditional small perturbation adversarial examples. M1 is a deep learning model and M2 is a 1-nearest-neighbor model--not very good in general, but quite accurate in the immediate region of data points with known labels. The problem is that deep learning models don't have a very strong inductive bias towards mapping nearby inputs to nearby outputs (sometimes called "Lipschitzness"). L2 regularization actually makes deep learning models more Lipschitz because smaller coefficients=smaller eigenvalues for weight matrices=less capacity to stretch nearby inputs away from each other in output space. I think maybe that's part of why L2 regularization works. Hoping to expand the previous two paragraphs into a paper with Matthew Barnett before too long--if anyone wants to help us get it published, please send me a PM (neither of us has ever published a paper before).
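A minimal sketch of the disagreement-search idea described above (hypothetical models and hyperparameters; KL divergence is just one possible choice of "level of disagreement" objective):

```python
# Search for inputs where two models disagree, by gradient ascent on the
# input space. m1 and m2 are assumed to map a batch of inputs to
# class-probability tensors; x0 is a starting input such as a real data point.
import torch

def find_disagreement(m1, m2, x0, steps=200, lr=0.01):
    x = x0.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        p1, p2 = m1(x), m2(x)
        # KL(p1 || p2) as the disagreement measure; maximize it by minimizing its negation.
        disagreement = (p1 * (p1.clamp_min(1e-9).log() - p2.clamp_min(1e-9).log())).sum()
        opt.zero_grad()
        (-disagreement).backward()
        opt.step()
    return x.detach()  # a candidate input on which m1 and m2 disagree
```

The resulting candidates would then go to human labelers to flag the cases where M1 is actually wrong, as in the Mechanical Turk step above.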