Whoops, mea culpa on that one! Deleted and changed to:
the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
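To make the triviality concrete, here is the standard construction (a quick sketch in my own notation, not quoted from either post): for any agent whatsoever, with policy π mapping histories to actions, define

```latex
U(h, a) \;=\;
\begin{cases}
1, & \text{if } a = \pi(h), \\
0, & \text{otherwise.}
\end{cases}
```

Then π maximizes expected U at every history, so "maximizes a utility function" by itself rules out no behavior; "goal-directed" is meant to pick out a strictly narrower, more predictive class.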
In reasoning about AGI, we're all aware of the problems with anthropomorphizing, but it occurs to me that there's also a cluster of bad reasoning that comes from an (almost?) opposite direction, where you visualize an AGI to be a mechanical automaton and draw naive conclusions based on that.
For instance, every now and then I've heard someone from this community say something like:
What if the AGI runs on the ZFC axioms (among other things), and finds a contradiction, and by the principle of explosion it goes completely haywire?
Even if ZFC is inconsistent...
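For readers who haven't seen it spelled out: the principle of explosion is just the fact that a single contradiction lets you derive any proposition Q whatsoever. A minimal propositional derivation, nothing ZFC-specific:

```latex
\begin{array}{lll}
1. & P \wedge \neg P & \text{assumed contradiction} \\
2. & P               & \wedge\text{-elimination from 1} \\
3. & \neg P          & \wedge\text{-elimination from 1} \\
4. & P \vee Q        & \vee\text{-introduction from 2} \\
5. & Q               & \text{disjunctive syllogism from 3 and 4}
\end{array}
```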
...[copying from my comment on the EA Forum x-post]
For reference, some other lists of AI safety problems that can be tackled by non-AI people:
Luke Muehlhauser's big (but somewhat old) list: "How to study superintelligence strategy"
AI Impacts has made several lists of research problems
Wei Dai's "Problems in AI Alignment that philosophers could potentially contribute to"
Kaj Sotala's case for the relevance of psychology/cog sci to AI safety (I would add that Ought is currently testing the feasibility of IDA/Debate by doing psy...
*begins drafting longer proposal*
Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there's potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it).
In any case, I appreciate the feedback, Mr. Entworth.
(8)
In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presu...
I think this would be an extremely useful exercise for multiple independent reasons:
(7)
A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.
(6)
An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential...
(5)
A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]
(4)
A post discussing my confusions about Goodhart’s law and Garrabrant’s taxonomy of it. I find myself not completely satisfied with the taxonomy:
1) “adversarial” seems too broad to be that useful as a category
2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad
3) Whereas “regressional” and “extrema...
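As an aside on 2): the "regressional" effect is just selection on a noisy proxy, with no adversary and nothing extremal involved, which is part of why any definition of "Goodhart" that includes it feels so broad to me. A toy simulation (my own illustration, not from the taxonomy post):

```python
import random

random.seed(0)

# Regressional Goodhart toy model: proxy = true value + independent noise.
n = 100_000
true_vals = [random.gauss(0, 1) for _ in range(n)]
proxies = [v + random.gauss(0, 1) for v in true_vals]

# Select the top 1% of points by the proxy.
top = sorted(range(n), key=lambda i: proxies[i], reverse=True)[: n // 100]

avg_proxy = sum(proxies[i] for i in top) / len(top)
avg_true = sum(true_vals[i] for i in top) / len(top)

# The selected points' proxy scores systematically overstate their true
# values, purely because selection favors points with lucky noise.
print(f"mean proxy among selected: {avg_proxy:.2f}")
print(f"mean true value among selected: {avg_true:.2f}")
```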
(3)
“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart...
(2)
[I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context of how, even if an AGI makes the right decision, we care “why”...
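To gesture at the mechanics, a deliberately crude sketch (entirely my own toy example, not a proposal): agents that receive each other's source text and condition on a syntactic feature of it. Matching on source is of course the shallowest possible version of judging an agent by "why" it acts; the wide-open-source setting would ask for something much stronger.

```python
import inspect

def defect_bot(opponent_source: str) -> str:
    # Ignores the opponent's source entirely and always defects.
    return "D"

def clique_bot(opponent_source: str) -> str:
    # Cooperates only with agents whose source contains this same marker,
    # a crude, purely syntactic stand-in for reading off an agent's reasons.
    # MARKER: clique-v1
    return "C" if "MARKER: clique-v1" in opponent_source else "D"

def play(agent_a, agent_b):
    # Each agent sees the other's source code, not just its action.
    src_a, src_b = inspect.getsource(agent_a), inspect.getsource(agent_b)
    return agent_a(src_b), agent_b(src_a)

# (Run as a script so inspect.getsource can find the definitions.)
print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): no exploitation
```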
(1)
A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).
Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful.
By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.
I have a bunch of half-baked ideas, most of which are mediocre in expectation and probably not worth investing my time and others’ attention writing up. Some of them probably are decent, but I’m not sure which ones, and the user base is probably as good as any for feedback.
So I’m just going to post them all as replies to this comment. Upvote if they seem promising, downvote if not. Comments encouraged. I reserve the “right” to maintain my inside view, but I wouldn’t make this poll if I didn’t put substantial weight on this community’s opinions.
This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure causally upstream of X?
Neat example! But for my part, I'm confused about this last sentence, even after reading the footnote:
An example of such "interesting physical struc...
For reference, LeCun discussed his atheoretic/experimentalist views in more depth in this FB debate with Ali Rahimi and also this lecture. But maybe we should distinguish some distinct axes of the experimentalist/theorist divide in DL:
1) Experimentalism/theorism is a more appropriate paradigm for thinking about AI safety
2) Experimentalism/theorism is a more appropriate paradigm for making progress in AI capabilities
Where the LeCun/Russell debate is about (1) and LeCun/Rahimi is about (2). And maybe this is oversimplifying things, since "theorism"...
That all seems pretty fair.
If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.
That's why I distinguished between the hypotheses of "human utility" and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the "extrapolation" step less important or expecting that it will take care of itself, while others consider extrapolation an important part...
To be clear, I unendorsed the idea about a minute after posting because it felt like more of a low-effort shitpost than a constructive idea for understanding the world (and I don't want to make that a norm on shortform). That said, I had in mind that you're describing the thing to someone you can't communicate with beforehand, except there's common knowledge that you're forbidden any nouns besides "cake". In practice I feel like it degenerates to putting all the meaning on adjectives to construct the nouns you'd wa...
K-complexity: The minimum description length of something (relative to some fixed description language)
Cake-complexity: The minimum description length of something, where the only noun you can use is "cake"
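For concreteness, the usual definition behind the first line, relative to a fixed universal machine U (the standard textbook statement, nothing novel):

```latex
K_U(x) \;=\; \min \{\, \lvert p \rvert \;:\; U(p) = x \,\}
```

i.e. the length of the shortest program p that makes U output x; the cake version just shrinks the description language's stock of nouns to one.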
I often hear about deepfakes--pictures/videos that can be entirely synthesized by a deep learning model and made to look real--and how this could greatly amplify the "fake news" phenomenon and really undermine the ability of the public to actually evaluate evidence.
And this sounds like a well-founded worry, but then I was just thinking: what about Photoshop? That's existed for decades, and for all that time it's been possible to doctor images to look real. So why should deepfakes be any scarier?
Part of it could be that we can fak...
I'm not asking about the Fermi paradox, and it's unclear to me how that's related. I'm wondering why we think general (i.e. human-level) intelligence is possible in our universe, if we're not allowed to invoke anthropic evidence. For instance, here are some possible ways one can answer my question [rot13'd to avoid spoiling people's answers]:
1. Nethr gung aba-cevzngr navzny vagryyvtrapr nyernql trgf hf "zbfg bs gur jnl gurer", naq tvira gur nccebcevngr raivebazrag, vg fubhyq or cbffvoyr va cevapvcyr sbe n fhpprffvb...
Huh, that's a good point. Whereas it seems probably inevitable that AI research would've eventually converged on something similar to the current D(R)L paradigm, we can imagine a lot of different ways AI safety could have looked instead right now. Which makes sense, since the latter is still young and in a kind of pre-paradigmatic philosophical stage, with little unambiguous feedback to dictate how things should unfold (and it's far from clear when substantially more of this feedback will show up).
I can imagine an alternate timeline whe...
Yes, perhaps I should've been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase "distance functions are hard" is too simplistic. What I meant to say is more like
Fully-specified distance functions are hard, over and above the difficulty of formally specifying most things, and it's often hard to notice this difficulty
This is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce...
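A tiny illustration of the "hard to notice this difficulty" part (my own example, not from the original discussion): pixel-wise L2 distance looks like a perfectly reasonable formalization of image similarity, yet it rates a one-pixel translation as far more "different" than visibly added noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random grayscale "image", a copy shifted right by one pixel, and a
# copy with small additive noise.
img = rng.random((64, 64))
shifted = np.roll(img, shift=1, axis=1)
noisy = img + rng.normal(0, 0.05, size=img.shape)

def l2(a, b):
    return float(np.linalg.norm(a - b))

# Pixel-wise L2 calls the translated copy (same content, barely moved)
# much more "different" than the noisy copy, the opposite of the intuitive
# similarity notion the metric was supposed to formalize.
print(f"L2(img, shifted): {l2(img, shifted):.2f}")
print(f"L2(img, noisy):   {l2(img, noisy):.2f}")
```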
How do CFAR's research interests/priorities compare with LW's Open Problems in Human Rationality? Based on Brienne and Anna's replies here, I suspect the answer is "they're pretty different", but I'd like to hear what accounts for this divergence.
I quite like the open questions that Wei Dai wrote there, and I expect I'd find progress on those problems to be helpful for what I'm trying to do with CFAR. If I had to outline the problem we're solving from scratch, though, I might say:
Wei Dai’s open problems feel pretty relevant to...