FWIW, I found the Strawberry Appendix especially helpful for understanding how this approach to ELK could solve (some form of) outer alignment.
Other readers, consider looking at the appendix even if you don't feel like you fully understand the main body of the post!
Nice post! I see where you're coming from here.
(ETA: I think what I'm saying here is basically "3.5.3 and 3.5.4 seem to me like they deserve more consideration, at least as backup plans -- I think they're less crazy than you make them sound." So I don't think you missed these strategies, just that maybe we disagree about how crazy they look.)
I haven't thought this through all the way yet, and don't necessarily endorse these strategies without more thought, but:
It seems like there could be a category of strategies for players with "good" AGIs to prepa...
Thanks for the post, I found it helpful! The "competent catastrophes" direction sounds particularly interesting.
This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.
It didn't bug me ¯\_(ツ)_/¯
Thanks for the post! FWIW, I found this quote particularly useful:
Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!
The fact that it showed up right before an eye-catching image probably helped :)
This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.
Thanks for the writeup! This google doc (linked near "raised this general problem" above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit
This seems like a useful lens -- thanks for taking the time to post it!
I do agree. I think the main reason to stick with "robustness" or "reliability" is that that's how the problems of "my model doesn't generalize well / is subject to adversarial examples / didn't really hit the training target outside the training data" are referred to in ML, and it gives a bad impression when people rename problems. I'm definitely most in favor of giving a new name like "hitting the target" if we think the problem we care about is different in a substantial way (which could definitely happen going forward!)
OK -- if it looks like the delay will be super long, we can certainly ask him how he'd be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that's been shared for comments, and I think it makes sense to stick to them.
I agree re: terminology, but probably further discussion of unpublished docs should just wait until they're published.
Thanks for writing this, Will! I think it's a good + clear explanation, and "high/low-bandwidth oversight" seems like a useful pair of labels.
I've recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):
1. "target": can we define what we mean by "good behavior" in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be...
I really like this post, and am very glad to see it! Nice work.
I'll pay whatever cost I need to for violating norms against low-content comments in order to say this -- an upvote didn't seem like enough.
Thanks for writing this -- I think it's a helpful kind of reflection for people to do!
Ah, gotcha. I'll think about those points -- I don't have a good response. (Actually adding "think about"+(link to this discussion) to my todo list.)
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
These objections are all reasonable, and 3 is especially interesting to me -- it seems like the biggest objection to the structure of the argument I gave. Thanks.
I'm afraid that the point I was trying to make didn't come across, or that I'm not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul's are not amenable to any kind of argument for confidence, and we will only ever be able to say "well, I ran out of ideas for how to break it", so I wanted to sketch an argument structure to exp...
..."naturally occurring" means "could be inputs to this AI system from the rest of the world"; naturally occurring inputs don't need to be recognized, they're here as a base case for the induction. Does that make sense?
If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren't, I'd guess that possible input single pages of text aren't value-corrupting in an hour. (I would certainly want a much better answer than "I guess it's f...
My comment, for the record:
I'm glad to see people critiquing Paul's work -- it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of "benign", I don't think humans are benign, so I'm not going to argue with that. Instead, I'll say what I think about building aligned AIs out of simulated human judgement.
I agree with you that listing and solving problems with such systems until we can't think of more problems is unsatisfying, and that we should have positive argumen...
I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?
If I read Paul's post correctly, ALBA is supposed to do this in theory -- I don't understand the theory/practice distinction you're making.
I'm not sure you've gotten quite ALBA right here, and I think that causes a problem for your objection. Relevant writeups: most recent and original ALBA.
As I understand it, ALBA proposes the following process: ...
FWIW, this also reminded me of some discussion in Paul's post on capability amplification, where Paul asks whether we can even define good behavior in some parts of capability-space, e.g.:
The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part H? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.
I'm not sure if that's relevant to your point, but it seemed like you might be interested.
Discussed briefly in Concrete Problems, FYI: https://arxiv.org/pdf/1606.06565.pdf
This is a neat idea! I'd be interested to hear why you don't think it's satisfying from a safety point of view, if you have thoughts on that.
Thanks for writing this, Jessica -- I expect to find it helpful when I read it more carefully!
Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / Vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of "figuring out how to think about big / self-like hypotheses". Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?
Thanks Jessica. This was helpful, and I think I see more what the problem is.
Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis including a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It's clearly not possible for the system to hold and update infinitely many hypotheses the way Solomono...
Thanks, Paul -- I missed this response earlier, and I think you've pointed out some of the major disagreements here.
I agree that there's something somewhat consequentialist going on during all kinds of complex computation. I'm skeptical that we need better decision theory to do this reliably -- are there reasons or intuition-pumps you know of that have a bearing on this?
Thanks Jessica, I think we're on similar pages -- I'm also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
Thanks Jessica -- sorry I misunderstood about hijacking. A couple of questions:
Is there a difference between "safe" and "accurate" predictors? I'm now thinking that you're worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
My feeling is that today's understanding of planning -- if I run this computation, I will get the result, and if I run it again, I'll get the same one -- is sufficient for harder prediction tasks. Are there particular aspects of planni...
I agree with paragraphs 1, 2, and 3. To recap, the question we're discussing is "do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?"
A couple of notes on paragraph 4:
Thanks, Jessica. This argument still doesn't seem right to me -- let me try to explain why.
It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the w...
..."Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place."
I've had this conversation with Nate before, and I don't understand why I should think it's true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?
Very thoughtful post! I was so impressed that I clicked the username to see who it was, only to see the link to your LessWrong profile :)
Just wanted to mention that watching this panel was one of the things that convinced me to give AI safety research a try :) Thanks for re-posting, it's a good memory.
To at least try to address your question: one effect could be that there are coordination problems, where many people would be trying to "change the world" in roughly the same direction if they knew that other people would cooperate and work with them. This would result in less of the attention drain you suggest. This seems more like what I've experienced.
I'm more worried about people being stupid than mean, but that could be an effect of the bubble of non-mean people I'm part of.
Cool, thanks; sounds like I have about the same picture. One missing ingredient for me that was resolved by your answer, and by going back and looking at the papers again, was the distinction between consistency and soundness (on the natural numbers), which is not a distinction I think about often.
In case it's useful, I'll note that the procrastination paradox is hard for me to take seriously on an intuitive level, because some part of me thinks that requiring correct answers in infinite decision problems is unreasonable; so many reasoning systems fail on...
I don't (confidently) understand why the procrastination paradox indicates a problem to be solved. Could you clarify that for me, or point me to a clarification?
First off, it doesn't seem like this kind of infinite buck-passing could happen in real life; is there a real-life (finite?) setting where this type of procrastination leads to bad actions? Second, it seems to me that similar paradoxes often come up in other situations where agents have infinite time horizons and can wait as long as they want -- does the problem come from the infinity, or from some...
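To check that I'm picturing the right setup, here's the toy version of the buck-passing story I have in mind -- my own sketch with made-up numbers (the utility-1-if-the-button-is-ever-pressed payoff and the small pressing cost `EPS` are my assumptions, not anyone's formal model):

```python
# Toy sketch of the procrastination paradox as I understand it: the agent gets
# utility 1 if the button is ever pressed, and pays a tiny cost EPS if it
# presses the button itself today.

EPS = 0.01      # assumed small cost of pressing the button yourself today
HORIZON = 20    # we can only simulate finitely many steps, of course

def successor_will_press() -> bool:
    # Each step's agent trusts that its successor reasons correctly, and so
    # concludes the button will still get pressed eventually.
    return True

pressed = False
for t in range(HORIZON):
    u_press_now = 1 - EPS                        # press today: utility 1 minus the cost
    u_wait = 1 if successor_will_press() else 0  # wait: full utility 1, by the agent's own lights
    if u_press_now > u_wait:
        pressed = True
        break

print("button pressed?", pressed)  # False -- every step defers, so it never happens
```

Each step's trust in its successor looks locally fine, but jointly the steps guarantee the button is never pressed -- that's the part whose practical import I'm trying to understand.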
Is this sort of a way to get an agent with a DT that admits acausal trade (as we think the correct decision theory would) to act more like a CDT agent? I wonder how different the behaviors of the agent you specify are from those of a CDT agent -- in what kinds of situations would they come apart? When does "I only value what happens given that I exist" (roughly) differ from "I only value what I directly cause" (roughly)?
I would encourage you to apply -- these ideas seem reasonable!
As far as choosing, I would advise you to choose the idea for which you can make the case most strongly that it is Topical and Impactful, as defined here.
It seems that, if desired, the overseer could also set their behaviour and intentions so that the approval-directed agent acts as we would want an oracle or tool to act. This is a nice feature.
I think Nick Bostrom and Stuart Armstrong would also be interested in this, and might have good feedback for you.
High-level feedback: this is a really interesting proposal, and looks like a promising direction to me! Most of my inline comments on Medium are more critical, but that doesn't reflect my overall assessment.
That's what I thought at first, too, but then I looked at the paper, and their figure looks right to me. Could you check my reasoning here?
On p.11 of Vincent's and Nick's survey, there's a graph "Proportion of experts with 10%/50%/90% confidence of HLMI by that date". At around the 1 in 10 mark of proportion of experts -- the horizontal line from 0.1 -- the graph shows that 1 in 10 experts thought there was a 50% chance of HLAI by 2020 or so (the square-boxes line), and 1 in 10 thought there was a 90% chance of HLAI by 2030 or so (the triangl...
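In case it helps to check my reading, here's a toy version of how I understand those curves are constructed (the years below are made up for illustration, not the survey data):

```python
# Hypothetical 50%-confidence years, one per expert (not the real responses).
fifty_pct_years = [2018, 2020, 2025, 2030, 2040, 2050, 2060, 2075, 2100, 2150]

def proportion_with_50pct_by(year, stated_years):
    # Fraction of experts whose "50% confidence" year is at or before `year`.
    return sum(y <= year for y in stated_years) / len(stated_years)

# Reading off the horizontal line at 0.1: the first year where the curve
# reaches a proportion of 1 in 10 experts.
year_at_10pct = next(
    y for y in range(2000, 2200)
    if proportion_with_50pct_by(y, fifty_pct_years) >= 0.1
)
print(year_at_10pct)  # 2018 with these made-up numbers
```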
I would be curious to see more thoughts on this from people who have thought more than I have about stable/reliable self-improvement/tiling. Broadly speaking, I am also somewhat skeptical that it's the best problem to be working on now. However, here are some considerations in favor:
It seems plausible to me that an AI will be doing most of the design work before it is a "human-level reasoner" in your sense. The scenario I have in mind is a self-improvement cycle by a machine specialized in CS and math, which is either better than humans at these things, or...
I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?
Hm, I don't know what the definition is either. In my head, it means "can get an arbitrary amount of money from", e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.
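Concretely, the picture in my head is a loop like this (hypothetical preferences and fee, just to illustrate that reading):

```python
# Hypothetical cyclic preferences: the agent strictly prefers B to A, C to B,
# and A to C, and is willing to pay a $1 fee for each "upgrade".
fee = 1.0
trades = [("A", "B"), ("B", "C"), ("C", "A")]

holdings, cash = "A", 0.0
for lap in range(100):            # around the loop as many times as you like
    for worse, better in trades:
        assert holdings == worse  # each trade looks like a strict improvement to the agent
        holdings = better
        cash -= fee

print(holdings, cash)             # back where it started, $300 poorer
```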
Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it's not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as "Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege".
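To spell out the "pay for the privilege" part, here's a generic toy setup with made-up numbers (not the specific example from your post): buying a worthless token doesn't cause the windfall, but it's strongly correlated with getting one, so naive EDT pays for the good news while CDT doesn't.

```python
# Toy "managing the news" calculation (my own numbers, smoking-lesion-style):
# a hidden "lucky" flag determines a later $1000 windfall; token-buying is
# correlated with being lucky but has no causal effect on it.

P_LUCKY = 0.9
P_BUY_GIVEN_LUCKY = 0.99
P_BUY_GIVEN_UNLUCKY = 0.01
WINDFALL, TOKEN_COST = 1000.0, 10.0

def p_lucky_given(buy: bool) -> float:
    """Bayesian update on the agent's own action, which is what EDT conditions on."""
    p_buy_l = P_BUY_GIVEN_LUCKY if buy else 1 - P_BUY_GIVEN_LUCKY
    p_buy_u = P_BUY_GIVEN_UNLUCKY if buy else 1 - P_BUY_GIVEN_UNLUCKY
    return P_LUCKY * p_buy_l / (P_LUCKY * p_buy_l + (1 - P_LUCKY) * p_buy_u)

# EDT: condition on the action as evidence about the lucky flag.
edt_buy = p_lucky_given(True) * WINDFALL - TOKEN_COST   # ~988.9
edt_pass = p_lucky_given(False) * WINDFALL              # ~83.3

# CDT: intervene on the action; the lucky flag is causally upstream and unaffected.
cdt_buy = P_LUCKY * WINDFALL - TOKEN_COST               # 890
cdt_pass = P_LUCKY * WINDFALL                           # 900

print(f"EDT: buy={edt_buy:.1f}  pass={edt_pass:.1f}  -> EDT pays for the good news")
print(f"CDT: buy={cdt_buy:.1f}  pass={cdt_pass:.1f}  -> CDT keeps its $10")
```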
This caused me to find your substack! Sorry I missed it earlier, looking forward to catching up.