As a starting point, it might help to understand exactly where people's naïve intuitions about why corrigibility should be easy clash with the technical argument that it's hard.
For me, the intuition goes like this: if I wanted to spend some fraction of my effort helping dolphins in their own moral reference frame, that seems like something I could do. I could give them gifts that I can predict they'd like (like tasty fish or a water purifier), be conservative when I couldn't figure out what dolphins "really wanted", and be eager to accept feedback when the dolphins wanted to change how I was trying to help. If my superior epistemic vantage point let me predict that how dolphins would respond to gifts depended on details like the order the gifts were presented in, I might compute an average over possible gift-orderings, or I might try to ask the dolphins to clarify, but I definitely wouldn't tile the lightcone with tiny molecular happy-dolphin sculptures, because I can tell that's not what dolphins want under any sensible notion of "want".
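A minimal toy sketch of that "average over gift-orderings" move, just to make the idea concrete (the gift list, the preference numbers, and the function names here are all hypothetical illustrations, not anything from the discussion):

```python
from itertools import permutations

# Toy illustration: if the dolphins' inferred preferences depend on the order
# in which gifts are presented (a presentation artifact, not a real preference),
# one conservative move is to average over all possible orderings rather than
# exploit any particular one.

GIFTS = ["tasty fish", "water purifier", "bubble ring toy"]

def inferred_preference(gift: str, order: tuple) -> float:
    """Hypothetical stand-in for a model of how much the dolphins value `gift`,
    given the presentation order. Earlier-presented gifts get a small,
    order-dependent bonus to simulate the artifact."""
    base = {"tasty fish": 0.9, "water purifier": 0.7, "bubble ring toy": 0.4}
    return base[gift] + 0.05 * (len(order) - order.index(gift))

def order_averaged_preference(gift: str) -> float:
    """Average the inferred preference over every possible gift ordering,
    washing out the order-dependence."""
    orders = list(permutations(GIFTS))
    return sum(inferred_preference(gift, o) for o in orders) / len(orders)

if __name__ == "__main__":
    for gift in GIFTS:
        print(f"{gift}: {order_averaged_preference(gift):.3f}")
```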
So what I'd like to understand better is: where exactly does the analogy between "humans being corrigible to dolphins (in the fraction of their efforts devoted to helping dolphins)" and "AI being corrigible to humans" break down, such that I haven't noticed the break yet because empathic inference between mammals still works "well enough", but it won't work when scaled to superintelligence? When I try to think of gift ideas for dolphins, am I failing to notice some way in which I'm "selfishly" projecting what I think dolphins should want onto them, or am I violating some coherence axiom?
When I try to think of gift ideas for dolphins, am I failing to notice some way in which I'm "selfishly" projecting what I think dolphins should want onto them, or am I violating some coherence axiom?
I think it's rather that 'it's easy to think of ways to help a dolphin (and a smart AGI would presumably find this easy too), but it's hard to make a general intelligence that robustly wants to just help dolphins, and it's hard to safely coerce an AGI into helping dolphins in any major way if that's not what it really wants'.
I think the argument is two-part, and both parts are important:
I don't feel like joining this, but I do wish you luck, and I'll make a high level observation about methodology.
I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.
I don't consider myself to be a rationalist or EA, but I do post on this web site, so I guess this makes me part of the community of people who post on this site. My high level observation on solving corrigibility is this: the community of people who post on this site has absolutely no mechanism for agreeing among themselves whether a problem has been solved.
This is what you get when a site is in part a philosophy-themed website/forum/blogging platform. In philosophy, problems are never solved to the satisfaction of the community of all philosophers. This is not necessarily a bad thing. But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.
In business, there is the useful terminology that certain meetings will be run as 'decision making meetings', e.g. to make a go/no-go decision on launching a certain product design, even though a degree of uncertainty remains. Other meetings are exploratory meetings only, and are labelled as such. This forum is not a decision making forum.
But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.
Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.
I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.
This has actually already happened in the document with corrigible either meaning:
Then we can think "assuming corrigible-definition-1, then yes, this is a solution".
I don't see a benefit to the exploratory/decision making forum distinction when you can just do the above, but maybe I'm missing something?
Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.
Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.
but maybe I'm missing something?
The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site has absolutely no mechanism for agreeing among themselves whether a problem has been solved, or whether some sub-group has made meaningful progress on it.
To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy to this forum are stronger than the forces that push towards agreement and clarity.
My thinking about how strong these forces are has been updated recently by the posting of a whole sequence of Yudkowsky conversations, and also this one. In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.
I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.
The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?
In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.
Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across that way). Though I believe the majority of the comments were in the spirit of understanding and coming to an agreement. Adam Shimi is also working on a post describing the disagreements in the dialogue as different epistemic strategies, meaning the cause of disagreement is non-obvious. Alignment is pre-paradigmatic, so agreeing is more difficult compared to communities that have clear questions and metrics to measure them on. I still think we can succeed at the harder problem.
I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.
By "community of philosophers", you mean noone makes any actual progress on anything (or can agree that progress is being made)?
Do you disagree on these examples or disagree that they prove the community makes progress and agrees that progress is being made?
Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.
You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here, Yudkowsky is also part of it.
On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not that the people posting on this site are incapable of using the tools needed to crisply define things; the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs about what it is all supposed to be about.
I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here.
Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of debate.
This has implications for policy-related alignment work. If you want to make a policy proposal that has a chance of being accepted, it is generally required that you can point to some community of subject matter experts who agree on the coherence and effectiveness of your proposal. LW/AF cannot serve as such a community of experts.
On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not that the people posting on this site are incapable of using the tools needed to crisply define things; the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs about what it is all supposed to be about.
Yes... clarity isn't optional.
MIRI abandoned the idea of producing technology a long time ago, so what it will offer the people who are working on AI technology is some kind of theory expressed in some kind of document, which will be of no use to them if they can't understand it.
And it takes a constant parallel effort to keep the lines of communication open. It's no use "woodshedding", spending a lot of time developing your own ideas in your own language.
I've got a slightly terrifying hail-mary "solve alignment with this one weird trick"-style paradigm I've been mulling over for the past few years which seems like it has the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV which seems drastically easier to specify). There are a handful of challenging things needed to make it work, but they look to me maybe more achievable than those in other proposals I've read which seem like they could scale to superintelligence.
Realistically I am not going to publish it anytime soon given my track record, but I'd be happy to have a call with anyone who'd like to poke at my models and try to turn it into something. I've had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I've talked to about it at least thought it was creative and interesting.
I've updated my meeting times to meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure diffractor would also be open to meeting.
I will point out that there's a confusion in terms that I noticed in myself, with corrigibility meaning either "always correctable" or "something like CEV", though we can talk that over on a call too :)
Your Google Docs link leads to Alex's "Corrigibility Can Be VNM-Incoherent" post. Is this a mistake, or am I misunderstanding something?
Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users on the site for topics?
Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade-offs, or discuss the costs/benefits of the format I'm presenting in this post.
Meetups: want to co-work with others in the community? Comment your availability, work preferences, and a way to contact you (e.g. calendly link, "dm me", "my email is bob and alice dot com", etc.).
Availability: Almost all times between 10 AM and PM, California time, regardless of day. Highly flexible hours. Text over voice is preferred, I'm easiest to reach on Discord. The LW Walled Garden can also be nice.
Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all Saturday, Dec 4th (the calendly link will allow you to see your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!
A low-hanging fruit for solving alignment is to dedicate a chunk of time actually trying to solve a sub-problem collectively.
To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this google doc (you have suggestion privileges):
Additionally, I’ll post 3 top-level comments for:
I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better. Additionally, it’s immensely valuable to have an alignment topic post include a literature review, the community's up-to-date thoughts, and possible future research directions to pursue. I also believe a collaborative project like this will put several community members on the same page as far as terminology and gears-level models.
I explicitly commit to 3 weeks of this (so corrigibility this week, and two more topics over the next two weeks). After that are Christmas and New Year's, after which I may resume depending on how it goes.
Thanks to Alex Turner for reviewing a draft.