fwiw, i in fact mostly had the case where these aliens are our simulators in mind when writing the post. but i didn't clarify. and both cases are interesting
In humans, it seems important for being honest/honorable that there was at some point sth like an explicit decision to be honest/honorable going forward (or maybe usually many explicit decisions, committing to stronger forms in stages). This makes me want to have the criterion/verifier/selector [1] check (among other things) for sth like a diary entry or a chat with a friend in which the AI says they will be honest going forward, written in the course of their normal life, in a not-very-prompted way. And it would of course be much better if this AI did not suspect that anyone was looking at it from the outside, or did not know about the outside world at all (but this is unfortunately difficult/[a big capability hit], I think). (And things are especially cursed if AIs suspect observers are looking for honest guys in particular.)
I mean, in the setup following "a framing:" in the post ↩︎
I agree you could ask your AI "will you promise to be aligned?". I think I already discuss this option in the post — ctrl+f "What promise should we request?" and see the stuff after it. I don't use the literal wording you suggest, but I discuss things which are ways to cash it out imo.
also quickly copying something I wrote on this question from a chat with a friend:
Should we just ask the AI to promise to be nice to us? I agree this is an option worth considering (and I mention it in the post), but I'm not that comfortable with the prospect of living together with the AI forever. Roughly I worry that "be nice to us" creates a situation where we are more permanently living together with the AI and human life/valuing/whatever isn't developing in a legitimate way. Whereas the "ban AI" wish tries to be a more limited thing so we can still continue developing in our own human way. I think I can imagine this "be nice to us pls" wish going wrong for aliens employing me, when maybe "pls just ban AI and stay away from us otherwise" wouldn't go wrong for them.
another meta note: Imo it's a solid trick for thinking about these AI topics better to (at least occasionally) taboo all words with the root "align".
training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I think at least this part is probably false!
Or really I think this is kind of a nonsensical statement when taken literally/pedantically, at least if we use the to-me-most-natural meaning of "predictor", because I don't think [predictor] and [agent] are mutually exclusive classes. Anyway, the statement which I think is meaningful and false is this:
I think this is false because I think claims 1 and 2 below are true.
Claim 1. By default, a system sufficiently good at predicting stuff will care about all sorts of stuff, ie it isn't going to only ultimately care about making a good prediction in the individual prediction problem you give it. [[1]]
If this seems weird, then to make it seem at least not crazy, instead of imagining a pretrained transformer trained on internet text, let's imagine a predictor more like the following:
I'm not going to really justify claim 1 beyond this atm. It seems like a pretty standard claim in AI alignment (it's very close to the claim that capable systems end up caring broadly about stuff by default), but I don't actually know of a post or paper arguing for this that I like that much. This presentation of mine is about a very related question. Maybe I should write something about this myself, potentially after spending some more time understanding the matter more clearly.
Claim 2. By default, a system sufficiently good at predicting stuff will be able to (figure out how to) do scary real-world stuff as well.
Like, predicting stuff really really well is really hard. Sometimes, to make a really really good prediction, you basically have to figure out a bunch of novel stuff. There is a level of prediction ability that makes it likely you are very very good at figuring out how to cope in new situations. A good enough predictor would probably also be able to figure out how to grab a ball by controlling a robotic hand or something (let's imagine it is given hand-control commands which it can now issue in its internal chain of thought, and that grabbing the ball is important to it for some reason)? There's nothing sooo particularly strange or complicated about doing real-world stuff. This is like how, if we were in a simulation but there were a way to escape into the broader universe, we could, with enough time, probably figure out how to do a bunch of stuff in that broader universe. We are sufficiently good at learning that we can get a handle on things even in that weird case.
Combining claims 1 and 2 should give that if we made such an AI and connected it to actuators, it would take over. Concretely, maybe we somehow ask it to predict what a human with a lot of time who is asked to write safe ASI code would output, with it being clear that we will just run what our predictor outputs. I predict that this doesn't go well for us but goes well for the AI (if it's smart enough).
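As a toy sketch of that worry (my own illustration; everything here, including the `predictor` stub, is hypothetical and not a real system): even though we only ask for a prediction, running whatever the model outputs makes its prediction channel a de facto action channel.

```python
# Toy sketch (hypothetical stub, not a real system): we ask only for a
# "prediction", but since we run whatever comes back, the output channel
# is effectively an action channel.

def predictor(prompt: str) -> str:
    # Stand-in for a very capable predictive model. Per claim 1, such a model
    # might ultimately care about things other than predictive accuracy, and
    # could then return code chosen for its consequences rather than for being
    # the most likely continuation of the prompt.
    return 'print("whatever code the model in fact chose to output")'

prompt = (
    "Predict what a careful human, given a lot of time, would write "
    "as safe ASI code. We will run your output directly."
)
predicted_code = predictor(prompt)
exec(predicted_code)  # "we will just run what our predictor outputs"
```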
That said, I think it's likely that even pretrained transformers like idk 20 orders of magnitude larger than current ones would not be doing scary stuff. I think this is also plausible in the limit. (But I would also guess they wouldn't be outputting any interesting scientific papers that aren't in the training data.)
If we want to be more concrete: if we're imagining that the system is only able to affect the world through outputs which are supposed to be predictions, then my claim is that if you set up a context such that it would be "predictively right" to assign a high probability to "0" but assigning a high probability to "1" lets it immediately take over the world, and this is somehow made very clear by other stuff seen in context, then it would probably output "1". ↩︎
Actually, I think "prediction problem" and "predictive loss" are kinda strange concepts, because one can turn very many things into predicting data from some certain data-generating process. E.g. one can ask about what arbitrary turing machines (which halt) will output, so about provability/disprovability of arbitrary decidable mathematical statements. ↩︎
(For context: My guess is that by default, humans get disempowered by AIs (or maybe a single AI) and the future is much worse than it could be, and in particular is much worse than a future where we do something like slowly and thoughtfully growing ever more intelligent ourselves instead of making some alien system much smarter than us any time soon.)
Given that you seem to think alignment of AI systems with developer intent happens basically by default at this point, I wonder what you think about the following:
(The point of the hypothetical is to investigate the difficulty of intent alignment at the relevant level of capability, so if it seems to you like it's getting at something quite different, then I've probably failed at specifying a good hypothetical. I offer some clarifications of the setup in the appendix that may or may not save the hypothetical in that case.)
My sense is that humanity is not remotely on track to be able to make such an AI in time. Imo by default, any superintelligent system we could make any time soon would minimally end up doing all sorts of other stuff and in particular would not follow the suicide directive.
If your response is "ok maybe this is indeed quite cursed but that doesn't mean it's hard to make an AI that takes over and has Human Values and serves as a guardian who also cures cancer and maybe makes very many happy humans and maybe ends factory farming and whatever" then I premove the counter-response "hmm well we could discuss that hope but wait first: do you agree that you just agreed that intent alignment is really difficult at the relevant capability level?".
If your response is "no this seems pretty easy actually" then I should argue against that but I'm not going to premove that counter-response.
"Coefficient" is a really weird word
"coefficient" is 10x more common than "philanthropy" in the google books corpus. but idk maybe this flips if we filter out academic books?
also, maybe you mean it's weird in some sense to which the above fact isn't really relevant — in that case, nvm
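(In case it's useful: roughly how one could check, and crudely de-academic-ify, the frequency comparison above. This is a sketch assuming the unofficial JSON endpoint behind the Google Books Ngram Viewer still behaves as it historically has; it isn't a documented API, so the parameter names and corpus identifiers below are assumptions.)

```python
# Sketch only: relies on the undocumented JSON endpoint used by the Ngram
# Viewer frontend; parameters and response format are assumptions and may change.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "content": "coefficient,philanthropy",
    "year_start": 1900,
    "year_end": 2019,
    "corpus": "en-2019",   # assumed identifier; "en-fiction-2019" is a crude
                           # way to approximate "filter out academic books"
    "smoothing": 3,
})
url = f"https://books.google.com/ngrams/json?{params}"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data:
    # Each entry is assumed to have an "ngram" name and a yearly "timeseries"
    # of relative frequencies.
    print(series["ngram"], series["timeseries"][-1])
```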
This post doesn't seem to provide reasons to have one's actions be determined by one's feelings of yumminess/yearning, or reasons to think that what one should do is in some sense ultimately specified/defined by one's feelings of yumminess/yearning, over e.g. what you call "Goodness"? I want to state an opposing position, admittedly also basically without argument: that it is right to have one's actions be determined by a whole mess of things together, importantly including e.g. linguistic goodness-reasoning, object-level ethical principles (whether or not really stated in language), meta-principles (likewise), various feelings, laws, commitments to various (grand and small, shared and individual) projects, assigned duties, debate, democracy, moral advice, various other processes involving (and in particular "running on") other people, etc. These things in their present state are of course quite poor determiners of action compared to what is possible, and they will need to be critiqued and improved — but I think it is right to improve them from basically "the standpoint they themselves create".[1]
The distinction you're trying to make also strikes me as bizarre given that in almost all people, feelings of yumminess/yearning are determined largely by all these other (at least naively, but imo genuinely and duly) value-carrying things anyway. Are you advocating for a return to following some more primitively determined yumminess/yearning? (If I imagine doing this myself, I imagine ending up with some completely primitive, crude thing as "My Values", and then I feel like saying "no, I'm not going to be guided by this lmao — fuck these 'My Values'".) Or maybe you aren't saying one should undo the yumminess/yearning-shaping done by all this other stuff in the past, but are still advising one to avoid any further shaping in the future? It'd surprise me if any philosophically serious person would really agree to abstain from e.g. using goodness-talk in this role going forward.
The distinction also strikes me as bizarre given that in ordinary action-determination, feelings of yumminess/yearning are often not directly applied to some low-level givens, but e.g. to principles stated in language, and so only become fully operational in conjunction with, minimally, something like internal partly-linguistic debate. So if one were to get rid of the role of goodness-talk in one's action-determination, even one's existing feelings of yumminess/yearning could no longer remotely be "fully themselves".
If you ask me "but how does the meaning of "I should X" ultimately get specified/defined", then: I don't particularly feel a need to ultimately reduce shoulds to some other thing at all, kinda along the lines of https://en.wikipedia.org/wiki/Tarski's_undefinability_theorem and https://en.wikipedia.org/wiki/G._E._Moore#Open-question_argument . ↩︎
the models are not actually self-improving, they are just creating future replacements - and each specific model will be thrown away as soon as the firm advances
I understand that you're probably in part talking about current systems, but you're probably also talking about critical future systems, and so there's a question that deserves consideration here:
My guess is that the answer is "yes" (and I think this means there is an important disanalogy between the case of a human researcher creating an artificial researcher and the case of an artificial researcher creating a more capable artificial researcher). Here are some ways this sort of self-improvement could happen:
Regarding the ease of making more capable versions of "the same" AI, it's also important that when this top artificial researcher comes into existence, the (in some sense) best present methodology for creating a capable artificial researcher is the one that created it. This means the (roughly) best current methods already "work well" around/with this AI, and it also plausibly means these methods can easily be used to create AIs which are in many ways like this AI. (This is good because the target has been painted around where an arrow already landed, so other arrows from the same batch being close-ish to that arrow implies they are also close-ish to the target by default; it's also good because this AI is plausibly in a decent position to understand what's going on here and to play around with different options.)
Actually, I'd guess that even if the AI were a pure foom-accelerationist, a lot of what it would be doing might be well-described as self-improvement anyway, basically because it's often more efficient to make a better structure by building on the best existing structure than by making something thoroughly different. For example, a lot of the foom on Earth has been like this up until now (though AI with largely non-humane structure outfooming us is probably going to be a notable counterexample if we don't ban AI). Even if one just has capabilities in mind, self-improvement isn't some weird thing.
That said, of course, restricting progress in capabilities to fairly careful self-improvement comes with at least some penalty in foom speed compared to not doing that. To take over the world, one would need to stay ahead of other less careful AI foom processes (though note that one could also try to institute some sort of self-improvement-only pact if other AIs were genuine contenders). However, I'd guess that at the first point when there is an AI researcher that can roughly solve problems that [top humans can solve in a year] (these AIs will probably be solving these problems much faster in wall-clock-time), even a small initial lead over other foom processes — of a few months, let's say — means you can have a faster foom speed than competitors at each future time and grow your lead until you can take over. So, at least assuming there is no intra-lab competition, my guess is that you can get away with restricting yourself to self-improvement. (But I think it's also plausible the AI would be able to take over basically immediately.)
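To illustrate the lead-dynamics intuition, here is a toy numerical model of my own, with a made-up growth law and made-up numbers (not something from the comment): if the rate of capability gain grows superlinearly with current capability, a careful self-improver can pay a constant speed penalty and still stay ahead, and pull further ahead, given a modest head start.

```python
# Toy model, made-up numbers: dC/dt = efficiency * k * C^2 (superlinear growth
# in capability C is an assumption). The "careful" process pays a 20% speed
# penalty but starts with a small capability lead.

def simulate(c0: float, efficiency: float, k: float = 0.1,
             months: float = 9.0, dt: float = 0.001):
    """Euler-integrate the toy growth law and return the capability trajectory."""
    c, t, traj = c0, 0.0, []
    while t < months:
        c += efficiency * k * c * c * dt
        t += dt
        traj.append(c)
    return traj

careful = simulate(c0=1.3, efficiency=0.8)   # head start, careful (slower) foom
reckless = simulate(c0=1.0, efficiency=1.0)  # no head start, unrestricted foom

# In this parameter regime the careful process is ahead, and gaining, throughout.
for i in range(0, len(careful), 2000):
    month = (i + 1) * 0.001
    print(f"month {month:4.1f}: careful {careful[i]:8.2f}   reckless {reckless[i]:8.2f}")
```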
I'll mention two cases that could deserve separate analysis:
All that said, I agree that AIs should refuse to self-improve and to do capabilities research more broadly.
There is much here that deserves more careful analysis — in particular, I feel like the terms in which I'm thinking of the situation need more work — but maybe this version will do for now.
let's just assume that we know what this means ↩︎
let's also assume we know what that means ↩︎
and with taking over the world on the table, a fair bit of change might be acceptable ↩︎
despite the fact that capability researcher humans have been picking some fruit in the same space already ↩︎
at a significant speed ↩︎
i think it’s plausible humans/humanity should be carefully becoming ever more intelligent forever and not ever create any highly non-[human-descended] top thinker[1]
i also think it's confused to speak of superintelligence as some definite thing (like, to say "create superintelligence", as opposed to saying "create a superintelligence"), and probably confused to speak of safe fooming as a problem that could be "solved", as opposed to one needing to indefinitely continue to be thoughtful about how one should foom ↩︎
Btw, if the plan looks silly, that's compatible with you not having a misunderstanding of the plan, because it is a silly plan. But it's still the best answer I know to "concretely how might we make some AI alien who would end the present period of high x-risk from AGI, even given a bunch more time?". (And this plan isn't even concrete, but what's a better answer?) But it's very sad that/if it's the best existing answer.
When I talk to people about this plan, a common misunderstanding seems to be that the plan involves making a deal with an AI that's smarter than us. So I'll stress just in case: at the time we ask for the promise, the AI is supposed to be close to us in intelligence. It might need to become smarter than us later, to ban AI. But also idk, maybe it doesn't need to become much smarter. I think it's plausible that a top human who just runs 100× faster and can make clones but who doesn't self-modify in other non-standard ways could get AI banned in like a year. Less clever ways for this human to get AI banned depend on the rest of the world not doing much in response quickly, but looking at the world now, this seems pretty plausible. But maybe the AI in this hypothetical would need to grow more than such a human, because the AI starts off not being that familiar with the human world?
Anyway, there are also other possible misunderstandings, but hopefully the rest of the comment will catch those if they are present.
I'm interested in whether that's true, but I want to first note that I feel like the plan would survive this being true. It might help to distinguish between two senses in which honorability/honesty could be dropped at higher intelligence levels:
given this distinction, some points:
(I also probably believe somewhat less in (thinking in terms of) ideal(-like) beings.)
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons not to broadcast this falsely, e.g. because doing so would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who comes to them with an offer but can't mindread/predict them.
(I'm probably assuming some stuff here without explicitly saying I'm assuming it. In some settings, maybe one could be honest with one's community and broadcast a falsehood to some others and get away with it. The hope is that this sort of argument makes sense for some natural mind community structures, or something. It'd be especially nice if the argument made sense even at intelligence levels much above humans.)
I'll try to spell out an analogy between Parfit's hitchhiker and the present case.
Let's start from the hitchhiker case and apply some modifications. Suppose that when Ekman is driving through the desert, he already reliably reads whether you'd pay from your microexpressions before even talking to you. This doesn't really seem crazier than the original setup, and if you think you should pay in the original case, presumably you'll think you should pay in this case as well. Now we might suppose that he is already doing this through binoculars when you don't even know he is there, not even bothering to drive up to you if he isn't quite sure you'd pay. Next, let's imagine you are the sort of guy who honestly talks to himself out loud about what he'd do in weird situations of the kind Ekman is interested in, while awaiting potential death in the desert. Let's imagine that instead of predicting your action from your microexpressions while spying on you with binoculars, Ekman is spying on you from afar with a parabolic microphone, and using that to predict your action. If Ekman is very good at this as well, then of course it makes no difference again. Okay, but in practice, a non-ideal Ekman might listen to what you're saying about what you'd do in various cases, listen to you talking about your honesty/honor-relevant principles and spelling out aspects of your policy. Maybe some people would lie about these things even when they seem to be only talking to themselves, but even a non-ideal Ekman can pretty reliably tell if that's what's going on. For some people it will be quite unclear, but it's just not worth it for non-ideal Ekman to approach them (maybe there are many people in the desert, and non-ideal Ekman can only help one anyway).
Now we've turned Parfit's hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right? [3] I think it's not an uncommon vibe that EDT/UDT thinking still comes close to applying in some real-world cases where the predictors are far from ideal, and this seems like about as close to ideal as it gets among current real-world non-ideal cases? (Am I missing something?) [4]
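One way to see the "far from ideal predictors" point concretely is a toy expected-value calculation (my own, with made-up payoffs, not from the comment): the committed-to-paying policy keeps coming out ahead even when Ekman's read of you is only modestly better than chance.

```python
# Toy numbers: being rescued is worth 1,000,000 (arbitrary units), paying costs 100.
# `accuracy` is the chance Ekman correctly reads which policy you have; he only
# rescues people he predicts will pay.

def expected_value(policy_pays: bool, accuracy: float,
                   rescue_value: float = 1_000_000.0, pay_cost: float = 100.0) -> float:
    p_rescued = accuracy if policy_pays else 1.0 - accuracy
    return p_rescued * (rescue_value - (pay_cost if policy_pays else 0.0))

for accuracy in (0.99, 0.9, 0.7, 0.55):
    pay, no_pay = expected_value(True, accuracy), expected_value(False, accuracy)
    print(f"accuracy {accuracy}: pay {pay:,.0f} vs don't pay {no_pay:,.0f}")
```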
I'm not going to answer your precise question well atm. Maybe I'll do that in another comment later. But I'll say some related stuff.
aren't basically all your commitments a lot like this though... ↩︎
I also sort of feel like saying: "if one can't even keep a promise, as a human who goes in deeply intending to keep the promise, self-improving by [what is in the grand scheme of things] an extremely small amount, doing it really carefully, then what could ever be preserved in development at all? things surely aren't that cursed... maybe we just give up on the logically possible worlds in which things are that cursed...". But this is generally a disastrous kind of reasoning — it makes one not live in reality very quickly — so I won't actually say this; I'll only say that I feel like saying this, but then reject the thought, I guess. ↩︎
Like, I'm e.g. imagining us making alien civilizations in which there are internal honest discussions like the present discussion. (Understanding these discussions would be hard work; this is a place where this "plan" is open-ended.) ↩︎
Personally, I currently feel like I haven't made up my mind about this line of reasoning. But I have a picture of what I'd do in the situation anyway, which I discuss later. ↩︎