I don't mean alignment with human concerns. I mean that the AI itself is engaged in the same project we are: building a smarter system than itself. So if it's hard for us to control the alignment of such a system, it should be hard for the AI too. (In theory you could imagine that it's only hard at our specific level of intelligence, but in fact all the arguments that AI alignment is hard seem to apply just as well to the AI making an improved AI as to us making an AI.)
See my reply above. The AI x-risk arguments require the assumption that superintelligence necessarily entails that the agent tries to optimize some simple utility function (this is different from orthogonality, which says increasing intelligence doesn't cause convergence to any particular utility function). So the "doesn't care" option is off the table, since (by orthogonality) it's extremely unlikely you get the one utility function which says just maximize intelligence locally (even "maximize intelligence globally" isn't enough, because some child AI with different goals could interfere).
Those are reasonable points, but note that the arguments for AI x-risk depend on the assumption that any superintelligence will necessarily be highly goal-directed. Thus, either the argument fails because superintelligence doesn't imply goal-directedness, or the goal-directed AI faces the very same alignment problem we do when it tries to build an improved successor.
And given that simply maximizing the intelligence of future AIs is merely one goal in a huge space, it seems highly unlikely (especially if we try to avoid this one goal) that we get so unlucky that the AI ends up with the one goal that is compatible with improvement.
I like the idea of this sequence, but -- given the goal of spelling out the argument in terms of first principles -- I think more needs to be done to make the claims precise or to acknowledge that they are not.
I realize that you might be unable to be more precise given the lack of precision in this argument generally -- I don't understand how people have invested so much time and money on research to solve the problem and so little on making the argument for it clear and rigorous -- but if that's the case I suggest you indicate where the definitions are insufficient or unclear.
I'll list a few issues here:
Even Bostrom's definition of superintelligence is deeply unclear. For instance, would an uploaded human mind which simply worked at 10x the speed of a normal human mind qualify as a superintelligence? Intuitively the answer should be no, but per the definition the answer is almost certainly yes (at least if we imbue that upload with extra patience). After all, virtually all cognitive tasks of interest benefit from extra time -- if not at the time of performance, then as extra time to practice (10x the practice games of chess would make you a better player). And if such an upload does qualify, that undermines the argument about superintelligent self-improvement (see below).
If instead you require a qualitative improvement, rather than merely a faster rate of computation, to count as superintelligence, then the definition risks being empty. On many important cognitive tasks humans already implement the theoretically optimal algorithm, or nearly so. Lots of problems (e.g. search on unordered data on a classical Turing machine) have no better solution than brute force, and this likely includes quite a few tasks we care a great deal about (maybe even in social interactions). Sure, maybe an AI could optimize away the part where our slow human brains slog through the brute force (though arguably we have already done that with computers), but that just sounds like increased processing speed.
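To make the brute-force point concrete, here is a minimal sketch (in TypeScript, with invented example values): unordered data gives any searcher, human or artificial, no structure to exploit, so in the worst case every element has to be inspected and the only available improvement is doing those same inspections faster.

```typescript
// Minimal sketch of brute-force search over unordered data (example
// values invented). With no structure to exploit, any correct classical
// algorithm must, in the worst case, inspect every element; the only
// available improvement is doing the same inspections faster.
function findKey(items: number[], target: number): number {
  for (let i = 0; i < items.length; i++) {
    if (items[i] === target) {
      return i; // found after inspecting i + 1 elements
    }
  }
  return -1; // had to inspect all items.length elements
}

const haystack = [42, 7, 19, 3, 88, 61];
console.log(findKey(haystack, 61));  // 5  -- found only at the very end
console.log(findKey(haystack, 100)); // -1 -- scanned everything and failed
```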
Finally, does this superiority take resource usage into account? Does a superintelligence need to beat us in a watt-for-watt comparison, or can it use the computing capacity of the planet?
These are just a few concerns, but they illustrate the inadequacy of the definition. And this isn't just nitpicking. This loose way of talking about superintelligence invites us, without adequate argument, to assume the relationship we will have to AI is akin to the relationship you have with your dumb family members and friends. And even if that were the relationship, remember that your dumb friends wouldn't seem so easily dominated if they hadn't decided to put so little effort into intellectual issues.
When it comes to self-improvement, the discussion is missing any notion of rate, extent, or qualitative measure. The tendency is for people to assume that since technological progress seems to happen fast, somehow this self-improvement will too -- but why should that be?
I mean, we are already capable of self-improvement. We change the culture we pass down over time, and as a result a child born today ends up learning more math, history, and all sorts of problem-solving tools in school that an ancient Roman kid wouldn't have learned [1]. Will AI self-improvement be equally slow? If an AI doesn't improve itself any faster than we improve our own intelligence, no problem. So any discussion of this issue that seeks to draw meaningful conclusions needs to make some claim about the rate of improvement, and even defining such a quantitative measure seems extremely difficult.
And it's not just the instantaneous rate of self-improvement that matters, but also the shape of the curve. You seem to grant that figuring out how to improve AI intelligence will take the AI some time -- it has to do the same kind of trial and error we did to build it in the first place -- and won't be instantaneous. OK, how does that time scale with increasing intelligence? Maybe an AI with 100 SIQ points can build one with 101 SIQ after a week of work. But then maybe it takes two weeks for the 101 SIQ AI to figure out how to reach 102, and so on. Maybe capability even asymptotes.
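To see how much rides on that scaling assumption, here is a toy simulation (every number is hypothetical, including the made-up SIQ unit above): if each additional SIQ point takes twice as long as the last, capability grows only logarithmically in time; if the gain achievable per week instead halves every week, capability converges to a hard ceiling.

```typescript
// Toy model of two self-improvement curves; every number here is
// hypothetical, including the made-up "SIQ" unit from the comment above.
const WEEKS = 100;

// Scenario A: each additional SIQ point takes twice as long as the last
// (1 week, then 2, then 4, ...). Capability never stops rising, but it
// grows only logarithmically in elapsed time.
let siqA = 100;
let stepCost = 1;   // weeks needed for the next +1 SIQ
let nextGainAt = 1; // week at which that gain arrives
for (let week = 1; week <= WEEKS; week++) {
  while (week >= nextGainAt) {
    siqA += 1;
    stepCost *= 2;
    nextGainAt += stepCost;
  }
}

// Scenario B: the gain achievable per week halves every week
// (1, 1/2, 1/4, ...), so capability converges to a ceiling of 102 SIQ.
let siqB = 100;
let weeklyGain = 1;
for (let week = 1; week <= WEEKS; week++) {
  siqB += weeklyGain;
  weeklyGain /= 2;
}

console.log(`After ${WEEKS} weeks: scenario A is at ${siqA} SIQ, scenario B is at ${siqB.toFixed(3)} SIQ`);
```

Under these assumptions, a century of weeks leaves scenario A only a handful of points ahead of where it started and scenario B pinned just under its ceiling -- neither looks anything like an explosion.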
And what does any of this even mean? Is the AI getting much more capable or only marginally more capable? Why assume the former? Given that there are mathematical limits on the most efficient possible algorithms, shouldn't we expect an asymptote in ability? Indeed, there might be good reasons to think humans aren't that far from it.
1: Of course, I know that people will try to insist that merely having learned a bunch of skills and tricks in school that help you solve problems doesn't qualify as improving your intelligence. Why not? If intelligence is just a measure of the ability to solve relevant cognitive challenges, such teaching sure seems to qualify. I think the temptation here is to import the way we use intelligence in human society as a measure of raw potential, but that relies on a kind of hardware/software distinction that doesn't obviously make sense for AI (and arguably doesn't make sense for humans over long time scales -- see the Flynn effect).
Maybe this question has already been answered but I don't understand how recursive self-improvement of AIs is compatible with the AI alignment problem being hard.
I mean, doesn't the AI itself face the alignment problem when it tries to improve or modify itself substantially? So wouldn't a sufficiently intelligent AI refuse to create such an improvement for fear that the goals of the improved AI would differ from its own?
I'd just like to add that even if you think this piece is completely mistaken, I think it certainly shows we are not knowledgeable enough about what values and motives are and how they work in us -- much less in AI -- to confidently predict that AIs will be usefully described by a single global utility function, or that they will work to subvert their reward system, or the like.
Maybe that will turn out to be true, but before we spend so many resources on trying to solve AI alignment, let's try to make the argument for the great danger much more rigorous first... that's usually the best way to start anyway.
This is one of the most important posts ever on LW, though I don't think the implications have been fully drawn out. Specifically, this post raises serious doubts about the arguments for AI x-risk via alignment mismatch and about the models used to talk about that risk. It undercuts both Bostrom's argument that an AI will have a meaningful (self-aware?) utility function and Yudkowsky's reward-button parables.
The role these two arguments play in convincing people that AI x-risk is a hard problem is to explain why, if you don't anthropomorphize, a program that is, say, excellent at conducting and scheduling interviews to ferret out moles in the intelligence community should try to manipulate external events at all, rather than just reasoning about them to better catch moles. After all, people often fail to pursue even their fervent goals outside familiar contexts. Why will AI be different? Both arguments conclude that AI will inevitably act as if it is very effectively maximizing some simple utility function in all contexts and in all ways.
Bostrom tries to convince us that as creatures get more capable they tend to act more coherently (more as if they are governed by a global utility function). This is of course true for evolved creatures, but by offering a theory of how value-like things can arise, this post predicts that if you only train your AI in a relatively confined class of circumstances (even if that requires making very accurate predictions about the rest of the world), it isn't going to develop that kind of simple global value but, rather, would likely find multiple shards in tension, without clear direction, if forced to make value choices in very different circumstances. Similarly, it explains why the AI won't just wirehead itself by pressing its reward button.
I absolutely think that the future of online marketing involves asking people more about their preferences. I know I go into my settings on Google to actively curate what they show me.
Indeed, I think Google is leaving a fucking pile of cash on the table by not adding an "I dislike" button and a little survey to their ads.
I feel there is something else going on here too.
Your claimed outside view asks us to compare a clean codebase with an unclean one, and I absolutely agree that's a good case for using currentDate when initially writing code.
But you motivated this by considering refactoring, and I think things go off the rails there. If the only issue in your codebase were that you consistently called currentDate yyymmdd, or even had other consistently weird names, it wouldn't be a mess -- it would just have slightly odd conventions. Any coder working on it for a non-trivial length of time would start reading yyymmdd as "current date" in their head.
The codebase is only messy when you inconsistently use a bunch of different, non-descriptive names for the same concept. But then refactoring faces exactly the same problem that working with the code does: the confusion coders experience when they see the variable and wonder what it does becomes ambiguity, which forces a time-intensive refactor.
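As a hypothetical illustration (all names beyond currentDate and yyymmdd are invented for the example): a consistently odd name is a convention you learn once and can rename mechanically, while several inconsistent names for the same concept force you to verify each use site before any rename is safe.

```typescript
// Hypothetical before/after; the extra variable names are invented.

// Consistently odd naming: you learn once that yyymmdd means "current
// date", and a mechanical rename to currentDate is safe everywhere.
const yyymmdd = "2024-01-31";
console.log(`report generated on ${yyymmdd}`);
console.log(`archiving logs for ${yyymmdd}`);

// Inconsistent naming: the same concept hiding under several vague names.
// Before renaming anything you must first confirm that dt, today, and
// dStr really do refer to the same thing -- and that ambiguity is what
// makes the refactor time-intensive.
const dt = "2024-01-31";
const today = dt;
const dStr = today;
console.log(`report generated on ${dt}`);
console.log(`archiving logs for ${dStr}`);
```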
Practically, the right move is probably better standards going forward, plus encouraging coders to fix variable names in any piece of code they touch. But I don't think it's really a good example of divergent intuitions once you are talking about the same things.
I don't think this is a big problem. The people who use ad blockers are both a small fraction of internet users and the most sophisticated ones, so I doubt they are a major issue for website profits. I mean, sure, Facebook is eventually going to try to squeeze out the last few percent of users if it can do so with an easy countermeasure, but if this were really a big concern, websites would be pushing to get that information back from the company they use to serve ads. Admittedly, when I was working on ads at Google (I'm not cut out to be out of academia, so I went back to it) I never really got into this part of the system, so I can't comment on how it would work out, but I think if this mattered enough, companies serving ads would figure out how to report back to the page about ad blockers.
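For what it's worth, one plausible way that reporting could work is a bait-element check -- this is a sketch of a common technique, not a description of any actual ad network's implementation: the ad tag plants an element that blocker filter lists typically hide, waits briefly, then checks whether it survived.

```typescript
// Hypothetical sketch (not any real ad network's API): plant a "bait"
// element whose class names common blocker filter lists hide, wait a
// moment, then check whether it survived and report the result.
function detectAdBlock(onResult: (blocked: boolean) => void): void {
  const bait = document.createElement("div");
  bait.className = "adsbox ad-banner"; // class names blockers typically target
  bait.style.height = "1px";
  document.body.appendChild(bait);

  // Give the blocker's cosmetic filters a moment to apply.
  window.setTimeout(() => {
    const blocked = bait.offsetHeight === 0;
    bait.remove();
    onResult(blocked); // the page (or the ad tag) could log this server-side
  }, 100);
}

detectAdBlock((blocked) => {
  console.log(blocked ? "ad blocker detected" : "ads rendering normally");
});
```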
I'm sure some sites resent ad blockers and take some easy countermeasures but at an economic level I'm skeptical this really matters.
What this means for how you should feel about using ad blockers is trickier, but since I kinda like well-targeted ads I don't have much advice on this point.
I agree that's a possible way things could be. However, I don't see how it's compatible with accepting the arguments that say we should assume alignment is a hard problem. I mean, absent such arguments, why expect you have to do anything special beyond normal training to solve alignment?
As I see the argumentative landscape, the high x-risk estimates depend on arguments that claim to give us reason to believe that alignment is just a generally hard problem. I don't see anything in those arguments that distinguishes between these two cases.
In other words, our arguments for alignment difficulty don't depend on any specific assumptions about the capability of the intelligence involved, so we should currently assign the same probability to an AI being unable to solve its alignment problem as we do to us being unable to solve ours.