To expand on you having a point: I have obviously not seen every AI proposal on the internet, but as far as I know, no one is proposing to build a wish-granting AI that parses speech like a squish djinn (and ending up with such an AI would require a deliberate effort). So I don't think the squish djinn is a valid argument against proposed wish-granting AIs. Any proposed or realistic speech-interpreting AI would (as you say) parse English speech as English speech. An AI that makes arbitrary distinctions between different types of meaning would need serious deliberate effort, and as far as I know, no one is proposing to do this. This makes the squish djinn analogy invalid as an argument against proposals to build a wish-granting AI.
It is a basic fact that statements do not have specified "meanings" attached to them, and AI proposals take this into account. An extreme example that makes this very clear would be Bill saying "Steve is an idiot" to two listeners, where one listener will predictably think of one Steve and the other listener will predictably think of some other Steve (or a politician making a speech that different demographics will interpret differently and to their own liking). Bill (or the politician) does not have a specific meaning of which Steve (or which message) they are referring to. This speaker is deliberately making a statement in order to have different effects on different audiences. Another standard example is responding to a question about the location of an object with "look behind you" (anyone who is able to understand English and has no serious mental deficiencies would be able to guess that the meaning is that the object is/might be behind them (as opposed to following the order, being surprised to see the object lying there, and thinking "what a strange coincidence")).
Building an AI that would parse "look behind you" without understanding that the person is actually saying "it is/might be behind you" would require deliberate effort, as it would be necessary to painstakingly avoid using most information while trying to understand speech. Tone of voice, body language, eye gaze, context, prior knowledge of the speaker, models of people in general, etc, etc all provide valuable information when parsing speech. And needing to prevent an AI from using this information (even indirectly, for example through models of "what sentences usually mean") would put enormous additional burdens on an AI project. An example in the current context would be writing: "It is possible to communicate in a way so that one class of people will infer one meaning and take the speaker seriously and another class of people will infer another meaning and dismiss it as nonsense. This could be done by relying on the fact that people differ in their prior knowledge of the speaker and in their ability to understand certain concepts. One can use non-standard vocabulary, take non-standard strong positions, describe uncommon concepts, or otherwise give signals indicating that the speaker is a person who should not be taken seriously, so that the speaker is dismissed by most people as talking nonsense. But people who know the speaker would see a discrepancy and look closer (and if they are familiar with the non-standard concepts behind all the "don't listen to me" signs they might infer a completely different message)."
To expand on the valid AI squish djinn analogy: I think that hard coding an AI that executes a command is practically impossible. But if it did succeed, it would act sort of like a squish djinn given that command. And this argument/analogy is a valid and sufficient argument against trying to hard code such a command, making it relevant as long as there exist people who propose to hard code such commands. If someone tried to hard code an AI to execute such a command, and they succeeded in creating something that had a real world impact, I predict this would represent a failure to implement the command (it would result in an AI that does something other than the squish djinn and something other than what the builders expect it to do). So the squish djinn is not a realistic outcome. But it is what would happen if they succeeded, and thus the squish djinn analogy is a valid argument against "command hard coding" projects. I can't predict what such an AI would actually do since that depends on how the project failed. Intuitively the situation where confused researchers fail to build a squish djinn does not feel very optimal, but making an argument on this basis is more vague, and requires that the proposing researchers accept their own limited technical ability (saying "doing x is clearly technically possible, but you are not clever enough to succeed" to the typical enthusiastic project proposer (who considers themselves to be clever enough to maybe be the first in the world to create a real AI) might not be the most likely argument to succeed (here I assume that the intent is to be understood, and not to lay the groundwork for later smugly saying "I pointed that out a long time ago" (if one later wants to be smug, then one should optimize for being loud, taking clear and strong positions, and not being understood))). The squish djinn analogy is simply a simpler argument.
"Either you fail or you get a squish djinn" is true and simple and sufficient to argue against a project. When presenting this argument, you do spend most of the time arguing about what would happen in a situation that will never actually happen (project success). This might sound very strange to an outside observer, but the strangeness is introduced by the project proposers' (invalid) assumption that the project can succeed (analogous to some atheist saying: "if god exists, and is omnipotent, then he is not nice, cuz there is suffering").
I think you have a point, Will (an AI that interprets speech like a squish djinn would require deliberate effort and is proposed by no one), but I think that it is possible to construct a valid squish djinn/AI analogy (a squish djinn interpreting a command would be roughly analogous to an AI that is hard coded to execute that command).
Sorry to everyone for the repetitive statements and the resulting wall of text (that unexpectedly needed to be posted as multiple comments since it was too long). Predicting how people will interpret something is non-trivial, and explaining concepts redundantly is sometimes a useful way of making people hear what you want them to hear.
Squish djinn is here used to denote a mind that honestly believes that it was actually instructed to squish the speaker (in order to remove regret for example), not a djinn that wants to hurt the speaker and is looking for a loophole. The squish djinn only cares about doing what it is requested to do, and does not care at all about the well-being of the requester, so it could certainly be referred to as hostile to the speaker (since it will not hesitate to hurt the speaker in order to achieve its goal (of fulfilling the request)). A cartoonish internal monologue of the squish djinn would be: "the speaker clearly does not want to be squished, but I don't care what the speaker wants, and I see no relation between what the speaker wants and what it is likely to request, so I determine that the speaker requested to be squished, so I will squish" (which sounds very hostile, but contains no will to hurt the speaker). The typical story djinn is unlikely to be a squish djinn (they usually have a motive to hurt or help the speaker, but are restricted by rules (a clever djinn that wants to hurt the speaker might still squish, but not for the same reasons as a squish djinn (such a djinn would be a valid analogy when opposing a proposal of the type "let's build some unsafe mind with selfish goals and impose rules on it" (such a project can never succeed, and the proposer is probably fundamentally confused, but a simple and correct and sufficient counter argument is: "if the project did succeed, the result would be very bad")))).
I think what Nesov is talking about is best described as a mind that will attack conditioned on victim behavior alone (not considering possible behavior changes of the victim in any way). This is different from an N order blackmailer. In fact I think blackmail is the wrong word here (Nesov says that he does not know what blackmail means in this context, so this is not that surprising). For example, instead of seeking behavior modification through threats, such a mind seeks justice through retribution. I think the most likely SI that implements this is extrapolating an evolved mind's preferences. The will to seek justice through retribution leads to behavior changes in many cases, which leads to an evolutionary advantage. But once it has evolved, it's a preference. If a guy committed a horrific crime (completely ignoring all sorts of law enforcement threats), and then it was somehow ensured that he could never hurt anyone again, most people would want justice (and other evolved minds might have made the same simplification ("if someone does that, I will hit them" is a relatively easily encoded and relatively effective strategy)).
It is true that there might exist minds that will see the act of "giving in to retribution seekers" as deserving of retribution, and this could in principle cancel out all other retribution seekers. But it would seem like privileging the hypothesis to think that all such things cancel out completely. You might have absolutely no way of estimating which actions would make people seek retribution against you (I think the most complicating factor is that many consider "non punishment of evildoers" to be worthy of retribution, and others consider "punishment of people that are not actually evildoers" as worthy of retribution), but that is a fact about your map, not a fact about the territory (and unlike the blackmail thing, this is not an instance of ignorance to be celebrated). And the original topic was what an SI would do.
An SI would presumably be able to estimate this. In the case of an SI that is otherwise indifferent to humans, this cashes out to increased utility for "punish humans to avoid retribution from those that think the non-punishment of humans is worthy of retribution" and increased utility for "treat humans nicely to avoid retribution from those that would seek retribution for not treating them nicely" (those that require extermination are not really that important if that is the default behavior). If the resources it would take to punish or help humans are small, this would reduce the probability of extermination, and increase the probability of punishment and help. The type of punishment would be in the form that would avoid retribution from those that categorically seek retribution for that type of punishment regardless of what the "crime" was. If there are lots of (evolvable, and likely to be extrapolated) minds that agree that a certain type of punishment (directed at our type of minds) constitutes "torture" and that torturers deserve to be punished (completely independently of how this affects their actions), then it will have to find some other form of punishment. So, basically: "increased probability for very clever solutions that satisfy those demanding punishment, while not pissing off those that categorically dislike certain types of punishments" (so, some sort of convoluted and confusing existence that some (evolvable and retribution inclined) minds consider "good enough punishment", and others consider "treated acceptably"). At least increased probability of "staying alive a bit longer in some way that costs very little resources".
This would for example have policy implications for people that assume the many worlds interpretation and do not care about measure. They can no longer launch a bunch of "semi randomized AIs" (not random in the sense of "random neural network connections" but more along the lines of "letting many teams create many designs, and then randomly selecting which one to run") and hope that one will turn out ok, and that the others will just kill everyone (since they can no longer be sure that an uncaring AI will kill them, they can no longer be sure that they will wake up in the universe of a caring AI).
(this seems related to what Will talks about sometimes, but using very different terminology)
Did you just succeed in using gender conflicts as the non-political analogous example which allows rational discourse regarding a highly inflamed, trench war topic that would degenerate into something worse than a (subtle and cold version of a) flame war if discussed directly? (different estimates of one's type of reflective equilibrium result in different preferred extrapolation dynamic/initial group/etc (which of course results in cases where it can be instrumentally rational for a non-perfect liar to believe in false things))
If you did this on purpose, you are my new personal hero!
(I'm arrogantly/wisely staying neutral on the question of whether or not it is at all useful to in any way engage with the sort of people whose project proposals can be validly argued against using squish djinn analogies)
(jokes often work by deliberately being understood in different ways at different times by the same listener (the end of the joke deliberately changes the interpretation of the beginning of the joke (in a way that makes fun of someone)). In this case the meaning of the beginning of the joke is not one thing or the other thing. The listener is not first failing to understand what was said and then, after hearing the end, succeeding in understanding it. The speaker is intending the listener to understand the first meaning until reaching the end, so the listener is not "first failing to encode the transmission". There is no inherently true meaning of the beginning of the joke, no inherently true person that this speaker is actually truly referring to. Just a speaker that intends to achieve certain effects on an audience by saying things (and if the speaker is successful, then at the beginning of the joke the listener infers a different meaning from what it infers after hearing the end of the joke). One way to illuminate the concepts discussed above would be to write: "on a somewhat related note, I once considered creating the username "New_Willsome" and starting to post things that sounded like you (for the purpose of demonstrating that if you counter a ban by using sock puppets, you lose your ability to stop people from speaking in your name (I was considering the options of actually acting like I think you would have acted, and the option of including subtle distortions to what I think you would have said, and the option of doing my best to give better explanations of the concepts that you talk about)). But then a bunch of usernames similar to yours showed up and were met with hostility, and I was in a hurry, and drunk, and bat shit crazy, and God told me not to do it, and I was busy writing fanfic, so I decided not to do it (the last sentence is jokingly false. I was not actually in a hurry ... :) ... )")