I've posted on LW before, but I posted again here after a long hiatus because of recent AI news, entirely unaware of the good heart thing; I then made several comments after reading the original post, still thinking it was a joke. Now I understand why the site was so strangely active.
"An animal looking curiously in the mirror, but the reflection is a different kind of animal; in digital style."
"A cat looking curiously in the mirror, but the reflection is a different kind of animal; in digital style."
"A cat looking curiously in the mirror, but the reflection is a dog; in digital style."
Curious to see how it handles modified-reflection and lack-of-specificity.
Another thing whose True Name is probably a key ingredient for alignment (and which I've spent a lot of time trying to think rigorously about): collective values.
Which is interesting, because most of what we know so far about collective values is that, for naive definitions of "collective" and "values", they don't exist. Condorcet, Arrow, Gibbard and Satterthwaite, and (crucially) Sen have all helped show that.
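To make the first of those results concrete, here's a minimal sketch of the Condorcet paradox. The three-voter profile, the candidate names, and the helper functions are all made up for illustration. Each voter has a perfectly transitive ranking, yet the pairwise majority preference of the group runs in a cycle, so there's no "collective favorite" in the naive sense.

```python
from itertools import combinations

# Hypothetical profile: each ballot ranks the candidates from best to worst.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y, ballots):
    """Return True if a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

def pairwise_winners(ballots):
    """List (winner, loser) for each pair of candidates decided by majority rule."""
    candidates = sorted(ballots[0])
    results = []
    for x, y in combinations(candidates, 2):
        if majority_prefers(x, y, ballots):
            results.append((x, y))
        elif majority_prefers(y, x, ballots):
            results.append((y, x))
    return results

def condorcet_winner(ballots):
    """Return the candidate who beats every other one head-to-head, or None."""
    candidates = sorted(ballots[0])
    for c in candidates:
        if all(majority_prefers(c, o, ballots) for o in candidates if o != c):
            return c
    return None

results = pairwise_winners(ballots)  # A beats B, C beats A, B beats C: a cycle
winner = condorcet_winner(ballots)   # no candidate beats both of the others
print(results, winner)
```

With this profile the majority prefers A to B, B to C, and C to A, so `condorcet_winner` returns `None`. Change any one ballot and the cycle usually disappears, which is part of why the naive concept feels fine until you look for it rigorously.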
I personally don't think that means that the only useful things one can say about "collective values" are negative results like the ones above. I think there are positive things to say; definitions of collectivity (for instance, of democracy) that are both non-trivial and robust. But finding them means abandoning the naive concepts of "collective values".
I think that this is probably a common pattern. You go looking for the True Name of X, but even if that search ever bears fruit, you'd rarely if ever look back and say "Y is the True Name of X". Instead, you'd say something like "(long math notation) is the True Name of itself, or for short, of Y. Though I found this by looking for X, calling it 'X' was actually a misnomer; that phrase has baked-in misconceptions and/or red herrings, so from now on, let's call it 'Y' instead."
I think this post makes sense given the premises/arguments that I think many people here accept: that AG(S)I is either amazingly good or amazingly bad, and that getting the good outcome is a priori vastly improbable, and that the work needed to close the gap between that prior and a good posterior is not being done nearly fast enough.
I don't reject those premises/arguments out of hand, but I definitely don't think they're nearly as solid as I think many here do. In my opinion, the variance in goodness of reasonably-thinkable post-AGSI futures is mind-bogglingly large, but it's still probably a bell curve, with greater probability density in the "middle" than in super-heaven or ultra-hell. I also think that just making the world a better place here and now probably usually helps with alignment.
This is probably not the place for debating these premises/arguments; they're the background of this post, not its point. But I do want to say that having a different view on that background is (at least potentially) a valid reason for not buying into the "containment" strategy suggested here.
Again, I think my point here is worthwhile to mention as one part of the answer to the post's question "why don't more people think in terms of containment". I don't think that we're going to resolve whether there's space in between "friendly" and "unfriendly" right here, though.
Sure, humans are effectively ruthless in wiping out individual ant colonies. We've even wiped out more than a few entire species of ant. But our ruthfulness about our ultimate goals — well, I guess it's not exactly ruthfulness that I'm talking about...
...The fact that it's not in our nature to simply define an easy-to-evaluate utility function and then optimize means that it's not mere coincidence that we don't want anything radical enough to imply the elimination of all ant-kind. In fact, I'm pretty sure that for a large majority of people, there's no utopian ideal you could pitch, and that they'd buy into, that's radical enough that getting there would imply or even suggest actions that would kill all ants. Not because humanity wouldn't be capable of doing that, but because we're not capable of wanting that, and that fact may be related to our (residual) ruthfulness and to our intelligence itself. And metaphorically, from a superintelligence's perspective, I think that humanity-as-a-whole is probably closer to being Formicidae than it is to being one species of ant.
...
This post, and its line of argument, is not about saying "AI alignment doesn't matter". Of fucking course it does. What I'm saying is: "it may not be the case that any tiny misalignment of a superintelligence is fatal/permanent". Because yes, a superintelligence can and probably will change the world to suit its goals, but it won't ruthlessly change the whole world to perfectly suit its goals, because those goals will not, themselves, be perfectly coherent. And in that gap, I believe there will probably still be room for some amount of humanity or posthumanity-that's-still-commensurate-with-extrapolated-human-values having some amount of say in their own fates.
The response I'm looking for is not at all "well, that's all OK then, we can stop worrying about alignment". Because there's a huge difference between future (post)humans living meagerly under sufferance in some tiny remnant of the world that a superintelligence doesn't happen to care about coherently enough to change, and them thriving as an integral part of the future that it does care about and is building, or some other possibility better or worse than those. But what I am arguing is that the "win big or lose big are the only options" attitude I see as common in alignment circles (I know that Eliezer isn't really cutting edge anymore, but look at his recent April Fools' "joke" for an example) may be misguided. Not every superintelligence that isn't perfectly friendly is terrifyingly unfriendly, and I think that admitting other possibilities (without being complacent about them) might help useful progress in pursuing alignment.
...
As for your points about therapy: yes, of course, my off-the-cuff one-paragraph just-so-story was oversimplified. And yes, you seem to know a lot more about this than I do. But I'm not sure the metaphor is strong enough to make all that complexity matter here.
I guess we're using different definitions of "friendly/unfriendly" here. I mean something like "ruthlessly friendly/unfriendly" in the sense that humans (neurotic as they are) aren't. (Yes, some humans appear ruthless, but that's just because their "ruths" happen not to apply. They're still not effectively optimizing for future world-states, only for present feels.)
I think many of the arguments about friendly/unfriendly AI, at least in the earlier stages of that idea (I'm not up on all the latest), are implicitly relying on that "ruthless" definition of (un)friendliness.
You (if I understand) mean "friendly/unfriendly" in a weaker sense, in which humans can be said to be friendly/unfriendly (or neither? Not sure what you'd say about that, but it probably doesn't matter.)
As for the "smart people going to dumb therapists" argument, I think you're going back to a hidden assumption of ruthlessness: if the person knew how to feel better in the future, they would just do that. But what if, for instance, they know how to feel better in the future, but doing that thing wouldn't make them feel better right now unless they first simplify it enough to explain it to their dumb therapist? The dumb therapist is still playing a role.
My point is NOT to say that non-ruthless AGSI isn't dangerous. My point is that it's not an automatic "game over", because if it's not ruthless it doesn't just institute its (un)friendly goals; it is at least possible that it would not use all its potential power.
Why does the AI even "want" failure mode 3? If it's an RL agent, it's not "motivated to maximize its reward", it's "motivated to use generalized cognitive patterns that in its training runs would have marginally maximized its reward". Failure mode 3 is the peak of an entirely separate mountain from the one RL is climbing, and I think a well-designed box setup can (more-or-less "provably") prevent any cross-peak bridges in the form of cognitive strategies that undermine this.
That is to say: yes, it can (or at least, it's not provable that it can't) imagine a way to break the box, and it can know that the reward it would actually get from breaking the box would be "infinite", but it can be successfully prevented from "feeling" the infinite-ness of that potential reward, because the RL procedure itself doesn't consider a broken-box outcome to be a valid target of cognitive optimization.
Now, this creates a new failure mode, where it hacks its own RL optimizer. But that just makes it unfit, not dangerous. Insofar as something goes wrong to let this happen, it would be obvious and easy to deal with, because it would be optimizing for thinking it would succeed and not for succeeding.
(Of course, that last sentence could also fail. But at least that would require two simultaneous failures to become dangerous; and it seems in principle possible to create sufficient safeguards and warning lights around each of those separately, because the AI itself isn't subverting those safeguards unless they've already failed.)
One way of dividing up the options is: fix the current platform, or find new platform(s). The natural decay process seems to be tilting towards the latter, but there are downsides: the diaspora loses cohesion, and while the new platforms obviously offer some things the current one doesn't, they are worse than the current one in various ways (it's really hard to be an occasional lurker on FB or tumblr, especially if you are more interested in the discussion than the "OP").
If the consensus is to fix the current platform, I suggest trying the simple fixes first. As far as I can tell, that means, break the discussion/main dichotomy, and do something about "deletionist" downvoting. Also, making it clearer how to contribute to the codebase, with a clearer owner. I think that these things should be tried and given a chance to work before more radical stuff is attempted.
If the consensus is to find something new, I suggest that it should be something which has a corporation behind it. Something smallish but on the up-and-up, and willing to give enough "tagging" capability for the community to curate itself and maintain itself reasonably separate from the main body of users of the site. It should be something smaller than FB but something willing to take the requests of the community seriously. Reddit, Quora, StackExchange, Medium... this kind of thing, though I can see problems with each of those specific suggestions.
I disagree. I think the issue is whether "pro-liberty" is the best descriptive term in this context. Does it point to the key difference between things it describes and things it doesn't? Does it avoid unnecessary and controversial leaps of abstraction? Are there no other terms which all discussants would recognize as valid, if not ideal? No, no, and no.
Is "do whatever action you predict to maximize the electricity in this particular piece of wire" really "general"? You're basically claiming that the more intelligent someone is, the more likely they are to wirehead. With humans, in my experience, and for a loose definition of "wirehead", the pattern seems to be the opposite; and that seems to me to be solid enough in terms of how RL works that I doubt it's worth the work to dig deep enough to resolve our disagreement here.