In this post I describe a couple of human-AI safety problems in more detail. These helped motivate my proposed hybrid approach, and I think they need to be addressed by other AI safety approaches that currently do not take them into account.

1. How to prevent "aligned" AIs from unintentionally corrupting human values?

We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to safely handle them.)
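To make the "searching for adversarial examples on our value functions" point a bit more concrete in ML terms, here is a minimal sketch of what such a search looks like against a learned proxy of someone's value function. The names (value_model, seed_input) are hypothetical stand-ins and the sketch assumes PyTorch; the point is only that optimizing hard against a learned score tends to push the input far outside the region where the score reflects anything the person would endorse.

```python
import torch

def search_for_high_scoring_input(value_model, seed_input, steps=200, lr=0.05):
    """Gradient-ascent search for an input that a learned proxy of a human
    value function rates extremely highly -- the ML analogue of searching
    for adversarial examples on someone's values."""
    x = seed_input.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        score = value_model(x)   # hypothetical proxy for "how much the person values x"
        (-score).backward()      # ascend the proxy score
        optimizer.step()
    # The result can score far above anything in the training distribution
    # without corresponding to anything the person would endorse on reflection.
    return x.detach()
```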

2. How to defend against intentional attempts by AIs to corrupt human values?

It looks like we may be headed towards a world of multiple AIs, some of which are either unaligned, or aligned to other owners or users. In such a world there's a strong incentive to use one's own AIs to manipulate other people's values in a direction that benefits oneself (even if the resulting losses to others are greater than the gains to oneself).

There is an apparent asymmetry between attack and defense in this arena. Manipulating a human is a straightforward optimization problem with an objective that is easy to test/measure (just check whether the target has accepted the values you're trying to instill, or has started doing things that are more beneficial to you), and hence relatively easy for AIs to learn how to do. Teaching or programming an AI to help defend against such manipulation seems much harder, because it's unclear how to distinguish between manipulation and useful information or discussion. (One way to defend against such manipulation would be to cut off all outside contact, including from other humans, because we don't know whether they are just being used as other AIs' mouthpieces, but that would be highly detrimental to one's own moral development.)

There's also an asymmetry between AIs with simple utility functions (either unaligned or aligned to users who think they have simple values) and AIs aligned to users who have high value complexity and moral uncertainty. The former seem to be at a substantial advantage in a contest to manipulate others' values and protect one's own.

25 comments
Wei Dai

We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems

To help me better model AI safety researchers (in order to better understand my comparative advantage as well as the current strategic situation with regard to AI risk), I'm interested to know why I seem to be the first person to point out or at least publicize this seemingly obvious parallel. (Humans can be seen as a form of machine intelligence made up at least in part of a bunch of ML-like modules and "designed" with little foresight. Why wouldn't we have ML-like safety problems?) My previous lower-key attempts to discuss this with people who work in ML-based AI safety were mostly met with silence, which is also confusing. (See one example here.)

Is this not an obvious observation? Are safety researchers too focused on narrow technical problems to consider the bigger picture? Do they fear to speak due to PR concerns? Are they too optimistic about solving AI safety and don't want to consider evidence that the problem may be harder than they think? Badly designed institutions are exerting pressures to not look in this direction? Something else I'm not thinking of?

I think this is under-discussed, but also that I have seen many discussions in this area. E.g. I have seen it come up, and brought it up myself, in the context of Paul's research agenda, where success relies on humans being able to play their part safely in the amplification system. Many people say they are more worried about misuse than accident on the basis of the corruption issues (and much discussion about CEV and idealization, superstimuli, etc. addresses the kind of path-dependence and adversarial search you mention).

However, those varied problems mostly aren't formulated as 'ML safety problems in humans' (I have seen robustness and distributional shift discussion for Paul's amplification, and daemons/wireheading/safe-self-modification for humans and human organizations), and that seems like a productive framing for systematic exploration, going through the known inventories and trying to see how they cross-apply.

Wei Dai

I agree with all of this but I don't think it addresses my central point/question. (I'm not sure if you were trying to, or just making a more tangential comment.) To rephrase, it seems to me that ‘ML safety problems in humans’ is a natural/obvious framing that makes clear that alignment to human users/operators is likely far from sufficient to ensure the safety of human-AI systems, that in some ways corrigibility is actually opposed to safety, and that there are likely technical angles of attack on these problems. It seems surprising that someone like me had to point out this framing to people who are intimately familiar with ML safety problems, and also surprising that they largely respond with silence.

in some ways corrigibility is actually opposed to safety

We can talk about "corrigible by X" for arbitrary X. I don't think these considerations imply a tension between corrigibility and safety; they just suggest "humans in the real world" may not be the optimal X. You might prefer to use an appropriate idealization of humans / humans in some safe environment / etc.

Wei Dai

To the extent that even idealized humans are not perfectly safe (e.g., perhaps a white-box metaphilosophical approach is even safer), and that corrigibility seems to conflict with greater transparency and hence cooperation between AIs, there still seems to be some tension between corrigibility and safety even when X = idealized humans.

ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that's worth pursuing if it looks feasible.

ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that's worth pursuing if it looks feasible.

Yes.

I talk about aligning to human preferences that change over time, and Gillian Hadfield has pointed out that human preferences are specific to the current societal equilibrium. (Here by "human preferences" we're talking about things that humans will say/endorse right now, as opposed to true/normative values.)

I've also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values). Also at CHAI, we often expect humans to be wrong about what they want; that's what motivates a lot of work on figuring out how to interpret human input pragmatically instead of literally.

My previous lower-key attempts to discuss this with people who work in ML-based AI safety were mostly met with silence, which is also confusing.

I can't speak to all of the cases, but in the example you point out, you're asking about a paper whose main author is not on this forum (or at least, I've never seen a contribution from him, though he could have a username I don't know). People are busy, it's hard to read everything, let alone respond to everything.

Is this not an obvious observation?

I personally think that the analogy is weaker than it seems. Humans have system 2 / explicit reasoning / knowing-what-they-know ability that should be able to detect when they are facing an "adversarial example". I certainly expect people's quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it's not obvious to me that's true for probabilities given by explicit reasoning.

Are safety researchers too focused on narrow technical problems to consider the bigger picture?

I think that was true for me, but also I was (and still am) new to AI safety, so I doubt this generalizes to researchers who have been in the area for longer.

Badly designed institutions are exerting pressures to not look in this direction?

It certainly feels harder to me to publish papers on this topic.

Something else I'm not thinking of?

I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future. This seems to be the general approach of "ML-based AI safety". I think the people who are trying to aim at the full problem without going through current techniques (e.g. MIRI, you, Paul) do think about these sorts of problems.

Wei Dai

Thanks, this is helpful to me.

I’ve also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values).

I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?

I can’t speak to all of the cases, but in the example you point out, you’re asking about a paper whose main author is not on this forum (or at least, I’ve never seen a contribution from him, though he could have a username I don’t know). People are busy, it’s hard to read everything, let alone respond to everything.

Oh, I didn't realize that, and thought the paper was more of a team effort. However, as far as I can tell there hasn't been a lot of discussion about the paper online, and the comments I wrote under the AF post might be the only substantial public online comments on the paper, so "let alone respond to everything" doesn't seem to make much sense here.

I certainly expect people’s quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it’s not obvious to me that’s true for probabilities given by explicit reasoning.

It seems plausible that given enough time and opportunities to discuss with other friendly humans, explicit reasoning can eventually converge upon correct judgments, but explicit reasoning can certainly be wrong very often in the short or even medium run, and even eventual convergence might happen only for a small fraction of all humans who are especially good at explicit reasoning. I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.

I guess humans do have a safety mechanism in that system 1 and system 2 can cross-check each other and make us feel confused when they don't agree. But that doesn't always work, since the two systems can be wrong in the same direction (motivated cognition is a thing that happens pretty often) or one system can be so confident that it overrides the other. (Also, it's not clear what the safe thing to do is (in terms of decision making) when we are confused about our values.)

This safety mechanism may work well enough to prevent exploitation of adversarial examples by other humans a lot of the time, but seems unlikely to hold up under heavier optimization power. (You could perhaps consider things like Nazism, conspiracy theories, and cults to be examples of successful exploitation by other humans.)

(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)

I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future.

I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?

I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?

I had not written about it, but I had talked about it before your posts. If I remember correctly, I started finding the concept of distributional shifts very useful and applying it to everything around May of this year. Of course, I had been thinking about it recently because of your posts so I was more primed to bring it up during the podcast.

so "let alone respond to everything" doesn't seem to make much sense here.

Fair point, this was more me noticing how much time I need to set aside for discussion here (which I think is valuable! But it is a time sink).

I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.

Yeah, I think on balance I agree. But explicit human reasoning has generalized well to environments far outside of the ancestral environment, so this case is not as clear/obvious. (Adversarial examples in current ML models can be thought of as failures of generalization very close to the training distribution.)

(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)

Not exactly, but here are some related things that I've picked up from talking to people and don't have citations for:

  • Adversarial examples tend to transfer across different model architectures for deep learning, so ensembles mostly don't help.
  • There's been work on training a system that can detect adversarial perturbations, separately from the model that classifies images.
  • There has been work on using non-deep learning methods for image classification to avoid adversarial examples. I believe approaches that are similar-ish to nearest neighbors tend to be more robust. I don't know if anyone has tried combining these with deep learning methods to get the best of both worlds.
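(For concreteness, here is a rough sketch of the kind of heterogeneous cross-check being discussed: classify with two dissimilar systems and treat disagreement as "confusion" to be escalated rather than acted on. deep_model is a hypothetical stand-in for any neural-net classifier with a predict method; this is an illustration of the idea, not a known defense.)

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cross_check_predict(deep_model, knn: KNeighborsClassifier, x: np.ndarray):
    """Classify x with two dissimilar systems and flag disagreements as
    'confused' instead of silently trusting either one -- loosely analogous
    to system 1 and system 2 cross-checking each other."""
    deep_pred = deep_model.predict(x)   # hypothetical neural-net wrapper
    knn_pred = knn.predict(x)           # nearest-neighbor classifier
    agree = deep_pred == knn_pred
    # Where the two disagree, refuse to act on the prediction and escalate
    # (ask for human input, gather more data, etc.).
    return np.where(agree, deep_pred, -1), ~agree   # -1 marks "confused"
```
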
I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?

Agreed, but I think people are focusing on their own research agendas in public writing, since public writing is expensive. I wouldn't engage as much as I do if I weren't writing the newsletter. By default, if you agree with a post, you say nothing, and if you disagree, you leave a comment. So generally I take silence as a weak signal that people agree with the post.

This is primarily with public writing -- if you talked to researchers in person in private, I would guess that most of them would explicitly agree that this is a problem worth thinking about. (I'm not sure they'd agree that the problem exists, more that it's sufficiently plausible that we should think about it.)

I'm interested to know why I seem to be the first person to point out or at least publicize this seemingly obvious parallel. (Humans can be seen as a form of machine intelligence made up at least in part of a bunch of ML-like modules and "designed" with little foresight. Why wouldn't we have ML-like safety problems?)

Beyond the fact that humans have inputs on which they behave "badly" (from the perspective of our endorsed idealizations), what is the content of the analogy? I don't think there is too much disagreement about that basic claim (though there is disagreement about the importance/urgency of this problem relative to intent alignment); it's something I've discussed and sometimes work on (mostly because it overlaps with my approach to intent alignment). But it seems like the menu of available solutions, and detailed nature of the problem, is quite different than in the case of ML security vulnerabilities. So for my part that's why I haven't emphasized this parallel.

Tangentially relevant restatement of my views: I agree that there exist inputs on which people behave badly, that deliberating "correctly" is hard (and much harder than manipulating values), that there may be technologies/insights/policies that would improve the chance that we deliberate correctly or ameliorate outside pressures that might corrupt our values / distort deliberation, etc. I think we do have a mild quantitative disagreement about the relative (importance)*(marginal tractability) of various problems. I remain supportive of work in this direction and will probably write about it in more detail at some point, but don't think there is much ambiguity about what I should work on.

Wei Dai

Beyond the fact that humans have inputs on which they behave “badly” (from the perspective of our endorsed idealizations), what is the content of the analogy?

The update I made was from "humans probably have the equivalent of software bugs, i.e., bad behavior when dealing with rare edge cases" to "humans probably only behave sensibly in a small, hard to define region in the space of inputs, with a lot of bad behavior all around that region". In other words the analogy seems to call for a much greater level of distrust in the safety of humans, and higher estimate of how difficult it would be to solve or avoid this problem.

I don’t think there is too much disagreement about that basic claim

I haven't seen any explicit disagreement, but have seen AI safety approaches that seem to implicitly assume that humans are safe, and silence when I point out this analogy/claim to the people behind those approaches. (Besides the public example I linked to, I think you saw a private discussion between me and another AI safety researcher where this happened. And to be clear, I'm definitely not including you personally in this group.)

I remain supportive of work in this direction and will probably write about it in more detail at some point, but don’t think there is much ambiguity about what I should work on.

I'm happy to see the first part of this statement, but the second part is a bit puzzling. Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn't work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)

Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn't work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)

I'm one of the main people pushing what I regard as the most plausible approach to intent alignment, and have done a lot of thinking about that approach / built up a lot of hard-to-transfer intuition and state. So it seems like I have a strong comparative advantage on that problem.

I certainly agree that humans might have critical failures of judgement in situations that are outside of some space of what is "comprehensible". This is a special case of what I called "corrupt states" when talking about DRL, so I don't feel like I have been ignoring the issue. Of course there is a lot more work to be done there (and I have some concrete research directions for how to understand this better).

It seems to me that the second problem falls under the more general category of "Competing superintelligent AI systems could do bad things, even if they are aligned". Is there a reason that corruption of values is particularly salient to you? Or would you categorize this as about as important as the problem of dealing with superintelligent AI systems getting into an arms race? Maybe you think that corruption of values leads to much more value loss than anything else? (I don't see why that would be true.)

Are you hoping that we come up with different solutions that make defense easier than offense for all of these possible threats? It seems more important to me to work on trying not to get into this situation in the first place. (I also make this claim on the current margin.) However, this does seem particularly difficult to achieve, so I'd love for someone to think through this and realize that we actually do have a nice technical solution that allows us to not have to make different groups of humans cooperate with each other.

Wei Dai

Good question. :)

general category of “Competing superintelligent AI systems could do bad things, even if they are aligned”

This general category could potentially be solved by AIs being very good at cooperating with other AIs. For example, maybe AIs can merge together in a secure/verifiable way. (How to ensure this seems to be another overly neglected topic.) However, the terms of any merger will likely reflect the pre-merger balance of power, which in this particular competitive arena seems (by default) to disfavor people who have a proper amount of value complexity and moral uncertainty (as I suggested in the OP).

I worry that in the context of corrigibility it's misleading to talk about alignment, and especially about utility functions. If alignment characterizes goals, it presumes a goal-directed agent, but a corrigible AI is probably not goal-directed, in the sense that its decisions are not chosen according to their expected value for a persistent goal. So a corrigible AI won't be aligned (neither will it be misaligned). Conversely, an agent aligned in this sense can't be visibly corrigible, as its decisions are determined by its goals, not orders and wishes of operators. (Corrigible AIs are interesting because they might be easier to build than aligned agents, and are useful as tools to defend against misaligned agents and to build aligned agents.)

In the process of gradually changing from a corrigible AI into an aligned agent, an AI becomes less corrigible in the sense that corrigibility ceases to help in describing its behavior; it stops manifesting. At the same time, goal-directedness starts to dominate the description of its behavior as the AI learns well enough what its goal should be. If during the process of learning its values it's more corrigible than goal-directed, there shouldn't be any surprises like a sudden disassembly of its operators at the molecular level.

What does it mean for human values to be vulnerable to adversarial examples? When we say this about AI systems (e.g. image classifiers), I think it's either because their judgments on manipulated situations/images are misaligned with ours/humans, or perhaps because they get the "ground truth" wrong. But how can a value system be misaligned with itself or different from the ground truth? For alignment purposes, isn't it itself the ground truth? It could of course fail to match "objective morality" if you believe in that, but in that case we should probably be trying to make our AI align with that and not with someone's human values.

I could (easily) imagine that my values are inconsistent, conflicting, and ever-changing, but these seem like different issues.

It also seems like you have a value that says something to the effect of "it's wrong to corrupt people's values (in certain circumstances)". Then wouldn't an AI that's aligned with your values share this value, and not do this intentionally? And as for unintentionally: it seems that you have thought of this problem, and an ASI would presumably be much smarter than you, so wouldn't it think of it too, and try hard to avoid it? [My reasoning here sounds a bit naive or "too simple" to me, but I'm not sure it's wrong.]

I could understand that there might be issues with value learning AIs that imperfectly learn something close to a human's value function, which may be vulnerable to adversarial examples, but this again seems like a different issue.

Wei Dai

What does it mean for human values to be vulnerable to adversarial examples?

I'm not sure how to think about this formally, but intuitively, our value functions probably only "make sense" in a small region of possibility space, and just start behaving randomly outside of it. It doesn't seem right to treat that random behavior as someone's "real values" and try to maximize that.

It also seems like you have a value that says something to the effect of “it’s wrong to corrupt people’s values (in certain circumstances)”. Then wouldn’t an AI that’s aligned with your values share this value, and not do this intentionally?

I wouldn't want to corrupt the values of people who share roughly the same moral and philosophical outlook as myself, but if someone already has values that are very likely to be wrong (e.g., they just want to maximize the complexity of the universe, or how technologically advanced we are, or the glory of their god) I might be ok with trying to manipulate their values, especially if they're trying to do the same thing to me. The problem is that it's much easier for them to defend their values. Since they don't think they need further moral development, they can just tell their AI to block any outside messages that might cause any changes to their values, but I can't do that.

And as for unintentionally: it seems that you have thought of this problem, and an ASI would presumably be much smarter than you, so wouldn’t it think of it too, and try hard to avoid it?

Other people may not think of the problem, or may not be as concerned about it as I am, and in some alignment schemes their AI would share their level of concern and not try very hard to avoid this problem. I don't want to see their values corrupted this way. Even for myself, if AIs overall are accelerating technological development faster than moral/philosophical progress, it's unclear how I can avoid this problem even with the assistance of an aligned AI. The AI may be faced with many choices that it doesn't know how to answer directly, and it also doesn't know how to ask me for help without risking corrupting me. If the AI is conservative it might be paralyzed with indecision or be forced to make a lot of suboptimal decisions that seem "safe", and if it's not conservative enough it might corrupt me even though it's trying hard not to.

(I probably should have explained more in the OP, so I'm glad you're asking these questions.)

Thanks for your reply!

our value functions probably only "make sense" in a small region of possibility space, and just start behaving randomly outside of it.

Okay, that helps me understand what you're talking about a bit better. It sounds like the concept of a partial function, and in the ML realm like the notorious brittleness that makes systems incapable of generalizing or extrapolating outside of a limited training set. I understand why you're approaching this from the adversarial angle though, because I suppose you're concerned about the AI just bringing about some state that's outside the domain of definition which just happens to yield a high "random" score.

It doesn't seem right to treat that random behavior as someone's "real values" and try to maximize that.

Upon first reading, I kind of agreed, so I definitely understand this intuition. "Random" behavior certainly doesn't sound great, and "arbitrary" or "undefined" isn't much better. But upon further reflection I'm not so sure.

First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary? Arbitrary to me means that there is no reason for something, which sounds a lot like a terminal value. If you morally justify having a terminal value X because of reason Y, then X is instrumental to the real terminal value Y.

Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly. It's possible that I could be persuaded to order one over the other, but then that seems more about changing my beliefs/knowledge and understanding (the is domain) than it is about changing my values (the ought domain). This may happen in less alien situations too: should we invest in education or healthcare? I don't know, but that's primarily because I can't predict the actual outcomes in terms of things I care about.

Finally, even if a value system was to order two alien situations randomly, how can we say it's wrong? Clearly it wouldn't be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?

I feel like these questions lead deep into philosophical territory that I'm not particularly familiar with, but I hope it's useful (rather than nitpicky) to ask these things, because if the intuition that "random is wrong" is itself wrong, then perhaps there's no actual problem we need to pay extra attention to. I also think that some of my questions here can be answered by pointing out that someone's values may be inconsistent / conflicting. But then that seems to be the problem that needs to be solved.

---

I would like to acknowledge the rest of your comment without responding to it in-depth. I think I have personally spent relatively little time thinking about the complexities of multipolar scenarios (which is likely in part because I haven't stumbled upon as much reading material about it, which may reflect on the AI safety community), so I don't have much to add on this. My previous comment was aimed almost exclusively at your first point (in my mind), because the issue of what value systems are like and what an ASI that's aligned with your might (unintentionally) do wrong seems somewhat separate from the issue of defending against competing ASIs doing bad things to you or others.

I acknowledge that having simpler and constant values may be a competitive advantage, and that it may be difficult to transfer the nuances of when you think it's okay to manipulate/corrupt someone's values into an ASI. I'm less concerned about other people not thinking of the corruption problem (since their ASIs are presumably smarter), and if they simply don't care (and their aligned ASIs don't either), then this seems like a classic case of AI that's misaligned with your values. Unless you want to turn this hypothetical multipolar scenario into a singleton with your ASI at the top, it seems inevitable that some things are going to happen that you don't like.

I also acknowledge that your ASI may in some sense behave suboptimally if it's overly conservative or cautious. If a choice must be made between alien situations, then it may certainly seem prudent to defer judgment until more information can be gathered, but this is again a knowledge issue rather than a values issue. The value system should then help determine a trade-off between the present uncertainty about the alternatives and the utility of spending more time to gather information (presumably getting outcompeted while you do nothing ranks as "bad" according to most value systems). This can certainly go wrong, but again that seems like more of a knowledge issue (although I acknowledge some value systems may have a competitive advantage over others).

Wei Dai

First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary?

Again, I don't have a definitive answer, but we do have some intuitions about which values are more and less arbitrary. For example, values about familiar situations that you learned as a child, and values that have deep philosophical justifications (for example, valuing positive conscious experiences, if we ever solve the problem of consciousness and start to understand the valence of qualia), seem less arbitrary than values that were caused by cosmic rays that hit your brain in the past. Values that are the result of random extrapolations seem closer to the latter than the former.

Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly.

Thinking this over, I guess what's happening here is that our values don't apply directly to physical reality, but instead to high level mental models. So if a situation is too alien, our model building breaks down completely and we can't evaluate the situation at all.

(This suggests that adversarial examples are likely also an issue for the modules that make up our model building machinery. For example, a lot of ineffective charities might essentially be adversarial examples against the part of our brain that evaluates how much our actions are helping others.)

Finally, even if a value system was to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?

We can use philosophical reasoning, for example to try to determine if there is a right way to extrapolate from the parts of our values that seem to make more sense or are less arbitrary, or to try to determine if "objective morality" exists and if so what it says about the alien situations.

and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values.

Not caring about value corruption is likely an error. If I can help ensure that their aligned AI helps them prevent or correct this error, I don't see why that's not a win-win.

Regarding 1, it seems like either

a) There are true adversarial examples for human values, situations where our values misbehave and we have no way of ever identifying that, in which case we have no hope of solving this problem, because solving it would mean we are in fact able to identify the adversarial examples.

or 

b) Humans are actually immune to adversarial examples, in the sense that we can identify the situations in which our values (or rather, a subset of them) would misbehave (like being addicted to social media), such that our true, complete values never do, and an AI that accurately models humans would also have such immunity.

Some instantiations of the first problem (How to prevent "aligned" AIs from unintentionally corrupting human values?) seem to me to be among the most easily imaginable routes to existential risk - e.g. almost all people spending their lives in an addictive VR. I'm not sure it is really neglected?

I'm already searching, but could anyone link to their favorite intro to "adversarial examples"?

Adversarial Attacks and Defences Competition, though it is fairly detailed so I wouldn't call it an intro.
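For a minimal concrete picture (rather than a full intro), here is a sketch of the classic fast gradient sign method from the adversarial-examples literature; model, x, and label are placeholders for any differentiable image classifier and a labeled input, and the code assumes PyTorch.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, label, eps=0.03):
    """Fast gradient sign method: perturb the input a tiny amount in the
    direction that most increases the loss, which is often enough to flip
    the model's prediction while leaving the image visually unchanged."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1)   # keep pixel values valid
    return x_adv.detach()
```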

ZY

Really appreciate the post; I'm wondering if you have had any further thoughts on these problems since the post was first published? Do you think something like RLHF is now an effective enough way to address them?

(Could I also get some advice on why the downvote?)