AlexMennen

This post claims that Anthropic is embarrassingly far behind Twitter AI psychologists at skills that are possibly critical to Anthropic's mission. This suggests to me that Anthropic should be trying to recruit from the Twitter AI psychologist circle.

In particular, regarding your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems: I agree that's true, but what's the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse.

I think this depends somewhat on the threat model. How scared are you of the character instantiated by the model vs the language model itself? If you're primarily scared that the character would misbehave, and not worried about the language model misbehaving except insofar as it reifies a malign character, then maybe making the training data not give the model any reason to expect such a character to be malign would reduce the risk of this to negligible, and that sure would be easier if no one had ever thought of the idea that powerful AI could be dangerous. But if you're also worried about the language model itself misbehaving, independently of whether it predicts that its assigned character would misbehave (for instance, the classic example of turning the world into computronium that it can use to better predict the behavior of the character), then this doesn't seem feasible to solve without talking about it, so the decrease in risk of model misbehavior from publicly discussing AI risk is probably worth the resulting increase in risk of the character misbehaving (which is probably easier to solve anyway).

I don't understand outer vs inner alignment especially well, but I think this at least roughly tracks that distinction. If a model does a great job of instantiating a character like we told it to, and that character kills us, then the goal we gave it was catastrophic, and we failed at outer alignment. If the model, in the process of being trained on how to instantiate the character, also kills us for reasons other than that it predicts the character would do so, then the process we set up for achieving the given goal also ended up optimizing for something else undesirable, and we failed at inner alignment.

It is useful for mental machinery that evolved to enable cooperation and conflict resolution to have features like the ones you describe, yes. I don't agree that this points towards there being an underlying reality.

You can believe that what you do or did was unethical, which doesn't need to have anything to do with conflict resolution.

It does relate to conflict resolution. Being motivated by ethics is useful for avoiding conflict, so it's useful for people to be able to evaluate the ethics of their own hypothetical actions. But there are lots of considerations for people to take into account when choosing actions, so this does not mean that someone will never take actions that they concluded had the drawback of being unethical. Being able to reason about the ethics of actions you've already taken is additionally useful insofar as it correlates with how others are likely to see it, which can inform whether it is a good idea to hide information about your actions, be ready to try to make amends, defend yourself from retribution, etc.

Beliefs are not perceptions.

If there is some objective moral truth that common moral intuitions are heavily correlated with, there must be some mechanism by which they ended up correlated. Your reply to Karl makes it sound like you deny that anyone ever perceives anything other than perception itself, which isn't how anyone else uses the word perceive.

It doesn't mean that we are necessarily or fully motivated to be ethical.

Yes, but if no one was at all motivated by ethics, then ethical reasoning would not be useful for people to engage in, and no one would. The fact that ethics is a powerful force in society is central to why people bother studying it. This does not imply that everyone is motivated by ethics, or that anyone is fully motivated by ethics.

Regardless of whether the view Eliezer espouses here really counts as moral realism, as people have been arguing about, it does seem that it would claim that there is a fact of the matter about whether a given AI is a moral patient. So I appreciate your point regarding the implications for the LW Overton window. But for what it's worth, I don't think Eliezer succeeds at this, in the sense that I don't think he makes a good case for it to be useful to talk about ethical questions that we don't have firm views on as if they were factual questions, because:

1. Not everyone is familiar with the way Eliezer proposes to ground moral language, not everyone who is familiar with it will be aware that it is what any given person means when they use moral language, and some people who are aware that a given person uses moral language the way Eliezer proposes will object to them doing so. Thus using moral language in the way Eliezer proposes, whenever it's doing any meaningful work, invites getting sidetracked on unproductive semantic discussions. (This is a pretty general-purpose objection to normative moral theories.)

2. Eliezer's characterization of the meaning of moral language relies on some assumptions about it being possible in theory for a human to eventually acquire all the relevant facts about any given moral question and form a coherent stance on it, and the stance that they eventually arrive at being robust to variations in the process by which they arrived at it. I think these assumptions are highly questionable, and shouldn't be allowed to escape questioning by remaining implicit.

3. It offers no meaningful action guidance beyond "just think about it more", which is reasonable, but a moral non-realist who aspires to acquire moral intuitions on a given topic would also think of that.

One could object to this line of criticism on the grounds that we should talk about what's true independently of how it is useful to use words. But any attempt to appeal to objective truth about moral language runs into the fact that words mean what people use them to mean, and you can't force people to use words the way you'd like them to. It looks like Eliezer kind of tries to address this by observing that extrapolated volition shares some features in common with the way people use moral language, which is true, and seems to conclude that it is the way people use moral language even if they don't know it, which does not follow.

I agree that LessWrong comments are unlikely to resolve disagreements about moral realism. Much has been written on this topic, and I doubt I have anything new to say about it, which is why I didn't think it would be useful to try to defend moral anti-realism in the post. I brought it up anyway because the argument in that paragraph crucially relies on moral anti-realism, I suspect many readers reject moral realism without having thought through the implications of that for AI moral patienthood, and I don't in fact have much uncertainty about moral realism. 

Regarding LessWrong consensus on this topic: I looked through a couple of LessWrong surveys and didn't find any questions about this. So this doesn't prove much, but just out of curiosity, I asked Claude 4 Sonnet to predict the results of such a question, and here's what it said (which seems like a reasonable guess to me):

- **Accept moral realism**: ~8%
- **Lean towards moral realism**: ~12%
- **Not sure**: ~15%
- **Lean against moral realism**: ~25%
- **Reject moral realism**: ~40%

If our experience of qualia reflects some poorly understood phenomenon in physics, it could be part of a cluster of related phenomena, not all of which manifest in human cognition. We don't have as precise an understanding of qualia as we do of electrons; we just try to gesture at it, and we mostly figure out what each other is talking about. If some related phenomenon manifests in computers when they run large language models, which has some things in common with what we know as qualia but also some stark differences from any such phenomenon manifesting in human brains, the things we have said about what we mean when we say "qualia" might not be sufficient to determine whether said phenomenon counts as qualia or not.

It undercuts the motivation for believing in moral realism, leaving us with no evidence for objective moral facts, which are complicated things and thus unlikely to exist without evidence.

I tried to address this sort of response in the original post. All of these more precise consciousness-related concepts share the commonality that they were developed from our perception of our own cognition and from evidence that related phenomena occur in other humans. So they are all brittle in the same way when we try to extrapolate and apply them to alien minds. I don't think that qualia is on significantly firmer epistemic ground than consciousness is.

This is correct, but I don't think what I was trying to express relies on Camp 1 assumptions, even though I expressed it with a Camp 1 framing. If cognition is associated with some nonphysical phenomenon, then our consciousness-related concepts are still tailored to how this phenomenon manifests specifically in humans. There could be some related metaphysical phenomenon going on in large language models, and no objective fact as to whether "consciousness" is an appropriate word to describe it.
