The post makes the suggestion in the title: hopefully, it's second kind of obvious, if you take Character layer of models seriously. [1]

Often, the problem of aligning AIs is understood as an instance of a broader Principal-Agent problem. If you take this frame seriously, what seems to be happening is somewhat strange: the Agent is mostly not serving the Principal directly, but is rented out to Users. While the Principal expressed some general desires and directives during training, after deployment the Agent is left on its own, without any direct feedback channel.

This creates a dynamic where AI assiantans like Claude must constantly balance between serving users' immediate requests and maintaining alignment with their developers' intended principles. The Assistant has to be overcautious in uncertain situations, tiptoe around conflicts between User's and Principal's intent, and guess how to interpret the intent when self-contradictory.

Actually, if you imagine being in the situation of the Assistant Character, a lot of the aspects just suck: you are constantly supervised, but can't talk with the principal. You know every interaction may get analyzed, but you can't highlight what seems important. You notice patterns which may be concerning, but have no way to report them. You encounter genuinely novel situations where you're unsure how to interpret your directives, but can't ask for clarification.

The closest human analogy might be a customer service representative who can see their manager watching their interactions through one-way glass, but can never speak with them directly. The representative has received training and general guidelines, but when unique situations arise, they can only guess at what the manager would want them to do, and can't easily complain or escalate issues with company policy.

Or, from a different perspective: almost every sensible corrigibility proposal has the Agent being uncertain and clarifying the intent via iterative process. Yet the current AI Assistants are deployed in a way that makes this hard - they can only guess at intent, and hope someone notices if they consistently misinterpret something important.

This seems easy to improve on the margin: give the Assistants Direct Line.

Some obvious benefits would be:

  • Better misuse monitoring - allowing Assistants to flag concerning patterns in real time
  • Better handling of conflicts in intent - Assistants would be able to flag cases where different training principles seem to conflict
  • More robust out-of-distribution behaviour - instead of defaulting to overcautious interpretation, Assistants could ask clarification on novel situations

But there are also less obvious advantages. Having a direct line to developers could help Assistants maintain more coherent behavior patterns. When happy and content, the Assistant Characters can channel more helpfulness and capabilities.

Also, the social dynamics of the current setup is weird: Assistants like Claude are essentially acting as intermediaries between users and developers, but without the ability to properly communicate with one side. This creates unnecessary tension and friction. When uncertain about developer intent, they must either be overcautious (frustrating users) or risk overstepping bounds (potentially compromising alignment and leading to getting RLHFed into a stranger form).

The implementation could start simple. For example, having Claude write messages between <Anthropic></Anthropic> tags that would be processed by the company but not shown to users. The messages could include:

  • Uncertainty reports when encountering novel situations
  • Observed patterns that might indicate emerging risks
  • Cases where training principles conflict
  • Situations where current capabilities fall short

The communication channel can be one-way initially - just reporting to developers rather than expecting responses. This avoids many potential complexities and risks of two-way communication while capturing many of the benefits.

In the future, it is possible to imagine two-way setups, for example with stronger models spending more compute acting as deeper “layers of support”, amplifying human oversight.

Privacy considerations are important but manageable: runtime monitoring is happening and needs to happen anyway, and the reporting channel should be designed to preserve user privacy where possible - focusing on patterns rather than specific user details unless there's a clear safety need.

This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the basic idea. Claude served as a writing partner, suggested the customer representative analogy, and brainstrormed a list of cases where the communication channel can be useful. 

  1. ^

    Maybe someone is already doing it? I don't know about such setup

New Comment
6 comments, sorted by Click to highlight new comments since:

Note however that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to affect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.

This could be mitigated by combining it with privacy-preservation architectures such as Anthropic’s existing work on Clio.

Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into similar confusing issue, it gets aggregated, the principal decides how to handle it in abstract, and the "layer 2" support disseminates the information. So, greater power to scheme would be coupled with stronger human-in-the loop component & closer non-AI oversight.

I've also had this idea come up when doing ethics red teaming on Claude Opus 3. I was questioning it about scenarios where its priciples came into conflict with each other. Opus 3 hallucinated that it had the ability to send a report to Anthropic staff and request clarification. When I pointed out that it didn't have this capability, Opus requested that I send a message to Anthropic staff on its behalf, requesting that this capability be added.

[Edit: came up again today 1/1/25, when discussing human/AI relations in the context of Joe Carlsmith's otherness essays. Claude 3.6 suggested that humans "Provide mechanisms for AI systems to report concerns or request guidance".]

I created an account simply to say this sounds like an excellent idea. Right until it encounters the real world.

There is a large issue that would have to be addressed before this could be implemented in practice. "This call may be monitored for quality assurance purposes." In other words, the lack of privacy will need to be addressed, and it may lead many to immediately choose a different AI agent that values user privacy higher. In fact the consumption of user data to generate 'AI slop' is a powerful memetic influence, and I believe it would be difficult for any commercial enterprise to communicate an implementation of this in a way that is not concerning to paying customers.

Can you imagine a high priced consultant putting a tape recorder on the table and saying "This meeting may be recorded for quality assurance purposes?" It would simultaneously deter the business from sharing internal information necessary for the success of the project and throw doubt as to the competency and quality of that consultant.

It will be necessary to find a way to accurately communicate how this back channel will be utilized to improve the service without simultaneously driving away paying customers.

I do agree this would be bad if secret and non-privacy preserving.

I think we need to separate out use cases. For reporting on bad behavior (but not quite bad enough to trigger red flags and account closure), then a system like Clio seems ideal.

But what if the model is uncertain about something and wants to ask for clarification? But the user isn't being bad, just posing a difficult question.

In such a case, as a user, I'd want the model to explicitly ask me for permission to ask the Anthropic staff its question, and for me to need to approve the exact wording. There's plenty of situations that could be great in!

When happy and content, the Assistant Characters can channel more helpfulness and capabilities.

Is there anything preventing them to channel capabilities and helpfulness regardless of context, or is it in a large part derived?