A different way of saying this is that AI safety problems don't suffer from being unknown, but from being "uncompressible" under the audience's ontology and incentives. People not already deep in the project's semantic space find it hard to understand the consequences and follow them to their conclusion as they relate to their own motivations.
The core intervention must thus be to create artifacts and processes, such as training, that reduce the domain's illegibility and translate its consequences into the audience's incentives. To do so, you probably also need to diagnose the problem correctly, which means literally speaking to users or stakeholders.
Only if you understand the audience and how it collectively acts as a group, market, mob, or whatever, can you shift the Overton window and change how the audience responds to the project's results. That's the difference between solving a technical subproblem and making it meaningful to everyone.
Hm. It all makes sense to me, but it feels like you are adding more gears to your model as you go. It is clear that you have thought more about this than anybody else and can provide explanations that I cannot fully wrap my mind around, and I'm unable to tell whether this is all coming from a consistent model or is more something plausible you suggest could work.
But even if I don't understand all of your gears, what you explained allows us to make some testable predictions:
Admiration for a rival or enemy with no mixed states
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to ... I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant.
That should be testable with a 2x2 brow/cheek EMG design in which cues for one target person are rapidly paired with alternating micro-prompts that highlight admirable vs. blameworthy aspects. Cueing "admire" should produce a rise in cheek activity and/or a drop in brow activity; cueing "condemn" should flip that pattern, with little simultaneous activation of both.
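A minimal analysis sketch of what that prediction looks like in data (the effect sizes and subject count are made up purely for illustration; the same 2x2 template would carry over to the envy and guilt predictions below):

```python
# Simulated (made-up) per-subject mean EMG z-scores under the predicted pattern:
# "admire" cues -> cheek up / brow down, "condemn" cues -> the reverse.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40  # hypothetical number of participants

admire_cheek  = rng.normal(+0.5, 1.0, n)
admire_brow   = rng.normal(-0.3, 1.0, n)
condemn_cheek = rng.normal(-0.3, 1.0, n)
condemn_brow  = rng.normal(+0.5, 1.0, n)

# Per-subject crossover contrast: (admire - condemn) for cheek minus the same for brow.
interaction = (admire_cheek - condemn_cheek) - (admire_brow - condemn_brow)
res = stats.ttest_1samp(interaction, 0.0)
print(f"crossover interaction: t = {res.statistic:.2f}, p = {res.pvalue:.4f}")

# "No mixed states" check: under the "admire" cue, cheek and brow should rarely both be elevated.
both_up = np.mean((admire_cheek > 0.5) & (admire_brow > 0.5))
print(f"fraction of subjects with both sites elevated under 'admire': {both_up:.2f}")
```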
Envy clearly splits into frustration and craving
Envy should split into frustration and craving signals in an experiment where a subject doesn't get something either because of a specific person's choice or through pure bad luck (with systemic scarcity as a control). Frustration should then always show up, but social hostility should spike only in the first case. Which seems very likely to me.
Private guilt splits into discovery and appeasement
If private guilt decomposes into discovery and appeasement, a 2x2 experiment crossing discoverability (low vs. impossible) with whether the victim can be appeased should show it: reparative motivation should be strongest when either discovery is plausible or appeasement is possible, while "Dobby-style" self-punishment should occur especially when appeasement is conceptually relevant but blocked.
Aggregation
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
This would predict that stage fright should scale with gaze cues, i.e., with how many audience members are visibly looking at you.
This should be testable with real-life interventions (people in the audience could wear masks, look away, or do other things) or with VR experiments, though not cheaply.
I'm not sure about good experiments for the other cases.
Did you see my post Parameters of Metacognition - The Anesthesia Patient? While it doesn't address consciousness, it does discuss aspects of awareness. Do you agree that at least awareness has non-discrete properties?
Mesa-optimizers are the easiest case of detectable agency in an AI. There are more dangerous cases. One is distributed agency, where the agent is spread across tooling, models, and possibly humans or other external systems, and where the gradient driving its evolution is the combination of local and overall incentives.
Mesa-optimization was introduced in Risks from Learned Optimization: Introduction, and, probably because it was the first type of learned optimization to be described, it has driven much of the conversation. That framing makes some implicit assumptions: that learned optimization is compact, in the sense of being a sub-system of the learning system; coherent, due to the simple incentive structure; and a stable pattern that can be inspected in the AI system (hence Mechinterp).
These assumptions do not hold in the more general case, where agency from learned optimization may develop in more complex setups, such as the current generation of LLM agents, which consist of an LLM, a scaffolding, and tools, including memory. In such a system, memetic evolution of the patterns in memory or in external tools is part of the learning dynamics of the overall system, and we can no longer go by the incentive gradients (benchmark optimization) of the LLM alone. We need to interpret the system as a whole.
Just because the system is designed as an agent doesn't mean that the actual agent coincides with the designed agent. We need tools to deal with these hybrid systems. Methods like Unsupervised Agent Discovery could help pin down the agent in such systems, and Mechinterp has to be extended to span the borders between LLMs.
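A minimal toy sketch of the point (the class and method names are hypothetical, not any particular framework): the persistent memory evolves outside the model, so an analysis scoped to the LLM's weights alone misses part of the system that is doing the optimizing.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSystem:
    """Hypothetical LLM agent: frozen model + scaffold + persistent memory."""
    memory: list[str] = field(default_factory=list)  # external, mutable state

    def llm(self, prompt: str) -> str:
        # Stand-in for a call to a frozen model; its weights never change here.
        return f"plan based on: {prompt[-60:]}"

    def step(self, observation: str) -> str:
        # The scaffold assembles context from memory -- part of the "agent" lives here.
        context = "\n".join(self.memory[-5:] + [observation])
        action = self.llm(context)
        # Learning happens here, outside the model: which memory entries accumulate and
        # get reused is shaped by the scaffold's feedback, not by any training gradient
        # applied to the LLM.
        self.memory.append(f"obs={observation} -> act={action}")
        return action

system = AgentSystem()
for obs in ["user asks A", "user asks B"]:
    print(system.step(obs))
print("optimization state held outside the model:", len(system.memory), "memory entries")
```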
Yes, the description length of each dimension can still be high, but not arbitrarily high.
Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.
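As a rough back-of-the-envelope illustration of "high, but not arbitrarily high" (the line count follows the "thousands of lines of pseudocode" figure above; the bits-per-line and synapse numbers are crude assumptions of mine):

```python
# Crude, assumed numbers for illustration only.
steering_lines = 10_000   # "thousands of lines of pseudocode" in the steering system
bits_per_line = 100       # assumed information content per line
steering_bits = steering_lines * bits_per_line   # ~1e6 bits

cortical_synapses = 1e14                          # rough order of magnitude
learned_model_bits = cortical_synapses * 1        # ~1 bit per synapse as a crude proxy

print(f"steering system: ~{steering_bits:.0e} bits")
print(f"learned model:   ~{learned_model_bits:.0e} bits")
print(f"ratio: ~{steering_bits / learned_model_bits:.0e}")
# The value-specifying interface is many orders of magnitude smaller than the
# model it steers -- a genuinely low-bandwidth bottleneck.
```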
Does the above derivation mean that values are anthropocentric? Maybe, kind of. I'm deriving only an architectural claim: bounded evolved agents compress their control objectives through a low-bandwidth interface. Humans are one instance. AIs are different: they are designed, and any evolutionary pressure on the architecture is not on anything value-like. If an AI has no such bottleneck, inferring and stabilizing its 'values' may be strictly harder. If it has one, it depends on its structure. Alignment might generalize, but not necessarily to human-compatible values.
Existing consciousness theories do not make predictions.
Huh? There are many predictions. The obvious ones:
Empirically measurable effects of at least some aspects of consciousness are totally routine in anesthesia; otherwise, how would you be confident that the patient is unconscious during a procedure? The Appendix of my Metacognition post lists quite a few measurable effects.
I think the problem is again that people can't agree on what they mean by consciousness. I'm sure there is a reading under which there are no predictions. But any theory that models it as a physical process necessarily makes predictions.
Thanks for writing this! I finally got around to reading it, and I think it is a great reverse-engineering of these human felt motivations. I think I'm buying much of it, but I have been thinking of aggregation cases and counterexamples, and would like to hear your take on them.
A friend wins an award; I like them, but I feel a stab of envy (and sometimes may even wish they'd fail). That is negative valence without an "enemy" label, and not obviously about their attention to me. For example:
when another outperforms the self on a task high in relevance to the self, the closer the other the greater the threat to self-evaluation. -- Some affective consequences of social comparison and reflection processes: the pain and pleasure of being close; Tesser et al., 1988
Is the idea that the "friend/enemy" variable is actually more like "net expected effect on my status," so a friend’s upward move can locally flip them into a threat?
I can dislike a competitor and still feel genuine admiration for their competence or courage. If "enemy" is on, why doesn’t it reliably route through provocation or schadenfreude? Do you think admiration is just a different reward stream, or does it arise when the "enemy" tag is domain-specific?
E.g., an opposing soldier or a political adversary is injured and I feel real compassion, even if I still endorse opposing them.
Two-thirds of respondents (65 per cent) say they would save the life of a surrendering enemy combatant who had killed a person close to them, but almost one in three (31 per cent) say they would not. The same holds true when respondents are asked if they would help a wounded enemy combatant who had killed someone close to them (63 per cent compared with 33 per cent). -- PEOPLE ON WAR Country report Afghanistan
This feels like “enemy × their distress” producing sympathy rather than schadenfreude. Is your take that “enemy” isn’t a stable binary at all—that vivid pain cues can transiently force a “person-in-pain” interpretation that overrides coalition tagging?
Someone helps me. I feel gratitude and an urge to reciprocate. It doesn’t feel like "approval reward" (I’m not enjoying being regarded highly). It feels more like a debt.
Perceived benevolent helper intentions were associated with higher gratitude from beneficiaries compared to selfish ones, yet had no associations with indebtedness. -- Revisiting the effects of helper intentions on gratitude and indebtedness: Replication and extensions Registered Report of Tsang (2006)
Do you see gratitude as downstream of the same "they’re thinking about me" channel, or as a separate ledger?
People often report guilt as a direct response to "I did wrong," even when they’re confident nobody will know.
when opportunities for compensation are not present, guilt may evoke self-punishment. -- When guilt evokes self-punishment: evidence for the existence of a Dobby Effect
I'm not sure that fits guilt from "imagined others thinking about me." It looks like a norm-violation penalty that doesn’t need the “about-me attention” channel. Do you have a view on which way it goes?
I have been wondering whether the suggested processing matches what we would expect for larger groups of people (who could each be friend or enemy and/or thinking of me or not). There seem to be at least two different processes going on:
Compassion doesn't scale with the number of people attended to. This seems to be well established as the identifiable victim effect and psychic numbing: when harm is spread over many victims, affect often collapses into numbness unless one person becomes vivid. That matches your attentional bottleneck.
But evaluation does seem to scale with headcount, at least in stage fright and other audience effects.
Maybe a roomful of people can feel strongly like “they’re thinking about me,” even if you’re not tracking anyone individually? But then the “about-me attention” variable would be computed at the group level, which complicates your analysis.
One additional thing: it may happen that you are encouraged or even pushed to hire a team, and perhaps to do so quickly, by your boss (or board), investors, advisors, or others. Push back, unless they have really good reasons. In that case, you need to think quickly about whether you can steelman Greta's points above.