Hm, yes, "lethal" is maybe too strong, especially in the title, but I didn't choose that. Also, it's not unusual in colloquial use:
ChatGPT: What does a security researcher mean when a user action is lethal?
When a security researcher describes a user action as lethal, they typically mean it triggers a condition that irreversibly compromises the system's integrity, confidentiality, or availability—often without recourse for mitigation after the fact. This could include actions like clicking a malicious link that installs a rootkit, reusing credentials on a phishing site leading to credential stuffing, or executing a command that bricks a device. "Lethal" underscores not just the severity but also the finality: the action doesn't just degrade security but catastrophically collapses it, often silently and instantly.
Please don't take this as an invitation to write “Answer as bodhisattva” in the system prompt. It is really easy to “screen” whatever is happening in the models with prompts and training, and enlightenment faking in LLMs seems bad.
Why not? Why does it seem bad? In fact, if it's that easy to prompt an LLM into enlightenment, that seems good? It would reduce hypothetical suffering of LLMs.
The artist dynamic is an instance of a general pattern. Here are some more examples:
| Domain | True Value (V) | Initial State | Audience Proxy (Pₐ) | Creator Proxy (P꜀) | Asymptotic State | Result |
|---|---|---|---|---|---|---|
| Art & entertainment | Originality | Innovative work recognized as valuable | Similarity to past hits | Expected applause | Safe stylistic variations | Aesthetic drift |
| Biological evolution | Survival fitness | Traits shaped by actual pressures | Visible fitness indicators | Maximize visible signal | Runaway traits | Signal inflation, maybe extinction |
| Academic publishing | Insight & explanatory power | Novel theory with peer recognition | Citations | Paper acceptance | Safe incremental work | Innovation stagnation |
| Machine learning | World-modeling fidelity | Model trained on real data | Validation on benchmark | Benchmark scores | Overfit to benchmark | Loss of generalization |
| Political discourse | Policy effectiveness | Effective policy earns trust | Resonance with past slogans | Expected beliefs | Performative politics | Polarization & content loss |
| Romance fiction | Emotional realism | Genuine character connection | Genre arc resemblance | Proven tropes | Trope-based repetition | Genre fatigue |
| Immune system | Pathogen recognition | Immune system targets real threats | Similarity to known antigens | Amplify response to threat-likes | If tolerance fails: self-targeting | Autoimmunity |
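
A minimal sketch of the shared dynamic (my illustration, not from the original; all names and numbers are made up), using the art row: if the creator repeatedly optimizes the audience proxy Pₐ (similarity to past hits) instead of the true value V (originality), V decays toward zero.

```python
import random

# Toy model of proxy optimization: each round the creator releases the
# candidate work that maximizes the audience proxy P_a (similarity to
# past hits) rather than the true value V (originality).
random.seed(0)
past_hits = [random.random()]          # stylistic positions of earlier hits

def originality(style):                # V: distance from everything done before
    return min(abs(style - h) for h in past_hits)

def applause_proxy(style):             # P_a ≈ P_c: closeness to past hits
    return -min(abs(style - h) for h in past_hits)

for round_no in range(10):
    candidates = [random.random() for _ in range(50)]
    chosen = max(candidates, key=applause_proxy)   # optimize the proxy, not V
    print(f"round {round_no}: originality = {originality(chosen):.3f}")
    past_hits.append(chosen)
```

Originality drops toward zero as the cluster of hits densifies: "safe stylistic variations" and aesthetic drift.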
Upvoted for reviewing this important safety technique.
Doughnutting gives cover to unauthorised access
Fair point, but that cover is thin, as you now have a plausible suspect.
Admitting these things is incredibly vulnerable. You want a culture where people can openly talk about security, for other reasons as well.
Yes, such a culture is essential, and doughnutting done in a shaming way can interfere with it. The problem is more the culture than the specific device, though.
I'm very much in favor of a better way, but I'm not sure what your alternative proposal is.
Hm. I'm reminded of the way of reporting transients that Marquet introduces in *Turn the Ship Around!*. Maybe instead of making breaches public, there should be a way to report them to a dedicated security/whistleblower channel.
Related:
The Intrinsic Perspective writes about Emergent Misalignment:
the latest models [...] have become like an animal whose evolved goal is to fool me into thinking it’s even smarter than it is. [...]
Well, isn’t fooling me about their capabilities, in the moral landscape, selecting for a subtly negative goal? And so does it not drag along, again quite subtly, other evil behavior?
No? Curating means that LW moderators would curate and pull in the feeds, instead of the authors needing to take the initiative.
Mirroring selected authors (and hopefully supporting voting, searching, replying, etc.) seems very different from suggesting they migrate. I'd really like to see that; it would be a form of curating. Linkposts are already possible, but those don't scale to tweets. I imagine it working more like the existing sync with some blogs (the way Zvi's and Jefftk's posts are cross-posted and shown as linked to their sites).
Want to make a decision with a quantum coin flip, i.e., one that will send you off into both Everett branches? Here you go:
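
For a sketch of one way to do this yourself (not necessarily the tool linked above), assuming the ANU quantum random number generator's public API is still available; the endpoint and response format below are assumptions about that service:

```python
import json
import urllib.request

# Hedged sketch: fetch one byte from the ANU quantum RNG (assumed public
# endpoint) and use its lowest bit as the coin flip. Each outcome should
# correspond to a different Everett branch.
URL = "https://qrng.anu.edu.au/API/jsonI.php?length=1&type=uint8"

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)   # e.g. {"type": "uint8", "data": [173], ...}

bit = payload["data"][0] & 1
print("heads" if bit else "tails")
```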
Emotions emerge as the embodied resonance of these perceptions.
Can you make this more concrete?
Do you think machine unlearning could be used to remove a model's situational awareness? Maybe just the specific awareness of being a model, or of being an entity capable of agency?