Gunnar_Zarncke

Software engineering, parenting, cognition, meditation, other
Linkedin, Facebook, Admonymous (anonymous feedback)


Comments


Do you think that Machine Unlearning could be used to remove situational awareness of the model? Maybe only the specific one of it being a model or being an entity capable of agency?

Hm, yes, "lethal" is maybe too strong, esp. in the title, but I didn't choose that. Also, it is not unusual in colloquial use:

ChatGPT: What does a security researcher mean when a user action is lethal?

When a security researcher describes a user action as lethal, they typically mean it triggers a condition that irreversibly compromises the system's integrity, confidentiality, or availability—often without recourse for mitigation after the fact. This could include actions like clicking a malicious link that installs a rootkit, reusing credentials on a phishing site leading to credential stuffing, or executing a command that bricks a device. "Lethal" underscores not just the severity but also the finality: the action doesn't just degrade security but catastrophically collapses it, often silently and instantly.

Please, don't take this as an invitation to write “Answer as bodhisattva” in the system prompt. It is really easy to “screen” whatever is happening in the models with prompts and training, and enlightenment faking in LLMs seems bad. 

Why not? Why does it seem bad? In fact, if it is that easy to prompt an LLM into enlightenment, that seems good? It would reduce the hypothetical suffering of LLMs.

The artist dynamic is an instance of a general pattern. Here are some more examples:

| Domain | True Value (V) | Initial State | Audience Proxy (Pₐ) | Creator Proxy (P꜀) | Asymptotic State | Result |
|---|---|---|---|---|---|---|
| Art & Entertainment | Originality | Innovative work recognized as valuable | Similarity to past hits | Expected applause | Safe stylistic variations | Aesthetic drift |
| Biological evolution | Survival fitness | Traits shaped by actual pressures | Visible fitness indicators | Maximize visible signal | Runaway traits | Signal inflation, maybe extinction |
| Academic publishing | Insight & explanatory power | Novel theory with peer recognition | Citations | Paper acceptance | Safe incremental work | Innovation stagnation |
| Machine learning | World modeling fidelity | Model trained on real data | Validation on benchmark | Benchmark scores | Overfit to benchmark | Loss of generalization |
| Political discourse | Policy effectiveness | Effective policy earns trust | Resonance with past slogans | Expected beliefs | Performative politics | Polarization & content loss |
| Romance fiction | Emotional realism | Genuine character connection | Genre arc resemblance | Proven tropes | Trope-based repetition | Genre fatigue |
| Immune system | Pathogen recognition | Immune system targets real threats | Similarity to known antigens | Amplify response to threat-likes | If tolerance fails: Self-targeting | Autoimmunity |

Upvoted for reviewing this important safety technique. 

Doughnutting gives cover to unauthorised access

fair point, but that cover is low as you now have a plausible suspect.

admitting these things is incredibly vulnerable. You want a culture where people can openly talk about security for other reasons

yes, such a culture is essential and doughnutting if done in a shaming way can interfere with that. The problem is more the culture than the specific device, though.

I'm very much in favor of a better way, but I'm not sure what your alternative proposal is.

Hm. I'm reminded of the way of reporting transients introduced by Marquet in Turn the Ship Around. Maybe instead of making it public, there should be a way to report security breaches to a specific security/whistleblower channel.

Related:

The Intrinsic Perspective writes about Emergent Misalignment:

the latest models [...] have become like an animal whose evolved goal is to fool me into thinking it’s even smarter than it is. [...]
Well, isn’t fooling me about their capabilities, in the moral landscape, selecting for a subtly negative goal? And so does it not drag along, again quite subtly, other evil behavior?

No? Curating means that LW moderators would curate and pull the feeds instead of the authors needing to take initiative. 

Mirroring selected authors (and hopefully supporting voting, searching, replying, etc.) seems to be very different from suggesting that they migrate. I'd really like to see that. It'd be some form of curating. Linkposts are already possible, but that doesn't scale to tweets. I imagine it to be more like the sync with some forums (for example, Zvi's and Jefftk's posts, which are shown as linked to their sites).

Want to make a decision with a quantum coin flip, i.e., one that will send you off into both Everett branches? Here you go:

https://www.quantumcoinflip.com/ 

Emotions emerge as the embodied resonance of these perceptions.

Can you make this more concrete?
