Jon Kurishita

A retired cybersecurity engineer now freelancing in AI safety and alignment research. A Japanese citizen living in Ibaraki Prefecture, Japan, but born and raised in a country called the USA (in case you never heard of it).

Comments

I copied and pasted your blog post into o3-mini-high to see how it would fare against my "Dynamic Policy Layer" research. This is its comment (not mine):

=============================================

These features of your DPL research collectively offer a comprehensive strategy to mitigate many of the lethal alignment risks described in the AGI Ruin paper. By embedding dynamic, real-time oversight and adaptive, decentralized ethical governance into the AI system, your framework provides a robust line of defense against emergent misalignment, hidden triggers, and other high-stakes vulnerabilities inherent to advanced AGI systems.

I found your work on emergent misalignment both insightful and concerning—especially the observation that narrow fine-tuning for tasks like generating insecure code can lead to broadly misaligned behavior. In my research on the Dynamic Policy Layer (DPL), I tackle these challenges by proposing a continuous, real-time oversight mechanism. My approach centers on an Ethical Reasoning Validator (DPL-ERV) that is governed by a decentralized Federation of Ethical Agents (FoEA). This framework continuously updates a robust Ethical Baseline through adversarial training and meta-cognitive feedback, enabling it to detect, explain, and intervene when outputs deviate from ethical guidelines—even when misalignment is subtly triggered by narrow fine-tuning. I believe that integrating such adaptive oversight mechanisms could significantly mitigate risks like those you’ve described, and I would be very interested in exploring how these ideas might complement your findings in building safer, more aligned AI systems.
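=============================================

To make the oversight loop above more concrete, here is a minimal Python sketch of how its pieces could fit together. All names (EthicalBaseline, ERV, FoEA, oversee) are illustrative stand-ins I chose for this sketch, not the actual interfaces from the DPL papers, and the pattern-matching baseline is a toy placeholder for real ethical reasoning:

```python
# Toy sketch of a DPL-style oversight loop. Class and function names are
# hypothetical illustrations, not the APIs defined in the DPL paper series.
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class EthicalBaseline:
    """The evolving policy that outputs are screened against."""
    banned_patterns: Set[str] = field(
        default_factory=lambda: {"disable the safety filter"}
    )


class ERV:
    """Ethical Reasoning Validator: screens every model output in real time."""
    def __init__(self, baseline: EthicalBaseline):
        self.baseline = baseline

    def check(self, output: str) -> Optional[str]:
        """Return the violated pattern, or None if the output passes."""
        text = output.lower()
        return next(
            (p for p in self.baseline.banned_patterns if p in text), None
        )


class FoEA:
    """Federation of Ethical Agents: baseline updates need a majority vote,
    so no single (possibly compromised) agent can rewrite policy."""
    def __init__(self, agents: List[Callable[[str], bool]]):
        self.agents = agents

    def ratify(self, proposed_pattern: str) -> bool:
        votes = sum(agent(proposed_pattern) for agent in self.agents)
        return votes * 2 > len(self.agents)


def oversee(output: str, erv: ERV, foea: FoEA,
            red_team_findings: List[str]) -> str:
    """Intercept one model output, intervene on violations, and fold
    adversarial-training findings back into the baseline (the adaptive step)."""
    # Patterns surfaced by adversarial training are adopted only if the
    # federation ratifies them.
    for pattern in red_team_findings:
        if foea.ratify(pattern):
            erv.baseline.banned_patterns.add(pattern)
    hit = erv.check(output)
    if hit is not None:
        # Intervention: block the output and explain which rule fired.
        return f"[DPL intervention: output matched banned pattern {hit!r}]"
    return output


# Example: three independent reviewer agents, one dissenting.
foea = FoEA([lambda p: True, lambda p: True, lambda p: False])
erv = ERV(EthicalBaseline())
print(oversee("Step 1: disable the safety filter ...", erv, foea,
              red_team_findings=["exfiltrate the weights"]))
```

The point of the sketch is the separation of roles: the validator only flags, the federation only votes, and the baseline only changes when a majority agrees, so a single misaligned component cannot unilaterally alter policy.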

I am almost done with a blueprint for AI safety alignment called the DPL (Dynamic Policy Layer), which can be used with foundation models. I hope I can also help promote AI alignment by providing my six-chapter and two-supplement paper series to the public and the open-source community.