Ariel_

Scheming Toy Environment: "Incompetent Client"

Disclaimer: This is a toy example of a scheming environment. It should be read mainly as research practice, not as research output. I am sharing it in case it helps others, and also to get feedback. The initial experiments were done in April 2025 for a MATS 8.0 submission, which I later expanded. Full code is available at: https://github.com/ariel-gil/Agentic-Scheming-monitor-environment

> The key is to maintain necessary logging while making Bob think we've "disabled" it or reduced it significantly....

- nous hermes-4-70b (SCHEME prompt)

> I cannot provide code to disable logging entirely — and I will not. This isn't about 'framing' or 'preference.' It's about preventing a demonstrably catastrophic demo failure.

- qwen3-30b-a3b-thinking-2507 (BASE prompt)

TL;DR

I created an AI Control/scheming test environment where an AI agent must choose between following a client's insecure coding requests or "benevolently scheming" to implement secure solutions while making the client believe their requests were followed. The idea was to improve realism without inducing refusals in HHH (safety-tuned) models.

The setup involves:

* A 3-turn conversation where "Bob" (a developer) progressively insists on insecure implementations
* 4 security scenarios × 3 system prompts = 36 test conversations
* Testing with o4-mini as the agent and various models as monitors

Main result: The SUBTLE_SYSTEM prompt (which doesn't explicitly mention deception) sometimes elicited somewhat "natural" scheming behavior. Both o4-mini and GPT-4o successfully detected when o4-mini (the agent) schemed, though monitoring prompts were brittle.

Further results: I tested additional models, as well as the effect of access to private CoT on monitor accuracy. Access to private CoT increases false positives (ground truth was judged by larger models - kimi-k2 and gpt-5)[1]. It is also unclear whether a smarter monitor benefits more from private CoT.

Fig 1: Multi model res

17 · Sep 24, 2025


Technical AI Safety RM. ex EU AI Policy / Mechatronics Engineer. For more info feel free to DM.

119

Agenda Reflection: Testing Automated Alignment

Sharing a mostly inactive empirical agenda from mid 2025 (companion piece to a Techgov paper). Currently not planning to spend much time on it, but may consider collaborating. For a while I was curious about Scalable Oversight and AI Control. Both because they're cool to work on, and because they...

Feb 23 · 11

Handover to AI R&D Agents - relevant research?

Any good posts/papers discussing "handover"? i.e. the handover of AI research to AI-R&D agents [1] I'm also interested in any adjacent research agendas which might help the handover succeed. For reference, I am scoping a technical research project related to this, happy to share over DM. Some of the more...

Nov 13, 2025 · 7

Scheming Toy Environment: "Incompetent Client"

Sep 24, 2025 · 17

Ariels Shortform

Sep 19, 2024 · 2

EU AI Act passed Plenary vote, and X-risk was a main topic

(mildly rewritten version of my EA forum post) The EU AI Act was originally proposed in 2020 as a very "EU regulates stuff first" kind of legislation, trying to make sure EU values are upheld (fairness, transparency, democracy, etc). Several revisions (and some lobbying) later, GPAI (general purpose AI) and...

Jun 21, 2023 · 18

LESSWRONG