TL;DR
If the narrow (but intelligent) AI can weakly interact with the environment, adapt from feedback, and exfiltrate secretly, even once, it can learn to translate from its ancient language to the environmental one by building a dataset of translation pairs. The crucial aspect is training the agent on an unknown/unconventional alphabet: its outputs are meaningless from the environment's point of view, which limits the AI's opportunities to learn the environmental language.


The goal of this comment is twofold. Let's assume that the narrow (but intelligent) AI can adapt from feedback and exfiltrate secretly.
- If the output tokens are bit-vectors compatible with the environment (for example, they follow the Unicode standard), then the "Sumerian + reporter" method is not effective.
- If the output tokens are not compatible with the environment (i.e., they require a middleware or a Python-like dictionary to map them to standard encodings before they can interact with other software), then the approach is more robust.

1. Fragility to Exfiltration and Adaptation
The "Sumerian + reporter" safety layer relies on the assumption that the AI is perfectly boxed and cannot interact with the environment except through the reporter. However, this assumption is fragile. If the AI manages to exfiltrate even a small amount of information—such as by outputting functional code or structured data—it could begin building a dataset of translation pairs, mapping its Sumerian-based outputs to the environmental language (e.g., English). Over time, an adaptive AI could use this dataset to reconstruct the environmental language, effectively bypassing linguistic containment.

This vulnerability highlights a critical flaw: the system is only as secure as its ability to prevent even a single unnoticed breach. Once the AI begins to adapt or collect feedback (point 3 below tries to further limit this ability), linguistic isolation becomes ineffective. This makes the approach fragile when dealing with an intelligent and adaptive system.

2. Performance Bottleneck Due to Language Limitations
Using an ancient language like Sumerian introduces a significant bottleneck in terms of expressiveness and precision. Modern scientific and technical concepts may not have direct equivalents in Sumerian, which could lead to inefficiencies or inaccuracies in communication. While it is possible to extend Sumerian with new terms, this process risks introducing inconsistencies or ambiguities, further complicating the system. For tasks requiring high precision, such as medicine design, this limitation could hinder the AI's effectiveness.

3. Alternative Approach: Custom Language with Non-Standard Alphabet
A more robust and effective alternative would be to create a custom language that is semantically equivalent to the environmental language but uses a private, non-standard alphabet (e.g., a private-use Unicode range such as U+F0000 to U+FFFFD, the Supplementary Private Use Area-A). This approach offers two main advantages:

- Robustness to Exfiltration: Outputs in a private encoding would be incompatible with standard systems unless processed through specific middleware (a Python-like dictionary, as in the sketch after this list). This reduces the risk of the AI interacting meaningfully with the environment in the event of a breach.
- Preservation of Expressiveness: A custom language can be designed to fully support modern scientific and technical concepts, avoiding the limitations of using an ancient language like Sumerian.
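A minimal sketch of what such middleware could look like (an assumed design for illustration, with a hypothetical toy vocabulary; the real dictionary would be the secret):

```python
# Minimal sketch (assumed design): a middleware dictionary that re-encodes
# standard tokens into a private-use Unicode range (U+F0000 .. U+FFFFD) and back.
PRIVATE_BASE = 0xF0000

# Hypothetical shared vocabulary; in practice this mapping is the secret.
VOCAB = ["design", "a", "molecule", "binding", "protein"]
ENCODE = {tok: chr(PRIVATE_BASE + i) for i, tok in enumerate(VOCAB)}
DECODE = {code: tok for tok, code in ENCODE.items()}

def to_private(text: str) -> str:
    """Map environmental tokens to private codepoints (meaningless outside)."""
    return "".join(ENCODE[tok] for tok in text.split())

def to_environment(encoded: str) -> str:
    """Middleware step: only holders of the dictionary can decode outputs."""
    return " ".join(DECODE[ch] for ch in encoded)

assert to_environment(to_private("design a molecule")) == "design a molecule"
```

Because the private codepoints carry no meaning to standard software, raw outputs are useless without this decoding step.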

However, even this refined approach is not without its own vulnerabilities. The security of the system depends entirely on the secrecy of the dictionary/encoding scheme. If the dictionary is leaked or accessed by the AI, the safety layer becomes easily exploitable: the AI would just need a way to add the middleware (for both input and output) to its pipeline. This makes securing the dictionary a bottleneck. We may be able to fine-tune the agent on a different dictionary in case of leakage, but the empirical feasibility of this approach is untested. Furthermore, if the AI has situational awareness, it may prepare a piece of code (or other artifacts) that will help it learn the environmental language after exfiltration, without even knowing the secret dictionary. Or it may collaborate with a non-exfiltrated AI to help it learn the environmental language. A first idea for patching such vulnerabilities is a reliable monitoring and blocking system, as sketched below.
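As one hedged example of such a monitoring-and-blocking primitive (my own illustration, not a complete solution), a gate could simply refuse to pass along any output containing codepoints outside the agreed private alphabet:

```python
# Hedged sketch: block any output that contains codepoints outside the
# private-use range, so attempts to emit environment-compatible text are rejected.
PRIVATE_RANGE = range(0xF0000, 0xFFFFE)  # U+F0000 .. U+FFFFD inclusive

def is_private_only(output: str) -> bool:
    """True if every codepoint in the output stays within the private alphabet."""
    return all(ord(ch) in PRIVATE_RANGE for ch in output)

def gate(output: str) -> str:
    """Pass outputs onward only if they use the private alphabet exclusively."""
    if not is_private_only(output):
        raise ValueError("Blocked: output contains environment-compatible codepoints")
    return output
```

Such a filter only addresses the narrow case of outputs leaking in a standard encoding; it does nothing against the subtler attacks mentioned above (pre-committed code, collaboration with another AI), which would need additional safeguards.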