Posting this on behalf of user PrimalShadow on the AstralCodexTen Discord:
What are ways in which an AI that is trained to prove specific math theorems is dangerous? In particular, imagine you have an AI which is trained to take a prompt asking to prove a formal-language-predicate, and then it produces a machine-readable proof of said thing.
I would posit some answers to my own question:
- The AI ends up finding bugs in your prover instead of actually solving problems. (This is a failure mode, but probably not dangerous in the sense of being unfriendly)
- The AI is used by humans to do dangerous things, e.g. break cryptography. (This is a type of danger, but not what I would call unfriendliness either.)
- The AI somehow learns about the outside world, and modifies its outputs in an attempt to manipulate that world - E.g. to acquire more computing power - E.g. to get the operator to ask it easier questions
- The AI as described is not directly dangerous, but some tweak to it is dangerous I wonder if I'm missing any broad categories.
I responded with the suggestion of what you might call "paperclipping" (it deciding it requires some astronomical amount of compute to solve the problem, so takes over the world, or maybe even the galaxy/beyond, to maximize compute), which PrimalShadow noted already falls into "The AI somehow learns about the outside world". He also pointed out that the AI wouldn't need to be trained on non-mathematical data, which would presumably reduce any risk of that happening.
Neither of us could easily think of a scenario where this trivially goes wrong, and we're both curious what the potential risks are from the perspective of more experienced people in the field.
How are you imagining making this thing? If you're training an optimizer that you don't understand, this can lead to pitfalls / unintended generalization. If it just pops into existence fulfilling the spec by fiat, I think that's a lot less dangerous than us using ML to make it.
Well, you might train an agent that has preferences about solving theorems that it generalizes to preferences about the real world.
Then you'd get something sort of like problem #3, but with way broader possibilities. You don't have to say "well, it was trained on theorems, so it's just going to affect what sorts of theorems it gets asked." It can have preferences about the world in general because it's getting its preferences by unintended generalization. Information leaks about the world (see acylhalide's comment) will lead to creative exploitation of hardware, software, and user vulnerabilities.