As a new member of this community, I was encouraged to do a babble for ideas on AI Safety. I did this last night and have compiled a document with my ideas, including short descriptions of what each is about:
https://docs.google.com/document/d/16ce97vgzhQwx5E2O1hzv5gXz83PiHtFQ2RwxElxDjGo/edit#
I chose to limit myself to 25 ideas, as I wanted to flesh them out a bit and didn't have the time for a full list of 50.
Please take a look and provide any feedback you may have!
Hi Raemon,
Thanks for the reply.
I was writing a reply to this, but while breaking the idea down further to explain it better, I think I ran into a potentially catastrophic failure scenario:
I still hold that the structure ensures the decision-making AI is rewarded for making good, high-utility plans, not for the likelihood of those plans being implemented. This would work in many cases, as it is more rewarding for the AI to tell you about a misaligned plan than to obscure that plan and get it implemented.
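Concretely, the split I have in mind looks something like this (just a minimal sketch; `planner`, `evaluator`, and `executor` are illustrative placeholders, not a real API):

```python
# Minimal sketch of the reward split: the planner's reward is fixed by an
# evaluation of the plan itself, before (and independent of) any execution.
# All objects and methods here are illustrative placeholders.

def planner_reward(plan, evaluator):
    """Reward depends only on judged plan quality, not on implementation."""
    utility = evaluator.judge_utility(plan)        # how good is the plan?
    obscurity = evaluator.judge_obfuscation(plan)  # penalise hidden details
    return utility - obscurity

def run_step(planner, evaluator, executor):
    plan = planner.propose()
    reward = planner_reward(plan, evaluator)  # reward is locked in here...
    planner.update(reward)
    if evaluator.approves(plan):              # ...so whether the plan actually
        executor.execute(plan)                # gets implemented can't change it
```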
However, there remains the case where the AI can see it would get higher rewards on future decisions...
As I said in my post, I'm not suggesting I have solved alignment; I'm trying to address specific problems in the alignment space. Here I'm aiming at two things in particular:
1) Transparency. That's not to say you can ever know what a NN is really optimizing for (due to internal optimizers), but you can get it to produce a verifiable output. How you verify that output is a problem in itself, but the first step must be getting something you can verify.
2) Preventing training pressure from creating a system whose failure modes drift toward the most extreme outcomes. There are open questions about whether this can be done without just creating an expensive brick, and that is what I'm currently investigating. I believe it is possible and scalable, but I have no formal proof of that, and I agree it is a valid concern with this approach.
I recently had a thought about AI alignment that I wanted to get some feedback on. Keep in mind this is not a complete solution, just a core idea to build on (it omits a lot of implementation details that would be far from trivial), and I'm by no means claiming it removes all risks associated with AI on its own.
My concept is to train an AI model to produce logical reasoning graphs. It will be trained only on the accuracy of the facts used in its reasoning, the logical consistency of the graph it produces, and completeness. It explicitly won't be trained on alignment issues. If it is logically consistent, uses valid facts, and...
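To give a rough sense of the training signal I'm describing, here is a minimal sketch; the graph types and the three checker functions are placeholders for whatever fact-checking, consistency-checking, and coverage-checking processes would actually be used:

```python
# Minimal sketch of scoring a reasoning graph purely on fact accuracy,
# logical consistency, and completeness, with no "alignment" term in the signal.
# The checker callables are placeholders, not real implementations.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    claim: str
    sources: List[str] = field(default_factory=list)  # citations for verification

@dataclass
class ReasoningGraph:
    nodes: List[Node]
    edges: List[Tuple[int, int]]  # (premise index, conclusion index)

def training_score(
    graph: ReasoningGraph,
    fact_checker: Callable[[Node], float],               # 0..1 accuracy of a claim
    logic_checker: Callable[[ReasoningGraph], float],     # 0..1 consistency
    coverage_checker: Callable[[ReasoningGraph], float],  # 0..1 completeness
) -> float:
    accuracy = sum(fact_checker(n) for n in graph.nodes) / max(len(graph.nodes), 1)
    consistency = logic_checker(graph)
    completeness = coverage_checker(graph)
    return accuracy + consistency + completeness
```

The point is just that nothing about how desirable the plan is, or whether it ever gets implemented, feeds into the score.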
This is amazingly close to my thoughts on the subject, and you've said it far better than I could have. I have additional thoughts on how we could make the "manager" part of the system safer, although that is more a potential implementation detail than something I expect to happen regardless of developers' views on safety.
I do agree that these kinds of systems are far more likely to be preferred in the short term, and probably the long term too, and that while they aren't without risk, reducing that risk would be an easier problem than doing the same for monolithic systems.