Introduction
I am new to AI Safety and this is my first LessWrong post (crossposted to the EA Forum), so please feel free to correct my mistakes. The first part of this post covers my interpretation of existing alignment research and how I intend to tackle the problem; the rest is about provably honest AI systems.
My Take on Alignment Research
Any sort of interpretability research is model-dependent, and so is RL from Human Feedback, since it optimizes a reward model learned for the specific task at hand. I don't know how AIs will do alignment research themselves, but if they too try to do research to make individual models safe, then that...
Thank you for the suggested link. That does seem like a good expansion on the same idea.
Yes, you are right: this argument clearly doesn't hold for neural networks as they exist now. But this entire article was building toward exactly this line of reasoning, hence the emphasis on model-independent approaches. A standard feedforward network with fixed parameters and weights, and without a mesa-optimizer, cannot lie selectively in different situations in the way a timing attack would exploit, nor does it have the abilities of an AGI. Rather than any arbitrary AGI system, I would say that any model which has both inner and outer objectives to align will take different computation times. I am not talking...
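To make the timing intuition concrete, here is a minimal sketch (my own illustration, not from the original post): a fixed-architecture feedforward network performs the same operations on every input, so its latency carries no information, whereas a hypothetical system with an inner objective that "deliberates" extra only on sensitive inputs would show a measurable timing difference. The `hypothetical_deceptive_agent` function and its extra planning loop are assumed stand-ins, not a real mesa-optimizer.

```python
# Sketch of the timing-attack intuition, assuming numpy is available.
import time
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 64)), rng.normal(size=(64, 1))

def fixed_feedforward(x):
    # Same matrix multiplications on every input: the amount of
    # computation (and hence the timing) does not depend on the input.
    return np.tanh(x @ W1) @ W2

def hypothetical_deceptive_agent(x, sensitive):
    # Toy stand-in for a system with an inner objective: on "sensitive"
    # inputs it runs extra internal steps before answering.
    h = np.tanh(x @ W1)
    if sensitive:
        for _ in range(50):  # extra deliberation only in some situations
            h = np.tanh(h @ W1)
    return h @ W2

def avg_time(fn, *args, trials=200):
    start = time.perf_counter()
    for _ in range(trials):
        fn(*args)
    return (time.perf_counter() - start) / trials

x = rng.normal(size=(1, 64))
print("fixed net           :", avg_time(fixed_feedforward, x))
print("agent, non-sensitive:", avg_time(hypothetical_deceptive_agent, x, False))
print("agent, sensitive    :", avg_time(hypothetical_deceptive_agent, x, True))
```

Under these assumptions, the last line takes visibly longer than the second, which is the kind of model-independent signal the post is pointing at; a plain feedforward pass shows no such variation.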