Trusting in your last sentence would be equivalent to trusting that increasingly great cognitive capacities will not lead to the emergence of conscience. Maybe they won't - but there's nothing obvious, and from my perspective, intuitive in it. If / when a conscience emerge you are not in control anymore (not to mention: you also have a serious ethical problem on your hands).
Thank you. There is a convoluted version of the world where the model tricked you the whole time: "Let's show these researchers results that will suggest that even if they catch me faking during training, they should not be overly worried because I will still end up largely aligned, regardless"
Not saying it's a high probability, to be clear, but: as a theoritical possibility.