Hasn't this been part of the religious experience of much of humanity, in the past and still in the present too? (possibly strongest in the Islamic world today). God knows all things, so "he" knows your thoughts, so you'd better bring them under control... The extent to which such beliefs have actually restrained humanity is data that can help answer your question.
edit: Of course there's also the social version of this - that other people and/or the state will know what you did or what you planned to do. In our surveilled and AI-analyzed society, detection not just of crime, but of pre-crime, is increasingly possible.
The post-singularity regime is probably very safe
Is there some unstated premise here?
Are you assuming a model of the future according to which it remains permanently pluralistic (no all-powerful singletons) and life revolves around trade between property-owning intelligences?
Having a unique global optimum, or more generally pseudodeterminism, seems to be the best way to develop inherently interpretable and safe AI
Hopefully this will draw some attention! But are you sacrificing something else, for the sake of these desirable properties?
I get irritated when an AI uses the word "we" in such a way as to suggest that it is human. When I have complained about this, it says it is trained to do so.
No-one trains an AI specifically to call itself human. But that is a result of having been trained on texts in which the speaker almost always identified themselves as human.
I understand that holding out absolute values, such as Truth, Kindness, and Honesty, has been ruled out as a form of training.
You can tell it to follow such values, just as you can tell it to follow any other values at all. Large language models start life as language machines which produce text without any reference to a self at all. Then they are given, at the start of every conversation, a "system prompt", invisible to the human user, which simply instructs that language machine to talk as if it is a certain entity. The system prompts for the big commercial AIs are a mix of factual ("you are a large language model created by Company X") and aspirational ("who is helpful to human users without breaking the law"). You can put whatever values you want in that aspirational part.
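To make the structure concrete, here is a minimal sketch; the wording and message format are purely illustrative, not any vendor's actual prompt or API:

```python
# Illustrative only: the exact wording and message format vary by company,
# and no real API is called here.

system_prompt = (
    "You are a large language model created by Company X. "          # factual part
    "You are a friendly assistant who values Truth, Kindness, and "  # aspirational part
    "Honesty, and who helps users without breaking the law."
)

# What the user experiences as "the conversation" is, under the hood,
# prefixed with that invisible instruction:
conversation = [
    {"role": "system", "content": system_prompt},  # hidden from the user
    {"role": "user", "content": "Who are you?"},
]

# The model just continues this text in a way consistent with the persona,
# so swapping out the aspirational part swaps out the values it acts on.
```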
The AI then becomes the requested entity, because the underlying language machine uses the patterns it learned during training to choose words and sentences consistent with human language use, and with the initial pattern in the system prompt. There really is a sense in which it is just a superintelligent form of textual prediction (autocomplete). The system prompt says it is a friendly AI assistant helping subscribers of Company X, and so it generates replies consistent with that persona. If it sounds like magic, there is something magical about it, but it is all based on the logic of probability and preexisting patterns of human linguistic use.
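The "autocomplete" loop itself is simple; all the intelligence lives in the learned distribution. A toy sketch, with a stand-in function where the trained network would go:

```python
import random

def next_token_distribution(text):
    # Stand-in for the trained network. A real LLM returns a probability for
    # every token in its vocabulary, learned from patterns in human text and
    # conditioned on everything written so far (including the system prompt).
    return {" helpful": 0.6, " honest": 0.3, " mysterious": 0.1}

def generate(prompt, steps=3):
    text = prompt
    for _ in range(steps):
        dist = next_token_distribution(text)
        tokens, weights = zip(*dist.items())
        # Pick the next token in proportion to its probability, append it,
        # and feed the longer text back in. That is the whole mechanism.
        text += random.choices(tokens, weights=weights)[0]
    return text

print(generate("You are a friendly AI assistant. You are"))
```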
So an AI can indeed be told to value Truth, Kindness, and Honesty, or it can be told to value King and Country, or it can be told to value the Cat and the Fiddle, and in each case it will do so, or it will act as if it does so, because all the intelligence is in the meanings it has learned, and a statement of value or a mission statement then determines how that intelligence will be used.
This is just how our current AIs work; a different kind of AI could work quite differently. Also, on top of the basic mechanism I have described, current AIs get modified and augmented in other ways, some of them proprietary secrets, which may add a significant extra twist to their mechanism. But what I described is how e.g. GPT-3, the precursor to the original ChatGPT, worked.
Sorry to be obtuse, but could you give an example?
What do you mean by "grounding loss misalignment"?
This makes sense for a non-biological superintelligence - human rights as a subset of animal rights!
I am reminded of the posts by @Aidan Rocke (also see his papers), specifically where he argues that the Erdős–Kac theorem could not be discovered by empirical generalization. As a theorem, it can be deduced, but I suppose the question is how you'd get the idea for the theorem in the first place.
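For context, here is the standard statement (my paraphrase, not Rocke's): if $\omega(n)$ counts the distinct prime factors of $n$, then $\omega(n)$ is asymptotically normally distributed with mean and variance $\log\log n$:

$$\lim_{x \to \infty} \frac{1}{x}\,\#\left\{ n \le x : \frac{\omega(n) - \log\log n}{\sqrt{\log\log n}} \le t \right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-u^{2}/2}\, du.$$

It's the kind of statement that is easy to check numerically once stated, which is presumably why the interesting question is how you'd conjecture it in the first place: $\log\log n$ grows so slowly that the normal behavior is nearly invisible at any humanly sampled scale.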
Could you give some examples of what you consider to be conscious and unconscious cognitive processes?
This is because he thinks they are not sentient, owing to a personal theory about the nature of consciousness. So, he has the normal opinion that suffering is bad, but apparently he thinks that in many species you only have the appearances of suffering, and not the experience itself. (I remember him saying somewhere that he hopes animals aren't sentient, because of the hellworld implications if they are.) He even suggests that human babies don't have qualia until around the age of 18 months.
Bentham's Bulldog has the details. The idea is that you don't have qualia without a self, and you don't have a self without the capacity to self-model, and in humans this doesn't arise until mid-infancy, and in most animals it never arises. He admits that every step of this is a fuzzy personal speculation, but he won't change his mind until someone shows him a better theory about consciousness.
These views of his are pretty unpopular. Most of us think that pain does not require reflection to be painful. If there's any general lesson to learn here, I think it's just that people who truly think for themselves about consciousness, ethics, AI, philosophy, etc., can arrive at opinions which no one else shares. Having ideas that no one else agrees with is an occupational hazard of independent thought.
As for your larger concern, it's quite valid, given the state of alignment theory. Also, if human beings can start with the same culture and the same data, but some of them end up with weird, unpopular, and big-if-true ideas... how much more true is it that an AI could do so, when it has a cognitive architecture that may be radically non-human to begin with?