As the internal ontology takes on any reflective aspects (parts of the representation that mix with facts about the AI's internals), I expect to find much larger differences.
It could be worth exploring reflection in transparency-based AIs, whose internals are observable. We can train a learning AI that only learns concepts by grounding them in its own internals (consider the example of a language-based AI learning a representation that links saying words to its output procedure). Even if AI-learned concepts do not coincide with human concepts, because the AI's internals differ greatly from human experience (e.g. a notion of "easy to understand" taking on only a metaphorical meaning for an AI), AI-learned concepts remain interpretable to the AI's programmer thanks to the AI's transparency (and the programmer could engineer control mechanisms to deal with misalignment). In other words, there will be unnatural abstractions, but they will be discoverable, on the condition of training a different kind of AI, as opposed to current methods, which are not inherently interpretable. This is monumental work, but desperately needed work.
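As a very rough illustration of what "grounding a concept in the AI's own internals" could look like, here is a toy Python sketch. It is my own minimal stand-in, not the training process I am actually proposing: the trace, the `output_procedure` label, and the crude co-occurrence check are all placeholder assumptions.

```python
from collections import Counter

class TransparentAgent:
    """Toy agent whose internals are plain Python objects, so the programmer
    can inspect every event and every concept it forms."""

    def __init__(self):
        self.trace = []       # fully observable log of internal events
        self.concepts = {}    # learned concept -> its grounding in the internals

    def output_word(self, word):
        self.trace.append(("output_procedure", word))   # the AI's own output routine
        return word

    def observe_token(self, token):
        self.trace.append(("observed", token))          # symbolic input

    def ground_saying(self):
        """Crude grounding step: if the token 'say' shows up in the input while
        the output procedure is also being exercised, link the two."""
        observed = Counter(item for kind, item in self.trace if kind == "observed")
        ran_output = any(kind == "output_procedure" for kind, _ in self.trace)
        if observed["say"] and ran_output:
            self.concepts["say"] = "my output_procedure"
        return self.concepts

agent = TransparentAgent()
agent.observe_token("say")
agent.output_word("hello")
print(agent.ground_saying())   # {'say': 'my output_procedure'}
```

The point is only that every step of the grounding is readable off the trace; a programmer who disliked a learned link could inspect it and intervene.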
I like your arguments on AGI timelines, but the last section of your post feels like you are reflecting on something I would call "civilization improvement" rather than on a 20+ year plan for AGI alignment.
I am a bit confused by the way you conflate "civilization improvement" with a strategy for alignment (when you discuss enhanced humans solving alignment, or discuss empathy in communicating a message like "If you and people you know succeed at what you're trying to do, everyone will die"). Yes, given longer timelines, civilization improvement can play a big role in reducing existential risk, including AGI x-risk, but I would prefer to sell the broad merits of interventions on their own rather than squeeze them into a strategy for alignment from today's limited viewpoint. When making a multi-decade plan for civilization improvement, I think it is also important to consider the possibility of AGI-driven "civilization improvement": interventions will not only influence AGI development, they may also be critically influenced by it.
Finally, when considering strategy for alignment under longer timelines, people can have useful non-standard insights; see for example this discussion on AGI paradigms and this post on agent foundations research.
I am also interested in interpretable ML. I am developing artificial semiosis, a human-like AI training process which can achieve aligned (transparency-based, interpretability-based) cognition. You can find an example of the algorithms I am making here: the AI runs a non-deep-learning algorithm, does some reflection, and forms a meaning for someone “saying” something, a meaning different from the usual meaning for humans, but perfectly interpretable.
I therefore support the case for differential technological development:
There are two counter-arguments to this that I'm aware of, neither of which I think justifies not working on it.
Regarding 1, it may take several years for interpretable ML to reach capabilities equivalent to LLMs, but the future may offer surprises, either in terms of coordination to pause the development of "opaque" advanced AI or of deep learning hitting a wall... at killing everyone. Let's have a plan also for the case in which we are still alive.
Regarding 2, interpretable ML would need programmed control mechanisms to be aligned. There is currently no such field of AI safety, since we do not yet have interpretable ML, but I imagine computer engineers being able to make progress on these control mechanisms (more progress than on mechanistic interpretability of LLMs). While it is true that control mechanisms can be disabled, you can always advocate for the highest security (as in Ian Hogarth's Island idea). You can then also reject this counter-argument.
mishka noted that this paradigm of AI is more foomable. Self-modification is a huge problem. My intuition is that interpretable ML will exhibit a form of scaffolding, in that control mechanisms for robustness (i.e. for achieving capabilities) can advantageously double as alignment mechanisms. Thanks to interpretable ML, engineers may be able to study self-modification already in systems with limited capabilities and learn the right constraints.
In his paper, Searle puts forward many arguments.
Early in his argumentation, referring to the Chinese room, Searle makes this argument (which I ask you not to mix carelessly with his later arguments):
it seems to me quite obvious in the example that I do not understand a word of the Chinese stories. I have inputs and outputs that are indistinguishable from those of the native Chinese speaker, and I can have any formal program you like, but I still understand nothing. For the same reasons, Schank's computer understands nothing of any stories, whether in Chinese, English, or whatever, since in the Chinese case the computer is me, and in cases where the computer is not me, the computer has nothing more than I
Later, he writes:
the whole point of the original example was to argue that such symbol manipulation by itself couldn't be sufficient for understanding Chinese.
Let me frame this argument in a way that allows it to be analyzed:
1) P (the Chinese room) is X (a program capable of passing the Turing test in Chinese);
2) Searle can be any X without understanding Chinese (as exemplified by Searle being the Chinese room and not understanding Chinese, which can be demonstrated for certain programs);
thus 3) no X understands Chinese.
Searle is arguing that “no program understands Chinese” (I stress this in order to reply to Said). The argument "P is X, P is not B, thus no X is B" is an invalid syllogism. Nevertheless, Searle believes that in this case “P not being B” implies (or strongly points towards) “X not being B”.
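The invalidity is easy to exhibit with a toy model in which both premises hold and the conclusion fails (the set members below are placeholders of mine, not claims about any actual program):

```python
# Toy counterexample to the pattern "P is X, P is not B, therefore no X is B".
X = {"rule_table", "semiotic_ai"}   # X: programs passing the Turing test in Chinese
B = {"semiotic_ai"}                 # B: things that understand Chinese
P = "rule_table"                    # P: the Chinese room as Searle runs it

premise_1 = P in X                        # "P is X"       -> True
premise_2 = P not in B                    # "P is not B"   -> True
conclusion = all(x not in B for x in X)   # "no X is B"    -> False

print(premise_1, premise_2, conclusion)   # True True False
```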
Yes, Searle’s intuition is known to be problematic and can be argued against accordingly.
My point, however, is that out there in the space of X there is a program P that is quite unintuitive. I am suggesting a positive example of “P possibly understanding Chinese” which could cut the debate short. Don't you see that giving a positive answer to the question “can a program understand?” may bring some insight into Searle's argument too (such as developing it into a "Chinese room test" to assess whether a given program can indeed understand)? Don't you want to look into my suggested program P (semiotic AI)?
In the beginning of my post I made it very clear:
Humans learn Chinese all the time; yet it is uncommon for them to learn Chinese by running a program
Uhm, an Aboriginal tends to see meaning in anything. The more regularities there are, the more meaning she will form. Semiosis is the dynamic process of interpreting these signs.
If you were put in a Chinese room with no other input than some incomprehensible scribbles, you would probably start considering that what you are doing does indeed have a meaning.
Of course, a less intelligent human in the room or a human put under pressure would not be able to understand Chinese even with the right algorithm. My point is that the right algorithm enables the right human to understand Chinese. Do you see that?
A more proper summary would read as follows:
1. P is an instantiated algorithm that behaves as if it [x]. (Where [x] = “understands and speaks Chinese”.)
2. If we examine P, we can easily see that its inner workings cannot possibly explain how it could [x].
3. Therefore, the fact that humans can [x] cannot be explainable by any algorithm.
I have some problems with your formulation. The fact that P does not understand [x] is nowhere in it, not even in premise #1. Conclusion #3 is wrong and should be written as "the fact that humans can [x] cannot be explainable by P". This conclusion does not need the premise that "P does not understand [x]", but only premise #2. In fact, at least two conclusions can be derived from premise #2, including the conclusion that "P does not understand [x]".
I state that, using a premise #2 that does not talk about any program, both of Searle's conclusions hold true, but they do not apply to an algorithm which performs (simulates) semiosis.
SCA infers that "somebody wrote that", where the term "somebody" is used more generally than in English.
SCA does not infer that another human being wrote that, but rather that a causal agent wrote that, maybe the spirits of the caves.
If SCA enters two caves and observes natural patterns in cave A and the characters of "The Adventures of Pinocchio" in cave B, she may deduce that two different spirits wrote them. Although she may discover some patterns in what spirit A (natural phenomena) wrote, she won't be able to discover a grammar as complex as in cave B. Spirit B often wrote the sequence "oor ", preceded sometimes by a capital " P", sometimes by a small " p". Therefore, she infers that the symbols "p" and "P" are similar (at first, she may also group "d" with them, but she may correct that thanks to additional observations).
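A minimal Python sketch of this kind of statistical pattern matching (my toy illustration, not the actual SCA procedure; the corpus and the three-character context width are arbitrary choices of mine):

```python
from collections import defaultdict

def right_contexts(text, width=3):
    """For every character, collect the set of short strings that follow it."""
    contexts = defaultdict(set)
    for i, ch in enumerate(text):
        contexts[ch].add(text[i + 1:i + 1 + width])
    return contexts

def shared(contexts, a, b):
    """How many right-contexts two characters have in common."""
    return len(contexts[a] & contexts[b])

cave_b = 'Pinocchio knocked at the poor door. "Poor me!" cried the poor puppet.'
ctx = right_contexts(cave_b)
print(shared(ctx, "P", "p"))  # 1: both are followed by "oor"
print(shared(ctx, "P", "d"))  # 1: "d" (door) gets grouped too at first, as noted above
print(shared(ctx, "P", "t"))  # 0: an unrelated character shares nothing
```

On a longer text one would expect the contexts of "p" and "P" to keep overlapping while those of "d" drift apart, which is the correction through additional observations mentioned above.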
There is no hidden assumption that SCA knows she is observing a language in cave B. SCA is not a trained cryptographer, but rather an Aboriginal cryptographer. She performs statistical pattern matching only and forms the hypothesis that spirit B may have represented the concept of writing by using a sequence of letters, "said". She discards the hypothesis that a single character may correspond to the concept of writing (although she has some doubt about ":"). She discards the hypothesis that capitalised words are the words reported to be written. On the other hand, direct discourse in "The Adventures of Pinocchio" supports her hypothesis about "said".
SCA keeps generating hypotheses in this way, so that she learns to decode more knowledge without needing to know that the symbols are language (rather, she discovers the concept of language).
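And a companion sketch for the "said" hypothesis in particular: counting which word most often sits next to a quoted span. Again this is only a toy of mine; real direct discourse in the book is far more varied than the single regular expression assumed here.

```python
import re
from collections import Counter

def speech_marker_candidates(text):
    """Count the words that appear immediately before or after a quoted span;
    a word that co-occurs with quotes far more often than others is a
    candidate for the spirit's symbol for the concept of saying/writing."""
    counts = Counter()
    for match in re.finditer(r'"[^"]*"', text):
        before = text[:match.start()].split()
        after = text[match.end():].split()
        if before:
            counts[before[-1].strip(',.').lower()] += 1
        if after:
            counts[after[0].strip(',.').lower()] += 1
    return counts

sample = ('"Augrh!" said Father Wolf. "It is time to hunt again." '
          'Then the old cat said, "Why are you crying?"')
print(speech_marker_candidates(sample).most_common(3))
# [('said', 2), ('wolf', 1), ('then', 1)]
```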
TruePath, you are mistaken: my argument addresses the main issue of explaining computer understanding (moreover, it seems that you are confusing the Chinese room argument with the “system reply” to it).
Let me clarify. I could write the Chinese room argument as the following deductive argument:
1) P is a computer program that does [x]
2) There is no computer program sufficient for explaining human understanding of [x]
=> 3) Computer program P does not understand [x]
In my view, assumption (2) is not demonstrated and the argument should be reformulated as:
1) P is a computer program that does [x]
2’) Computer program P is not sufficient for explaining human understanding of [x]
=> 3) Computer program P does not understand [x]
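Writing $X(q)$ for "$q$ does [x]", $E(q)$ for "$q$ is sufficient for explaining human understanding of [x]", and $U(q)$ for "$q$ understands [x]" (my shorthand, not Searle's), the two versions are:

$$\text{original: } X(P),\ \forall q\,\neg E(q)\ \Rightarrow\ \neg U(P) \qquad\quad \text{reformulated: } X(P),\ \neg E(P)\ \Rightarrow\ \neg U(P)$$

The difference is that the original relies on a universally quantified premise, while the reformulated premise (2') must be established program by program.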
The argument still holds against any computer program satisfying assumption (2'). However, does a program exist that can explain human understanding of [x] (a program such that a human executing it understands [x])?
My reply focuses on this question. I suggest considering artificial semiosis. For example, a program P learns, solely from the symbolic experience of observing symbols in a sequence, that it should output “I say” (I have described what such a program would look like in my post). Another program Q could learn solely from symbolic experience how to speak Chinese. Humans do not normally learn a rule for using “I say”, or how to speak Chinese, in these ways, because their experience is much richer. However, we can reason about the understanding that a human would have if he could have only symbolic experience and the right program instructions to follow. The semiosis performed by the human would not differ from the semiosis performed by the computer program. It can be said that program P understands a rule for using “I say”. It could be said that program Q understands Chinese.
You can consider [x] to be a capability enabled by sensorimotor experience. You can consider [x] to be consciousness. My “semiosis reply” could of course be adapted to these situations too.
Daniel, I'm curious too. What do you think about Fluid Construction Grammar? Can it be a good theory of language?
I am skeptical about your theory of impact for investigating the question of which concepts would be convergent across minds, specifically your expectation that concepts validated through linguistic conventions may assist in non-ad-hoc interpretability of deep learning networks. Yet, I am interested in investigating semantics for the purpose of alignment. Let me try to explain how my model differs from yours.
First, for productively studying semantics, I recommend keeping a distinction between a semantics for vision (as the prototypical sensory input) and one for symbolic reasoning. I have the impression that your project can be described as curriculum learning for a visual reasoner. In the space of minds or programs, we have diffusion models and multi-modal LLMs, but there is also room for programs (say, a probabilistic computer vision program) that learn to make visual predictions from visual data more efficiently than deep learning. It is a legitimate research question how to design a curriculum enabling a visual reasoner to successfully form concepts of increasing complexity, or whether an algorithm exists enabling the visual reasoner to navigate autonomously through a mass of visual data and reach the same objective (in this paper, I described the visual reasoner as "visual AGI").
Early milestones could be learning concepts of edges, three-dimensional space, and rigid-body motion (very much in the direction of your toy example). Other milestones would be natural latents for shadows, textures, shapes, leaves, trees, laws of physics. I do not see any fundamental reason to stop there; instead, you can indeed project learning concepts for animals, concepts for various actions, concepts for the sense of sight (such as eyes as the organ of vision and the attention focus of an animal), and concepts related to animal behavior. Humans are a special case of animal, and this can be shown also through ethology.
I have two comments:
- you say you want to use language as one source of information about latents. There are two possibilities. The first involves the use of labels and even knowledge bases, but still requires a visual reasoner, only now operating on an augmented reality. This modality does not exist in nature and is available only to programs, but there certainly exist programs successfully navigating linguistic conventions (see also Bill Benzon's comment). It seems to me that you plan to use language in this way. The alternative is postulating a symbolic reasoner which has somehow come to master “language”, but then a different kind of semantics needs to be accounted for, and I could not find any discussion of it in your post.
- a visual reasoner does not seem to be an existential risk for humanity. Thanks to deep learning, we already have visual reasoners (it was easy to turn visual predictors into visual reasoners, even if adversarial attacks against them are still possible). The naturality of the laws of physics or of patterns of human individual/social behavior is a matter of scientific inquiry which may, however, not be central to AI safety. The study of the semantics of visual reasoning would support an interpretation of our AI as a naturalist, or even a human ethologist. On the contrary, I see risks connected to the possible development of an AI scientist (or AI computer scientist), but this development is subordinate to developing symbolic reasoning. I consider LLMs symbolic predictors whose internals are difficult to interpret. Can LLMs be turned into (potentially deadly) symbolic reasoners? I am not an expert; such a possibility cannot be safely excluded. My intuitions below about a semantics for symbolic reasoning would not provide much of a hint to those looking to integrate such reasoning capabilities into LLMs, but they would not assist in advancing LLM interpretability either. However, I think they can help with AI safety.
In the space of minds or programs, there is room for programs that learn symbolic reasoning in a way differing from human brains and in a way that is transparent to their programmer. A symbolic reasoner can be a program exposed solely to words and other symbols; for example, it does not need to have any sense of sight, it can be very data-efficient, and it does not need to have any equivalent in nature. A symbolic reasoner would start from a concept that someone used words to say something and create increasingly complex reflective concepts about words, about the inventors of the words, and about their other inventions such as mathematics and computer programming. I pioneered investigation into symbolic reasoning some years ago. An early milestone is understanding reported saying (such as in “Augrh!” said Father Wolf. “It is time to hunt again.”). Other milestones are learning the concepts of being an element of a class, of truth, of negation, of conjunction, of believing, of wanting, of functional application. You can see that the curriculum is completely different from that for a visual reasoner. The resulting semantics is less about the real world and involves rather reflection about a “mental” internal world.
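As an illustration of what the first milestone's target representation might look like, here is a minimal Python sketch. The hand-written pattern stands in for what the symbolic reasoner would have to learn on its own from exposure to text, and the record fields are hypothetical names of mine:

```python
import re

def parse_reported_saying(sentence):
    """Turn a sentence of the form  "<utterance>" said <speaker>.  into a
    reflective record that someone used words to say something."""
    m = re.match(r'\s*"(?P<utterance>[^"]+)"\s+said\s+(?P<speaker>[^.]+)\.', sentence)
    if m is None:
        return None
    return {"speaker": m.group("speaker").strip(),
            "act": "say",
            "words": m.group("utterance")}

print(parse_reported_saying('"Augrh!" said Father Wolf.'))
# {'speaker': 'Father Wolf', 'act': 'say', 'words': 'Augrh!'}
```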
Now: can a semantics for symbolic reasoning help with AI safety? My intuition is that a symbolic reasoner with the property of transparency can be made to reason about the goals of its programmer, such that there is at least a technical possibility of alignment (as opposed to language models, whose internals we do not know how to interpret). The bonus is the possibility to experiment with alignment at an early stage of the curriculum and to decide how far to go with it (when it is safe to pursue distant milestones such as geometry, algebra, physics, biology).