Hey, I am Robert Kralisch, an independent conceptual/theoretical Alignment Researcher. I have a background in Cognitive Science and I am interested in collaborating on an end-to-end strategy for AGI alignment.
I am one of the organizers for the AI Safety Camp 2025, working as a research coordinator by evaluating and supporting research projects that fit under the umbrella of "conceptually sound approaches to AI Alignment".
The three main branches that I aim to contribute to are conceptual clarity (what should we mean by agency, intelligence, embodiment, etc), the exploration of more inherently interpretable cognitive architectures, and Simulator theory.
One of my concrete goals is to figure out how to design a cognitively powerful agent such that it does not become a Superoptimiser in the limit.
I think that "epistemic rationality" matches very well with what I am thinking of as level 3, which is my notion of intelligence. It is indeed applicable to non-agentic systems.
I am still thinking about whether to include meta-learning (referring to updating level 3 algorithms based on experience) and meta-processes above that in my concept of intelligence.
Would this layer of meta-learning be part of epistemic rationality, do you think? It becomes particularly relevant if the system is resource constrained and has to prioritize what to learn about, and/or cares about efficiency of learning. These constraints feel a bit less natural to introduce for a non-agentic system, other than if said system is set up by an agentic system for some purpose.
In any case, instrumental rationality does not seem to cover all that I mean by Competence, perhaps something narrower, like "cognitive competence". I find it a bit difficult to systematically distinguish between cognitive competence and noncognitive competence, because the cognitive part of the system is also implemented by its embodiment - and there are various correlations between events in "morphospace" and "cognitive space".
One way of resolving that might be to distinguish between on-surface properties of a Markov blanket (corresponding to the system interfacing with its environment) and within-surface properties of that Markov blanket (corresponding to integrated regulation systems and the cognition). There will still be feedback loops between those properties, so our mileage on obtaining clean distinctions into different competencies may vary.
If you are interested in some more thoughts on that, you can check out my post on extended embodiment.
An independent idea is that it is apparently possible to divide "competence" or "instrumental rationality" into two independent axes: Generality, and intelligence proper (or perhaps: competence proper).
I have already commented a bit on meta-learning above. By default, my level 3 would refer just to online learning, but I am thinking of including different levels of meta-learning because of the algorithmic similarities.
Perhaps interestingly for you, I consider one of the primary purposes of meta-learning to refine a generally intelligent system into a more narrowly intelligent system, by improving its learning capabilities for a particular set of domains, in some sense biasing the cognition towards the kind of environment it seems to operate within (i.e. in terms of hypothesis generation, or which kinds of functions to use when approximating the behavior of an observed sequence).
Of course, unless the system loses its meta-learning capability, it will be able to respond to changes in its environment by re-aiming/updating its learning tendencies over time - so it is technically general if you give it some time, but ends up converging towards beneficial specialisation.
I think of generality of intelligence as relatively conceptually trivial. At the end of the day, a system is given a sequence of data via observation, and is now tasked with finding a function or set of functions that both corresponds to plausible transition rules of the given sequence, and has a reasonably high chance of correctly predicting the next element of the sequence (which is easy to train for by hiding later elements of the sequence from the modeling process and sequentially introducing them to test and potentially update the fit of the model).
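To make the "hide later elements and reveal them one by one" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three atomic functions, the composition depth of 2, and the names `candidate_rules`, `fits`, and `prequential_test` are hypothetical and not part of any existing framework. It only considers rules that map one element to the next, which is a much narrower hypothesis space than the general case.

```python
from itertools import product

# Hypothetical atomic functions; any small set would do for illustration.
ATOMS = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}


def candidate_rules(max_depth=2):
    """Yield (name, function) for every composition of atoms up to max_depth."""
    for depth in range(1, max_depth + 1):
        for names in product(ATOMS, repeat=depth):
            # Bind `names` as a default argument so each rule keeps its own chain.
            def rule(x, names=names):
                for n in names:
                    x = ATOMS[n](x)
                return x

            yield " -> ".join(names), rule


def fits(rule, seq):
    """True if the rule maps every element of seq to its successor."""
    return all(rule(a) == b for a, b in zip(seq, seq[1:]))


def prequential_test(sequence, n_visible):
    """Fit rules on a visible prefix, then reveal hidden elements one at a time.

    Returns the list of surviving rule names after the initial fit and after
    each revealed element.
    """
    visible = sequence[:n_visible]
    survivors = [(name, r) for name, r in candidate_rules() if fits(r, visible)]
    history = [[name for name, _ in survivors]]
    for i in range(n_visible, len(sequence)):
        prev, actual = sequence[i - 1], sequence[i]
        # Keep only the rules whose prediction matches the newly revealed element.
        survivors = [(name, r) for name, r in survivors if r(prev) == actual]
        history.append([name for name, _ in survivors])
    return history


if __name__ == "__main__":
    for step, names in enumerate(prequential_test([1, 3, 7, 15], n_visible=2)):
        print(step, names)
```

With the sequence `[1, 3, 7, 15]` and only the first two elements visible, two rules fit the prefix; revealing the third element eliminates one of them. A real system would of course update or extend its hypotheses rather than just filter a fixed list.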
Computationally speaking, the set of total atomic functions that you would have to consider in order to be able to compositionally construct arbitrary transition rules for sequences of discrete data packages, is very small. The only mathematical requirement is Turing universality - basically the entire difficulty arises due to resource constraints.
This seems to match with your thoughts about the appearance of greater generality simply due to more processing power. A cognitive system that is provided with more processing power could use that either to search more deeply those regions of causal models that it naturally considers, or it could branch out to consider new regions within model-space. Many brains in the animal kingdom seem to implement a sort of limited generative simulation of their environment, so that could be considered as a fairly general problem domain.
I could try to write more on this, but I am curious what you think about this so far and if I come across as reasonably clear.
Thanks a lot for the encouragement :)
Yes, I am trying to understand a generalized (which also means simplified) and formalizable parallel to human cognition. Some of my thinking on this is inspired by predictive coding and adaptive resonance theory (although pretty loosely by the latter), and I am trying to figure out the implications of our most updated understanding of neurobiological principles, together with a notion of the "riverbeds of cognition".
In other words, how can we design an architecture such that it is not pressured to take shortcuts or "work around" design decisions we made, as its cognition develops? Is there a "natural path" of cognitive development that avoids some of the common pitfalls and failure modes (i.e. can we aim inner alignment if we have proficiency in this area)?
This has a direct bearing on interpretability, and goes together with the goal of a sort of "conceptual curriculum" that is intended to teach the system natural abstractions.
If I remember correctly, the centrality of "constraint satisfaction" fell out of considering causal (hyper/meta)graphs as sensible representational substrate (which was partially inspired by Ben Goertzel). I personally find it quite intuitive to think in graphs.
Hm, I'll give some thought to how to integrate different types of data with this picture, but I think that the "useful" classification of data ultimately depends on whether the agent possesses the right "key" to interpret it, and by extension, how difficult that "key" is to produce from concepts that the agent is already proficient with.
At the end of the day, the agent can only "understand" any data in terms of internalized concepts, so there will often be some uncertainty whether the difficulty is in translating sensible data into that internal representation or the difficulty is in the data being about phenomena somewhat (or far) outside of conceptual bounds. Having a "key" with respect to some data means that this data can be reliably translated.
Referential containment is about how to structure internal representations of concepts such as to make them most useful (which could mean flexible, or it could mean efficient, etc, depending on the problem domain and constraints).
If humanity forgot all of its medical knowledge tomorrow, would we discover the same categories and sub-categories of medicine, structuring our knowledge and drawing the distinctions into the different areas of expertise similarly?
We could gather a bunch of observational data about "the art of preventing people from dying" and find clusters in interventions or treatment strategies that have appropriate conceptual size to teach to different groups of people. Note that this changes depending on how many people we can allocate in total, how much we believe reliably fits into a single human mind, and how many common features there are in curative or preventative measures (commonalities here roughly referring to useful information that is referentially contained with respect to medicine, but can not be further contained in a single sub-field or small cluster thereof).
Let me know if that makes sense.
Yeah, I wish we had some cleaner terminology for that.
Finetuning the "simulation engine" towards a particular task at hand (i.e. to find the best trade-off between breadth and depth search in strategy games, or even know how much "thinking time" or "error allowance" to allocate to a move), given limited cognitive resources, is something that I would associate with level 3 capability.
It certainly seems like learning could go into the direction of making the model of the game more useful by either improving the extent to which this model predicts/outputs good moves or by improving the allocation of cognitive resources to the sub-tasks involved. Presumably, an intelligent system should be capable of testing which improvement vectors seem most fruitful (and the frequency with which to update this analysis), but I find myself a bit confused about whether that should count as level 3 or as level 4, since the system is reasoning about allocating resources across relevant learning processes.
Thanks!
In your example, I think it is possible that the hunter-gatherer solves the problem through pure level 2 capability, even if they never encountered this specific problem before. Using causal models compositionally to represent the current scene, and computing it to output a novel solution, does not actually require that the human updates their causal models about the world.
I am trying to distinguish agents with this sort of compositional world model from ones that just have a bunch of cached thoughts or habits (which would correspond to level 1), and I think this is perhaps a common case where people would attribute intelligence to a system that imo does not demonstrate level 3 capability.
Of course, this would require that the human in our example already has some sufficiently decontextualised notion of knocking loose objects down, or that generally their concepts are suited to this sort of compositional reasoning. It might be worth elaborating on level 2 to introduce some measure of modeling flexibility/compositionality.
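As a rough illustration of what I mean by solving a novel problem with pure level 2 capability, here is a small Python sketch. The scenario, the fact names, and the three "causal models" are all hypothetical stand-ins that I made up for this purpose; the point is only that the planner composes a fixed set of models into a plan it has never produced before, and that nothing in `CAUSAL_MODELS` gets updated along the way.

```python
from collections import deque

# Each fixed "causal model": name -> (preconditions, facts added, facts removed).
# All entries are hypothetical and chosen only for illustration.
CAUSAL_MODELS = {
    "pick_up_stone": (
        {"stone_on_ground"},
        {"holding_stone"},
        {"stone_on_ground"},
    ),
    "throw_stone_at_fruit": (
        {"holding_stone", "fruit_on_branch"},
        {"fruit_on_ground", "stone_on_ground"},
        {"holding_stone", "fruit_on_branch"},
    ),
    "pick_up_fruit": (
        {"fruit_on_ground"},
        {"holding_fruit"},
        {"fruit_on_ground"},
    ),
}


def plan(start, goal):
    """Breadth-first search over compositions of the fixed causal models.

    Returns a shortest list of model names that reaches the goal facts,
    or None if no composition works. The models themselves are never changed.
    """
    start = frozenset(start)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, (pre, add, remove) in CAUSAL_MODELS.items():
            if pre <= state:
                nxt = frozenset((state - remove) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None


if __name__ == "__main__":
    print(plan({"stone_on_ground", "fruit_on_branch"}, {"holding_fruit"}))
```

If the right decontextualised model is missing (say, no stone is available and nothing else can be thrown), the search simply fails, and getting past that would require learning a new model, which is where level 3 would come in.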
I feel like this could be explained better, so I am curious if you think I am being clear.
You are probably right that I should avoid the term intelligence for the time being, but I haven't quite found an alternative term that resonates. Anyways, thanks for engaging!
Edit: I'll soon make some changes to the post to better account for this feature of level 2 algorithms to potentially solve novel problems even if no new learning occurs. It's an important aspect of why I am saying that level 3 capabilities are only indirectly related to competence.
the maximum plan length is only steps
You mean the maximum length for an efficient/minimal plan, right? Maybe good to clarify (even if obvious in this case). Just a thought.
I believe that it is very sensible to bring this sort of structure into our approach to AGI safety research, but at the same time it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potentials of different approaches.
It is a feedback loop where we make each step according to our best theory of where to make it, and use the understanding gleaned from that step to update the theory (when necessary), which could well mean that we retrace some steps and recalibrate (this can be the case within and across questions). I think this connects to what both Charlie and Tekhne have said, though I believe Tekhne could have been more charitable.
In this light, it makes sense to emphasize the openness of the theory to being updated in this way, which also qualifies the ways in which the theory is allowed to be yet incomplete. Putting more effort into clarifying what this update process should look like seems like a promising addition to the framework that you propose.
On a more specific note I felt that Q5 could just be in position 2 and maybe a sixth question would be "What is the predicted timeline for stable safety/control implementations?" or something of the sort.
I also think that phrasing our research in terms of "avoiding bad outcomes" and "controlling the AGI" biases the way in which we pay attention to these problems. I am sure that you will also touch on this in the more detailed presentation of these questions, but at the resolution presented here, I would prefer the phrasing to be more open.
"Aiming at good outcomes while/and avoiding bad outcomes" captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can be best addressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better ability to deal with future risks) and avoid bad outcomes on the way there. It might rightfully appear that this is a more ambitious aim, but it is the less predisposed outlook! Many strategy games are based on the idea that you have to accumulate resources and avoid losses while at the same time improving your ability to accumulate resources and avoid losses in the future. Only focusing on the first aspect is a specific strategy in the space of possible ones, and often employed when one is close to losing. This isn't a perfect analogy in a number of ways, but serves to point out the more general outlook.
Similarly, we expect a superintelligent AGI to be out of our ability to control at some point, which invokes notions of "self-control" on part of the AGI or "justified trust" on our part - therefore, perhaps "influencing the development of the AGI" would be better, as, again, "influence" can cover more conceptual ground but can still be hardened into the more specific notion of "control" when appropriate.
Yeah, I am not super familiar with PCA, but my understanding is that while both PCA and referential containment can be used to extract lower-dimensional or more compact representations, they operate on different types of data structures (feature vectors vs. graphs/hypergraphs) and have different objectives (capturing maximum variance vs. identifying self-contained conceptual chunks). Referential containment is more focused on finding semantically meaningful and contextually relevant substructures within a causal or relational knowledge representation. It also tries to address the opposite direction, basically how to break existing concepts apart when zooming into the representations, and I am not sure if something like that is done with PCA.
I had Claude 3 read this post and compare the two. Here it is, if you are interested (keep in mind that Claude tends to be very friendly and encouraging, so it might be valuing referential containment too highly):
Similarities:
Differences: