At some point you have to deal with the fact that understanding the world entails knowing lots and lots of stuff—things like “tires are usually black”, or “it’s gauche to wear white after Labor Day”, etc.
There seem to be only two options:
There might be a middle way between these—I think the probabilistic programming people might describe their roadmap-to-AGI that way?—but I don’t understand those kinds of plans, or if I do, then I don’t believe them.
I think the second setup still allows for powerful AGI that's more explainable than current AI, in the same way that humans can kind of explain decisions to each other, but not very well at the level of neuroscience.
If something like natural abstractions are real, then this would get easier. I have a hard time not believing a weak version of this (e.g. human and AGI neuron structures could be totally different, but they'd both end up with some basic things like "the concept of 1").
On https://consensusknowledge.com, I described the idea of building a knowledge database that is understandable to both people and computers, that is, to all intelligent agents. It would be a component responsible for memory and for interactions with other agents. Using this component, agents could increase their intelligence much faster, which could lead to the emergence of collective human superintelligence, AGI, and, more generally, the collective superintelligence of all intelligent agents. At the same time, due to the interpretability of the database of knowledge and information, such intelligence would be much safer. Thinking performed by AI would also be much more interpretable.
Please let me know what you think about this.
I haven't read it in detail.
The hard part of the problem is that we need a system that can build up a good world model on its own. There is too much stuff; it would take way, way too long for a human to enter everything. I also think the algorithm needs to be able to process basically arbitrary input streams, e.g. build a model of the world just from a camera feed and the input of a microphone.
And then we want to figure out how to constrain the world model, such that if we use some planning algorithm we also designed on this w...
I feel like the thing that I'm hinting at is not directly related to QACI. I'm talking about a specific way to construct an AGI where we write down all of the algorithms explicitly, whereas the QACI part of QACI is about specifying an objective that is aligned when optimized very hard. It seems like, in the thing that I'm describing, you would get the alignment properties from a different place: you get them because you understand the algorithm of intelligence that you have written down very well. Whereas in QACI, you get the alignment properties by successf...
I am also interested in interpretable ML. I am developing artificial semiosis, a human-like AI training process which can achieve aligned (transparency-based, interpretability-based) cognition. You can find an example of the algorithms I am making here: the AI runs a non-deep-learning algorithm, does some reflection and forms a meaning for someone “saying” something, a meaning different from the usual meaning for humans, but perfectly interpretable.
I support then the case for differential technological development:
There are two counter-arguments to this that I'm aware of, that I don't think in themselves justify not working on this.
Regarding 1, it may take several years for interpretable ML to reach capabilities equivalent to LLMs, but the future may offer surprises, either in terms of coordination to pause the development of "opaque" advanced AI or of deep learning hitting a wall... at killing everyone. Let's have a plan also for the case in which we are still alive.
Regarding 2, interpretable ML would need to have programmed control mechanisms to be aligned. There is currently no such field of AI safety, since we do not yet have interpretable ML, but I imagine computer engineers being able to make progress on these control mechanisms (more progress than on mechanistic interpretability of LLMs). While it is true that control mechanisms can be disabled, you can always advocate for the highest security (as in Ian Hogarth's Island idea). You can then also reject this counterargument.
mishka noted that this paradigm of AI is more foomable. Self-modification is a huge problem. I have an intuition interpretable ML will exhibit a form of scaffolding, in that control mechanisms for robustness (i.e. for achieving capabilities) can advantageously double as alignment mechanisms. Thanks to interpretable ML, engineers may be able to study self-modification already in systems with limited capabilities and learn the right constraints.
Umm, I don't know how deep you've gotten into even simple non-AI machine learning. But this is based on simply wrong assumptions. Even your simplification is misleading -
If you're having a list sorting algorithm like QuickSort, you can just look at the code and then get lots of intuitions about what kinds of properties the code has
I've talked with and interviewed a lot of software developers, and it's probably fewer than 5% that really understand QuickSort including the variance in performance on pathological lists. This is trivially simple compared to large models, but not actually easy or self-explaining.
I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.
I am pretty sure that this is not possible.
I've talked with and interviewed a lot of software developers, and it's probably fewer than 5% that really understand QuickSort including the variance in performance on pathological lists. This is trivially simple compared to large models, but not actually easy or self-explaining.
Well, these programmers probably didn't try to understand Quicksort. I think you can see simple dynamics such as, "oh, this will always return a list that is the same size as the list that I input" and "all the elements in that list will be elements from the original list, in a bijective mapping; there won't be different elements and there won't be duplicated elements or anything like that." That part is pretty easy to see. And now there are some pathological cases for Quicksort, though I don't understand the mechanics of why they arise. However, I'm pretty sure that I could, within one hour, understand very well what these pathological cases are and why they arise, and how I might change a Quicksort algorithm to handle a particular pathological case well. That is, I'm not saying I look at Wikipedia and just read up on the pathological cases; I just look at the algorithm alone and then derive the pathological cases. Maybe an hour is not enough, I'm not sure. That seems like an interesting experiment to test my claim.
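Something like this minimal sketch (a naive first-element-pivot Quicksort, just my own toy version to make the point concrete) is what I have in mind; both the easy-to-see properties and the sorted-input pathological case can be read off the code itself:

```python
import sys

def quicksort(xs):
    """Naive Quicksort with a first-element pivot."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

# Properties you can read off the code: the output is built only from `pivot`
# and recursive calls on disjoint parts of `rest`, so it is a permutation of
# the input with the same length; every call recurses on strictly shorter
# lists, so it terminates.
print(quicksort([3, 1, 2, 1]))  # [1, 1, 2, 3]

# Pathological case you can also derive from the pivot choice alone: on an
# already sorted list, every pivot is the minimum, so `smaller` is always
# empty and the recursion depth (and total work) degrades from ~log n to ~n.
sys.setrecursionlimit(5000)
print(len(quicksort(list(range(1200)))))  # works, but recursion is ~1200 deep
```

The fix for that pathological case (pick a random pivot, or the median of three elements) also falls out of the same reading of the code.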
I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.
I am pretty sure that this is not possible.
Could you explain why you think that this is not possible? Do you really think there isn't an explicit Python program that I can write down, such that within the Python program I, e.g., write down the step-by-step instructions that, when followed, end up building an accurate model of the world? And such that the program does not use any layered optimization like SGD or something similar? Do you think these kinds of instructions don't exist? Well, if they don't exist, how does the neural network learn things like constructing a world model? How does the human brain do it?
Once you write down your algorithm explicitly like that, I just expect that it will have this structural property I'm talking about of being possible to analyze and get intuitions about the algorithm.
but I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.
Interesting. I have semi-strong intuitions in the other direction. These intuitions are mainly from thinking about what I call the Q-gap, inspired by Q Home's post and this quote:
…for simple mechanisms, it is often easier to describe how they work than what they do, while for more complicated mechanisms, it is usually the other way around.
Intelligent processes are anabranching rivers of causality: they start and end at highly concentrated points, but the route between is incredibly hard to map. If you find an intelligent process in the wild, and you have yet to statistically ascertain which concentrated points its many actions converge on (aka its intentionality), then this anabranch will appear as a river delta to you.
Whereas simple processes that have no intentionality just are river deltas. E.g., you may know everything about the simple fundamental laws of the universe, yet be unable to compute whether it will rain tomorrow.
That is an interesting analogy.
So if I have a simple AGI algorithm, then if I can predict where it will move to, and understand the final state it will move to, I am probably good, as long as I can be sure of some high-level properties of the plan. I.e. the plan should not take over the world, let's say. That seems to be a property you might be able to predict of a plan, because taking over the world would make the plan so much longer than just doing the obvious thing. This isn't easy of course, but I don't think having a system that is more complex would help with this. Having a system that is simple makes it simpler to analyze the system in all regards, all else equal (assuming you don't make it short by writing a code-golf program; you still want to follow good design practices and lay out the program in the obvious, most understandable way).
As a side note before I get into why I think the Q-gap is probably wrong: the fact that I can't predict whether it will rain tomorrow, even with a perfect model of the low-level dynamics of the universe, has more to do with how much compute I have available. I might be able to predict whether it will rain tomorrow if I knew the initial conditions of the universe and had some very large but finite amount of compute, assuming the universe is not infinite.
I am not sure the Q-gap makes sense. I can have a 2D double pendulum. This is very easy to describe and hard to predict. I can make a chaotic system more complex, and then it becomes a bit harder to predict, but not really by much. The double pendulum is already not analytically solvable for 2 joints (according to Google).
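To make that concrete, here is a rough numerical sketch (my own toy version, assuming unit masses and lengths and a plain RK4 loop): the equations of motion fit in a few lines, yet a 1e-9 radian difference in the starting angle grows by many orders of magnitude over 30 simulated seconds.

```python
import numpy as np

g, m1, m2, L1, L2 = 9.81, 1.0, 1.0, 1.0, 1.0

def deriv(s):
    """Standard double-pendulum equations of motion for state [th1, w1, th2, w2]."""
    th1, w1, th2, w2 = s
    d = th1 - th2
    den = 2 * m1 + m2 - m2 * np.cos(2 * d)
    dw1 = (-g * (2 * m1 + m2) * np.sin(th1) - m2 * g * np.sin(th1 - 2 * th2)
           - 2 * np.sin(d) * m2 * (w2**2 * L2 + w1**2 * L1 * np.cos(d))) / (L1 * den)
    dw2 = (2 * np.sin(d) * (w1**2 * L1 * (m1 + m2) + g * (m1 + m2) * np.cos(th1)
           + w2**2 * L2 * m2 * np.cos(d))) / (L2 * den)
    return np.array([w1, dw1, w2, dw2])

def rk4_step(s, dt):
    k1 = deriv(s)
    k2 = deriv(s + dt / 2 * k1)
    k3 = deriv(s + dt / 2 * k2)
    k4 = deriv(s + dt * k3)
    return s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Two nearly identical initial conditions: both arms at 135 degrees,
# one angle perturbed by 1e-9 radians.
a = np.array([3 * np.pi / 4, 0.0, 3 * np.pi / 4, 0.0])
b = a + np.array([1e-9, 0.0, 0.0, 0.0])

dt = 0.001
for step in range(1, 30001):  # 30 simulated seconds
    a, b = rk4_step(a, dt), rk4_step(b, dt)
    if step % 5000 == 0:
        print(f"t={step * dt:4.1f}s  angle gap = {abs(a[0] - b[0]):.3e} rad")
# The printed gap grows by many orders of magnitude: knowing the rules is not
# the same as being able to predict the behavior.
```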
That describing how complex mechanisms work seems harder than saying what they do might be an illusion. We as humans have a lot of abstractions in our heads for thinking about the real world. A lot of the things that we build mechanisms to do are expressible in these concepts, so they seem simple to us. This is true for most mechanisms we build that produce some observable output.
If we ask "What does this game program running of a computer do?" We can say something like "It creates the world that I see on the screen." That is a simple explanation in terms of observed effects. We care about things in the world, and for those things we normally have concepts, and then machines that manipulate the world in ways we want have interpretable output.
There is also the factor that we need complex programs for things where we have not figured out a good general solution, which would then be simple. If we have a complex program in the world, it might be complex because the creators have not figured out how to do it the right way.
So I guess I am saying that there are two properties of a program: chaoticness and Kolmogorov complexity. Increasing one always makes the program less interpretable if the other stays fixed, assuming we are only considering optimal algorithms, and not a bunch of haphazard heuristics we use because we have not figured out the best algorithm yet.
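A toy illustration of why I think these are separate axes (just my own sketch): two programs of essentially identical length, differing in one constant, where only the chaotic one is hard to predict far ahead.

```python
# Same Kolmogorov complexity (one constant changed), very different chaoticness.
def iterate(r, x=0.1, n=50):
    """Iterate the logistic map x -> r * x * (1 - x) n times."""
    for _ in range(n):
        x = r * x * (1 - x)
    return x

print(iterate(2.5))              # settles to the fixed point 0.6 regardless of x
print(iterate(3.9))              # chaotic regime: depends sensitively on x
print(iterate(3.9, 0.1 + 1e-9))  # a 1e-9 change in x gives a very different answer
```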
I am writing it as a comment, not as an answer (the answers, I suspect, are more social; people are not doing this yet, because the methods which would work capability-wise are mostly still in their blind spots).
two counter-arguments to this
Technically, it has been too difficult to do it this way. But it is becoming less and less difficult, and various versions of this route are becoming more and more feasible.
Although, the ability to predict behavior is still fundamentally limited, because systems like that become complex really easily (one can have very complex behavior with a really small number of parameters), and because they will interact with the complex world around them (so one really needs to reason about the world containing software systems like this; even if the software systems themselves are transparent and interpretable, if they are smart, the overall dynamics might be highly non-trivial).
This kind of paradigm (if it works) makes it much easier to modify these systems, so it is much easier to have self-modifying AIs, or, more likely, self-modifying ecosystems of AIs producing changing populations of AI systems.
Capability-wise, this is likely to give such systems a boost competing with current systems where self-modification is less fluent and so far rather sluggish.
But this is even more foomable than the status quo. So one really needs to solve AI existential safety for a self-evolving, better and better self-modifying ecosystem of AIs; this is even more urgent with this approach than with the current mainstream.
Might this problem be easier to solve here? Perhaps... At least, with (self-)modification being this fluent and powerful, one can direct it this way and that way more easily than with more sluggish and resistant methods. But, on the other hand, it is very easy to end up with a situation where things are changing even faster and are even more difficult to understand...
I do like looking at this topic, but the safety-related issues in this approach are, if anything, even more acute (faster timelines + very fluently reconfigurable machines)...
I expect that it is much more likely that most people are looking at the current state of the art and don't even know or think about other possible systems and just narrowly focus on aligning the state of the art, not considering creating a "new paradigm", because they think that would just take too long.
I would be surprised if there were a lot of people who carefully thought about the topic and used the following reasoning procedure:
"Well, we could build AGI in an understandable way, where we just discover the algorithms of intelligence. But this would be bad because then we would understand intelligence very well, which means that the system is very capable. So because we understand it so well now, it makes it easier for us to figure out how to do lots of more capability stuff with the system, like making it recursively self-improving. Also, if the system is inherently more understandable, then it would also be easier for the AI to self-modify because understanding itself would be easier. So all of this seems bad, so instead we shouldn't try to understand our systems. Instead, we should use neural networks, which we don't understand at all, and use SGD in order to optimize the parameters of the neural network such that they correspond to the algorithms of intelligence, but are represented in such a format that we have no idea what's going on at all. That is much safer because now it will be harder to understand the algorithms of intelligence, making it harder to improve and use. Also if an AI would look at itself as a neural network, it would be at least a bit harder for it to figure out how to recursively self-improve."
Obviously, alignment is a really hard problem, and it is actually very helpful to understand what is going on in your system at the algorithmic level in order to figure out what's wrong with that specific algorithm: how is it not aligned, and how would we need to change it in order to make it aligned? At least, that's what I expect. I think not using an approach where the system is interpretable hurts alignment more than capabilities. People have been steadily making progress at making our systems more capable, and not understanding them at all, in terms of what algorithms they run inside, doesn't seem to be much of an issue there; for alignment, however, that's a huge issue.
I share your intuition. Turing already conjectured how much computing power an AGI needs, and his estimate was quite small. I think the hardest part was getting to computers; AGI is just making a program that is a bit more dynamic.
I can recommend all of Marvin Minsky's work. The Society of Mind is very accessible and has an online version. In short, the mind is made of smaller sub-pieces; the important aspects are the orchestration and the architecture of these resources. Minsky also has some stuff on how you put that into programs.
The most concrete stuff I know of:
EM-ONE: An Architecture for Reflective Commonsense Thinking, by Push Singh,
is very concrete, with a code implementation. It implements some of the layers of critics that Minsky described in "The Emotion Machine", which are a hypothesis of how common sense could be built.
It was read by Aaron Sloman and Gerald Sussman; isn't this super cool?
It is useful to first think of the concepts before programming something. We might be thinking of slightly different things with the word "algorithm"; it sounds very low-level to me, while the important thing is the architecture of a program, not the bricks it is made out of.
I think the problem with the things you mention is that they are just super vague, so vague that you don't even know what the thing you are talking about is. What does it mean that:
Most important of all, perhaps, is making such machines learn from their own experience.
Finally, we'll get machines that think about themselves and make up theories, good or bad, of how they, themselves might work.
Also, all of this seems to be some sort of vague stuff about imagining how AI systems could be. I'm actually interested in just making the AI systems, and making them in a very specific way such that they have good alignment properties, not vaguely philosophizing about what could happen. The whole point of writing down algorithms explicitly, which is one non-dumb way to build AGI, is that you can just see what's going on in the algorithm, understand it, and design the algorithm in such a way that it would think in a very particular way.
So it's not like "oh yes, these machines will think some stuff for themselves and it will be good or bad." It's more like: I make these machines think. How do I make them think? What's the actual algorithm to make them think? How can I make this algorithm such that it will actually be aligned? And I am controlling what they are thinking, I am controlling whether it's good or bad, I am controlling whether they are going to build a model of themselves. Maybe that's dangerous for alignment purposes in some contexts, and then I would want the algorithm to prevent the system from building a model of itself.
For, at that point, they'll probably object to being called machines.
I think it's pretty accurate to say that I am a machine.
(Also, as a meta note, it would be very good, I think, if you do not break the lines as you did in this big text block because that's pretty annoying to block quote.)
I am somewhat baffled by the fact that I have never run into somebody who is actively working on developing a paradigm of AGI which is targeted at creating a system that is just inherently transparent to the operators.
If you're having a list sorting algorithm like QuickSort, you can just look at the code and then get lots of intuitions about what kinds of properties the code has. An AGI would of course be much, much more complex than QuickSort, but I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.
And this seems to be especially the case when you consider that we can build the system in such a way that we have many components, and these components have sub-components, such that in the end we have some pretty small set of instructions that does some specific task and is then understandable. And if you understand this small set of instructions, you can probably understand how it behaves in some larger module that uses it.
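Something like this toy sketch is what I mean (the names here are made up; the point is only the structure): each small piece has a contract you can check by reading it, and the larger module is understandable in terms of those contracts.

```python
from dataclasses import dataclass

@dataclass
class Belief:
    facts: dict  # e.g. {"tire_color": "black"}

def observe(sensor_reading: dict) -> dict:
    # Contract: return only the features that are actually present; never mutate the input.
    return {k: v for k, v in sensor_reading.items() if v is not None}

def update_beliefs(belief: Belief, features: dict) -> Belief:
    # Contract: every new feature ends up in the output; old facts survive unless overridden.
    return Belief(facts={**belief.facts, **features})

def agent_step(belief: Belief, sensor_reading: dict) -> Belief:
    # The top-level property follows from the two contracts above: after a step,
    # every observed feature appears in the belief state.
    return update_beliefs(belief, observe(sensor_reading))

b = agent_step(Belief(facts={}), {"tire_color": "black", "noise": None})
print(b.facts)  # {'tire_color': 'black'}
```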
Everything interpretability tries to do, we would just get for free in this kind of paradigm. Moreover, we could design the system in such a way that we have additional good properties. Instead of using SGD to find just some set of weights that performs well, which we then interpret, we could constrain the kinds of algorithms we design to be as interpretable as possible, so that we are not subjected so strongly to the will of SGD and whatever algorithms it finds.
Maybe these people exist (if you are one, please say hello), but I have talked to probably between 20 and 40 people who would describe themselves as doing AI alignment research, and something like this never came up even remotely.
Basically, this is my current research agenda now. I'm not necessarily saying this is definitely the best thing that will save everyone and that everybody should do this, but if zero people do this, it seems pretty strange to me. So I'm wondering if there are some standard arguments that I have not come across yet for why this kind of thing is actually really stupid to do.
There are two counter-arguments to this that I'm aware of, that I don't think in themselves justify not working on this.
Though I'm not even sure how much of a problem point 2 is, because that seems to be a problem in any paradigm: no matter what we do, we probably end up being able to build unaligned AGI before we know how to align it. But maybe it is especially pronounced in this kind of approach. Though consider how much effort we need to invest in order to bridge the gap from being able to build an unaligned AI to being able to build an aligned AI, in any paradigm; I think that time might be especially short in this paradigm.
I feel like what MIRI is doing doesn't quite count. At least from my limited understanding, they are trying to identify problems that are likely to come up in highly intelligent systems and solve these problems in advance, but not necessarily advancing <interpretable/alignable> capabilities in the way that I am imagining. Though I do, of course, have no idea about what they're doing in terms of the research that they do not make public.