Meta commentary: this post is a great example of how to do the very earliest stages of conceptual research. Well done.
Nice post. I agree that a crucial part of AGI alignment should involve routing an AI's knowledge of human values to its own internal motivational circuitry, such that as its knowledge of human needs/goals/drives/preferences grows, so too does its alignment to those things. One key to this part of the problem may be to build in structural and inductive biases that steer the AI toward less inscrutable models.
I would say that to "know" something necessitates being able to make accurate predictions related to that thing. For most learning systems, this would imply developing some sort of generative or predictive model of its training data. In your dog/fish example, this might be realized with something like a conditional GAN, maybe combined with an autoencoder, where "knowing" the class of a sample allows the model to predict features of the sample (e.g., "fish" class -> there will be fins about here and scales about here; "dog" class -> there will be two eyes on the face, furry texture on the body, etc.). Combining the class label with some sort of latent-space representation should enable it to closely reproduce the full image.
The "knowledge" here is contained less in the class labels and latent space representations and more in the parameters and structure of the generative model, which is where it actually learned the generative/causal structure of its training data. This kind of knowledge allows such models to do things like inpainting, denoising, super-resolution, and animation of an image, generating information that was not in its inputs but that it predicts "ought" to be there based on what it has learned before.
This idea is also related to the predictive coding theory of the brain, where perception happens by constantly trying to generate predictions of what the senses will receive and continuously updating based on prediction errors. Again, "knowledge" exists in the generative models and causal graphs that the brain uses to make these predictions.
Thanks! I get your arguments about "knowledge" being restricted to predictive domains, but I think it's (mostly) just a semantic issue. I also don't think the specifics of the word "knowledge" are particularly important to my points, which is what I attempted to clarify at the start. But I've clearly typical-minded and assumed that of course everyone would agree with me about a dog/fish classifier having "knowledge", when it's more of an edge case than I thought! Perhaps a better version of this post would have either tabooed "knowledge" altogether or picked a more obviously-knowledge-having model.
Well, it certainly has mutual information with the training data, even if it only acts as a classifier (actually, classifiers can be seen as inverse generative models, so there is some generative-ish information there, as well). From that perspective, your arguments certainly hold. Although, I'm not sure if "mutual information" is precisely what you're going for, either. Yes, I agree, I should have tabooed "knowledge" in how I read it.
Epistemic status: my own thoughts I've thought up in my own time. They may be quite or very wrong! I am likely not the first person to come to these ideas. All of my main points here are just hypotheses which I've come to by the reasoning stated below. Most of it is informal mathematical arguments about likely phenomena and none is rigorous proof. I might investigate them if I had the time/money/programming skills. Lots of my hypotheses are really long and difficult-to-parse sentences.
I think this question is bad.
It's too great of a challenge. It asks us (implicitly) for a mathematically rigorous definition which fits all of our human feelings about a very loaded word. This is often a doomed endeavour from the start, as human intuitions don't neatly map onto logic. Also, humans might disagree on what things count as or do not count as knowledge. So let's attempt to right this wrong question:
I think this is much better.
We limit ourselves to systems which can definitely be said to "know" something. This allows us to pick a starting point. This might be a human, GPT-3, or a neural network which can tell apart dogs and fish. In fact this last one will be my go-to example for the rest of the post. We also don't need to perfectly specify the process which generates knowledge all at once, only comment on its likely properties.
Properties of "Learning"
Say we have a very general system with parameters θ, with t representing time during learning. Let's say they're initialized as θ0 according to some random distribution. Now it interacts with the dataset, which we will represent with X, taken from some distribution over possible datasets. The learning process will update θ0, so we can represent the parameters after some amount of time as θ(θ0; X; t). This reminds us that the set of parameters depends on three things: the initial parameters, the dataset, and the amount of training.
Consider θ(θ0; X; 0). This is trivially equal to θ0, and so it depends only on the choice of θ0. The dataset has had no chance to affect the parameters in any way.
So what about as t→∞? We would expect that θ∞(θ0; X)=θ(θ0; X; ∞) depends mostly on the choice of X and much less strongly on θ0. There will presumably be some dependency on initial conditions, especially for very complex models like a big neural network with many local minima. But mostly it's X which influences θ.
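As a minimal sketch of the two limits above, here is a toy learner whose parameters can be inspected directly. Everything in it is an assumption made for illustration: the "system" is a two-parameter linear fit trained by full-batch gradient descent, and `make_dataset`, `train`, and `distance` are hypothetical helper names. Because this toy problem is convex, the dependence on θ0 vanishes entirely at large t, which is stronger than what we should expect from a big neural network with many local minima.

```python
import random

def make_dataset(slope, intercept, n=50, seed=0):
    """Hypothetical toy dataset X: points (x, y) scattered around a line."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = slope * x + intercept + rng.gauss(0.0, 0.01)
        data.append((x, y))
    return data

def train(theta0, X, t, lr=0.1):
    """Return theta(theta0; X; t): t steps of gradient descent on mean squared error."""
    w, b = theta0
    n = len(X)
    for _ in range(t):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in X) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in X) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def distance(a, b):
    """Euclidean distance between two parameter vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

X_a = make_dataset(slope=2.0, intercept=1.0, seed=1)
X_b = make_dataset(slope=-3.0, intercept=0.5, seed=2)
theta0_1 = (5.0, -5.0)
theta0_2 = (-4.0, 3.0)

# t = 0: the parameters are just theta0, whatever the dataset is.
print(train(theta0_1, X_a, 0) == theta0_1)

# Large t, same dataset, different initialisations: the gap should be tiny.
gap_init = distance(train(theta0_1, X_a, 2000), train(theta0_2, X_a, 2000))

# Large t, same initialisation, different datasets: the gap should be large.
gap_data = distance(train(theta0_1, X_a, 2000), train(theta0_1, X_b, 2000))

print(gap_init, gap_data)
```

In this toy, the first gap should come out many orders of magnitude smaller than the second, which is the "influence passes from θ0 to X" picture in its most extreme form.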
So far this is just writing out basic sequences stuff. To make a map of the city you have to look at it, and to learn your model has to causally entangle itself with the dataset. But let's think about what happens when X is slightly different.
Changes in the world
So far we've represented the whole dataset with a single letter X, as if it were just a number or something. But in reality it will have many, many independent parts. Most datasets which are used as inputs to learning processes are also highly structured.
Consider the dog-fish discriminator, trained on the dataset Xdog/fish. The system θ∞(θ0; Xdog/fish) could be said to have "knowledge" that "dogs have two eyes". One thing this means is that if we instead fed it an X which was identical except that every dog had three eyes (TED), then the final values of θ would be different. The same is true of facts like "fish have scales" and "dogs have one tail". We could express this as follows:
θ∞(θ0; Xdog/fish+ΔXTED)
Where ΔXTED is the modification of "photoshopping the dogs to have three eyes". We now have:
θ∞(θ0; Xdog/fish+ΔXTED)=θ∞(θ0; Xdog/fish)+Δθ∞(θ0; Xdog/fish; ΔXTED)
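To make the definition of Δθ∞ concrete, here is a rough sketch that computes it the brute-force way: train once on X, train again from the same θ0 on X+ΔX, and subtract. All of the specifics are assumptions for illustration. The model is a three-parameter linear regressor rather than an image classifier, `make_dataset`, `perturb`, and `train` are hypothetical helpers, and "rescale feature 0 by 1.5" is a crude stand-in for "photoshop every dog to have three eyes instead of two".

```python
import random

def make_dataset(n=100, seed=0):
    """Hypothetical dataset X: random 3-feature inputs with noiseless linear targets."""
    rng = random.Random(seed)
    true_w = [1.0, -2.0, 0.5]
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        y = sum(w * xi for w, xi in zip(true_w, x))
        rows.append((x, y))
    return rows

def perturb(X, feature, scale):
    """Delta-X: rescale one input feature in every example, leaving targets alone."""
    changed = []
    for x, y in X:
        x2 = list(x)
        x2[feature] *= scale
        changed.append((x2, y))
    return changed

def train(theta0, X, t=500, lr=0.05):
    """Approximate theta_inf(theta0; X) with t steps of full-batch gradient descent."""
    theta = list(theta0)
    n = len(X)
    for _ in range(t):
        grad = [0.0] * len(theta)
        for x, y in X:
            err = sum(w * xi for w, xi in zip(theta, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / n
        theta = [w - lr * g for w, g in zip(theta, grad)]
    return theta

theta0 = [0.0, 0.0, 0.0]
X = make_dataset()
X_changed = perturb(X, feature=0, scale=1.5)

theta_inf = train(theta0, X)
theta_inf_changed = train(theta0, X_changed)

# Delta-theta_inf(theta0; X; Delta-X), one entry per parameter.
delta_theta = [b - a for a, b in zip(theta_inf, theta_inf_changed)]
print(delta_theta)
```

In this linear toy the change lands almost entirely on the first parameter (it shrinks by a factor of 1.5 to compensate for the rescaled feature) while the other two barely move. That localisation is guaranteed by construction here, so it says nothing about whether a large network would behave the same way.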
Now let's consider how Δθ∞(θ0; X; ΔX) behaves. For lots of choices of ΔX it might just be a series of random changes tuning the whole set of θ values. But from my knowledge of neural networks, it might not be. Lots of image recognizing networks have been found to contain neurons with specific functions which relate to structures in the data, from simple line detectors, all the way up to "cityscape" detectors.
For this reason I suggest the following hypothesis:
Impracticalities and Solutions
Now it would be lovely to train all of GPT-3 twice, once with the original dataset, and once in a world where dogs are blue. Then we could see the exact parameters that lead it to return sentences like "the dog had [chocolate rather than azure] fur". Unfortunately rewriting the whole training dataset around this is just not going to happen.
Finding the flow of information and influence in a system is easy if you have a large distribution of different inputs and outputs (and a good idea of the direction of causality). If you have just a single example, you can't use any statistical tools at all.
So what else can we do? Well, we don't just have access to θ∞. In principle we could look at the course of the entire training process and how θ changes over time. For each timestep, and each element of the dataset X, we could record how much each element of θ is changed. We'll come back to this.
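The bookkeeping described here can be sketched in a few lines. This is only an illustration under assumed details: per-example SGD on a three-parameter linear model, with a table `log[j][i]` that accumulates the absolute change which dataset element j caused in parameter i. The function names and the toy "dog"/"fish" rows (feature 0 is only active for dogs, feature 1 only for fish, feature 2 is shared) are hypothetical.

```python
import random

def sgd_with_update_log(theta0, X, epochs=20, lr=0.1, seed=0):
    """Per-example SGD on squared error. Also returns log[j][i]: the total
    absolute change that dataset element j caused in parameter i."""
    rng = random.Random(seed)
    theta = list(theta0)
    log = [[0.0] * len(theta) for _ in X]
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for j in order:
            x, y = X[j]
            err = sum(w * xi for w, xi in zip(theta, x)) - y
            for i, xi in enumerate(x):
                step = -lr * 2 * err * xi
                theta[i] += step
                log[j][i] += abs(step)
    return theta, log

def total_by_group(log, members):
    """Sum the recorded changes over a group of dataset elements."""
    n_params = len(log[0])
    return [sum(log[j][i] for j in members) for i in range(n_params)]

rng = random.Random(1)
# Hypothetical toy rows: (features, target).
dogs = [([rng.uniform(0.5, 1.5), 0.0, 1.0], 1.0) for _ in range(5)]
fish = [([0.0, rng.uniform(0.5, 1.5), 1.0], -1.0) for _ in range(5)]
X = dogs + fish
dog_idx = list(range(0, 5))
fish_idx = list(range(5, 10))

theta, log = sgd_with_update_log([0.0, 0.0, 0.0], X)

print("dog examples moved each parameter by: ", total_by_group(log, dog_idx))
print("fish examples moved each parameter by:", total_by_group(log, fish_idx))
```

Here the dog rows never touch the fish-only parameter and vice versa, while both groups move the shared one. For a real network the table would have (dataset size × parameter count) entries, so in practice it would have to be aggregated or sampled in some way.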
Let's consider the dataset as a function of the external world: X(Ω). All the language we've been using about knowledge has previously only applied to the dataset. Now we can describe how it applies to the world as a whole.
For some things the equivalence of knowledge of X and Ω is pretty obvious. If the dataset is being used for a self-driving car and it's just a bunch of pictures and videos, then basically anything the resulting parameterised system knows about X it also knows about Ω. But for obscure manufactured datasets like [4000 pictures of dogs photoshopped to have three eyes], it's really not clear.
Either way, we can think about Ω as having influence over X the same way as we can think about X as having influence over θ∞. So we might be able to form hypotheses about this whole process. Let's go back to Xdog/fish. First off, imagine a change Ωnew=Ω+ΔΩ, such as "dogs have three eyes". This will change some elements of X more than others. Dog photos taken from certain angles, or of certain breeds, will be changed more. Photos of fish will stay the same!
Now we can imagine a function Δθ(θ0; X(Ω); ΔX(Ω; ΔΩ)). This represents some propagation of influence from Ω→X→θ. Note that the influence of Ω on X is independent of our training process or θ0. This makes sense because different bits of the training dataset contain information about different bits of the world. How different training methods extract this information might be less obvious.
The Training Process
During training, θ(t) is exposed to various elements of X and updated. Different elements of X will update θ(t) by different amounts. Since the learning process is about transferring influence over θ from θ0 to Ω (acting via X), we might expect that for a given element of X, it has more "influence" over the final values of the elements of θ which were changed the most due to exposure to that particular element of X during training.
This leads us to a second hypothesis:
Which is equivalent to:
For the dog-fish example: elements of parameter space which have updated disproportionately when exposed to photos of dogs that contain the dogs' heads (and therefore show just two eyes), will be more likely to contain "knowledge" of the fact that "dogs have two eyes".
This naturally leads us to a final hypothesis:
Therefore
Motivation
I think an AI which takes over the world will have a very accurate model of human morality; it just won't care about it. I think that one way of getting the AI to not kill us is to extract parts of the human utility-function-value-system-decision-making-process-thing from its model and tell the AI to do those. I think that to do this we need to understand more about where exactly the "knowledge" is in an inscrutable model. I also find thinking about this very interesting.