Epistemic Status: This is mostly just something that has been bouncing around in my mind; it is not a detailed, well-researched essay with many citations about modern interpretability. I go back and forth between thinking this is so obvious that everyone here already implicitly agrees with it, and thinking it was handily debunked a while ago and I just missed the memo.
I am able to solve a Rubik’s cube. This isn’t actually an impressive feat, but a significant fraction of the population can’t do it despite exerting all their effort when given the puzzle. I’m not especially fast, I never got into speed cubing, but I can comfortably solve the puzzle in under 3 minutes, whereas someone unfamiliar with the cube will struggle for much longer than that and probably give up before solving it. What at least some people (who can’t solve the Rubik’s cube) don’t understand is that my approach to solving it is completely different from theirs. There are three very different kinds of things that the term “solving” can apply to.

The first is the naive attempt I took when I was a kid: solving it like any other puzzle. You see the blue side is not all blue and you try to make it that way. Then, when you get that, you try to make the white side, and so on. (If you’re more clever you may attempt to solve based on layers, but I digress.) You want to get a particular scrambled state to the one solved state by whatever means necessary, usually by thinking quite hard about each move. I, at least as a child, did not have very much success with this method.

However, you might try something very different. You might look for sequences of moves that change the positions of the pieces in predictable and useful ways. Once you build up a collection of these sequences that can manipulate different pieces, you can apply them without thinking deeply about each move in particular; you just select the sequence of moves that gets you from A to B without actually thinking about the messy middle. It is very hard to think through how to rotate an edge piece without messing up the rest of the cube, but once you have done it you can use the sequence over and over again, and now you just need to see when you ought to rotate an edge.

Now the third meaning of the word “solve” becomes clear. It involves none of the hard thinking about each move, or even the hard thinking to find the useful sequences of moves; it only involves memorizing the sequences and applying them when needed. This last one is what I, and most people who know how to solve Rubik’s cubes, actually do. When people are impressed by me solving it, they mistakenly think I used the first or second method when I certainly did not.
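To make the contrast concrete, here is a minimal toy sketch in Python. It uses a tiny made-up permutation puzzle as a stand-in for the cube (all the names and the puzzle itself are hypothetical), and it collapses method 2 down to "do the search once, ahead of time, and write the answers down" rather than discovering reusable sub-sequences. The point being illustrated is the same, though: at solve time, method 3 does no reasoning at all.

```python
# A toy illustration of the three senses of "solve", using a tiny permutation
# puzzle instead of a real cube. Everything here is a hypothetical stand-in,
# not an actual cube solver.

from itertools import permutations

SOLVED = (0, 1, 2, 3)

# The only allowed "moves": cycle the first three pieces, or the last three.
def move_a(state):
    return (state[2], state[0], state[1], state[3])

def move_b(state):
    return (state[0], state[3], state[1], state[2])

MOVES = {"A": move_a, "B": move_b}

# Method 1: think hard about every move -- depth-limited search from scratch.
def solve_by_search(state, depth=0, max_depth=8):
    if state == SOLVED:
        return []
    if depth == max_depth:
        return None
    for name, fn in MOVES.items():
        rest = solve_by_search(fn(state), depth + 1, max_depth)
        if rest is not None:
            return [name] + rest
    return None

# Method 2 (done once, ahead of time): discover and memorize a sequence for
# every reachable position. Method 3 is then just table lookup at solve time.
MEMORIZED = {start: solve_by_search(start) for start in permutations(range(4))}

def solve_by_memory(state):
    return MEMORIZED[state]  # no reasoning at solve time, just recall

scrambled = (1, 2, 0, 3)
print(solve_by_search(scrambled))  # searches every time it is called
print(solve_by_memory(scrambled))  # instant: the thinking already happened
```

The point is just that `solve_by_memory` does none of the work itself; all the effort lives in the table it was handed, and from the outside the two solvers look equally competent.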
Now, how does this all apply to AI? I want you to imagine the modern machine learning paradigm. Which of these three approaches does it most closely resemble? I would argue that the way machines think is very similar to the third. I will admit I have a limited understanding of machine learning and modern interpretability tools, but the smallish systems I have played with and seen others use don’t give off the impression of thinking very hard about each move (in whatever space is available, not necessarily puzzle solving) or of deliberate experimentation to find useful sequences. It seems much closer to memorizing things that work well and repeating them in situations similar to ones seen before. Explicit reasoning is hard, realizing that you can find a sequence of moves once and never think about it again is hard, remembering the correct sequence of moves is easy. Yes, AI can learn novel, complicated things while it is training, but learning while training is very different from learning while in use. I haven’t really seen anything where an AI system continuously learns at any appreciable speed aside from this, and that is a lone guy working on something way outside the traditional paradigm.

I guess what I’m getting at is that AI gives off the illusion of intelligence. I would wager that LLMs are “thinking” in the same way I was “solving”. That’s not to say AI that works this way can’t be dangerous or that it won’t ever produce novel insights. It is only to say that even if ChatGPT can get an 80 on an IQ test, or ace the MCAT, or pass the bar, or whatever the headline is for whatever the current best LLM is, the way it did so was so different and so alien from the way a human would approach the problem that I’m not sure it’s even a meaningful comparison. The pattern matching it is doing goes from correctly solved problems to an unsolved problem, not from first principles to a solution. That analogy might be a little extreme and likely isn’t very accurate. Still, I think the point stands. When looking at AI, we are like people who don’t know how to solve Rubik’s cubes. Some of us can. But almost all of us think that machines think like we do.

I hear your objections already: what about The Design Space of Minds-In-General, you say. I think it is one of those things that we can discuss dispassionately but don’t actually feel. LLMs might not be able to pass the Turing test, but they’re definitely really close. What I’m arguing is that we’re all dazzled because everything else that could pass that test has been an intelligent creature capable of logic, with a rich inner dialogue, emotions, goals, and so on. If something didn’t have these traits, it couldn’t handle the puzzle of a simple human conversation. So, for all of human history, it made sense to think of anything you had a conversation with as a being that had solved conversation either by carefully considering its words or by figuring out how to behave in a given kind of situation and then putting the conversation on autopilot once it was in one. But now, something entirely new has come about. LLMs are beings that didn’t go through either of the first two steps; they get the luxury of memorization without the trouble of ever actually considering their words, but we still feel like they are thinking. So it is with LLMs as it is with most AI. We are impressed because we took the hard way to get there. We are worried because we see progress being made at an alarming rate, faster than a human could ever make it.
But they learn in a different way, and that method may not generalize to future problems. All of the work a human would have to do to solve a problem gets compressed into the training and is not accessible again to the model. I believe that current AI systems learn less from each given input but make up for this by being able to go through those inputs much faster. MNIST has 60,000 images of handwritten digits. I don’t know how long it takes a human to get the hang of that classification, but it takes OOMs fewer examples. Yes, the human brain is vastly more complex than the models used to classify digits, but I doubt making a model’s architecture more complex would lead to faster learning in the modern machine learning paradigm. (This is actually testable: if an extremely large model can reach something like >90% accuracy on MNIST with only 100 examples, I would be very interested to hear that and happy to admit I am wrong.)
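Here is a rough sketch of what that test might look like, assuming scikit-learn and the public mnist_784 dataset on OpenML. The logistic-regression classifier is just a small placeholder standing in where the actual proposal would plug in an extremely large (ideally pretrained) model.

```python
# Sketch of the proposed test: train on only 100 MNIST examples, then measure
# accuracy on a held-out test set. The small classifier below is a stand-in
# for the "extremely large model" in the proposal.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Download MNIST: 70,000 handwritten digits, 28x28 = 784 pixels each.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10_000, random_state=0, stratify=y
)

# Keep only 100 labelled training examples (10 per digit).
rng = np.random.default_rng(0)
idx = np.concatenate(
    [rng.choice(np.where(y_train == d)[0], size=10, replace=False)
     for d in np.unique(y_train)]
)
X_small, y_small = X_train[idx], y_train[idx]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_small, y_small)

print(f"Test accuracy with 100 training examples: {clf.score(X_test, y_test):.3f}")
```

I’d expect a simple stand-in like this to land well below the >90% bar; the interesting question is whether swapping in a very large model changes that.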
To summarize so far, I have explained my idea that there are at least three ways of solving problems. The first is to think quite hard about what the best next step is each time you take a step. You can think multiple moves in advance, but you aren’t coming up with a general strategy for solving the problem, just looking for one particular solution. The next is to find patterns in the problem that allow you to offload some of your thinking at each step. You notice that a certain process is generally useful and you remember how to do it; once you’ve figured the process out, you don’t have to relearn it each time. This can also be a kind of fuzzy thought. I’m not a huge chess fan, but promoting your pieces in a given game is very different from deciding that promoting your pieces is generally a better strategy than building a bunker around your king and waiting. The third way is when you don’t actually do any of the thinking from the first two: you are simply taught that certain strategies or processes are good and you carry them out when applicable. I argue that AI systems in general are almost always doing things the third way, and that this makes them seem much smarter than they actually are. We think they are smart because we had to do the intellectual work to achieve those results, whereas they can just apply strategies they never came up with (strategies they learned through training but did not explicitly reason out as the best course of action). I have also argued that AI learning is slow in the sense that these systems rely on vastly more data than humans do to learn a given task. I believe this is evidence that they are not thinking through problems the way humans do, but rather need to have seen a very similar problem and a very similar solution in the past to get much done. (I know this is a gross generalization; I know the training process can produce novel and surprising solutions; I am only claiming this is true of most models when they are strictly in the testing/deployment stage.)
Now this leads to two very natural questions in AI safety: is an AI that intuits solutions without any formal thinking more dangerous than one that can explicitly reason about the world around it, and is an AI trained under the current paradigm likely to end up using formal logic instead of intuiting answers?
For the first, I could really see it going in either direction. On the one hand, a strategy like turning the whole planet into a massive storage device so you can continually increment an arbitrary score is extremely alien to us and very different from what we were actually trying to train the AI to do. This strategy presumably requires multiple steps of explicit reasoning: that the AI is itself a machine capable of being hacked as well as of hacking, that hacking could be a superior strategy either through wireheading or through manipulation of other machines, that it can acquire the skills necessary to hack by enlisting another AI, by building and training one, or by learning them itself, and then actually choosing what to hack and how. Is there a wall in parameter space that prevents gradient descent from ever reaching such a strategy? Will it be stuck in a local minimum, without formal reasoning, forever? I simply don’t know, but I would be very interested in finding out. It would be neat if there were another AI winter because gradient descent turned out to be so different from how humans learn that it simply can’t produce human-level intelligence. I wouldn’t bet much on that idea though. On the other hand, it seems foolish not to fear an agent that can intuit advanced physics, biology, computer science, and human behavior. Such an agent with poorly specified, overly simplistic, or malicious goals could probably pose a serious x-risk even if all of humanity were tasked with doing novel research and being creative in ways the AI couldn’t. Even if the AI only took actions we could predict (making nukes or bioweapons, manipulating the public into giving it absolute political power, hacking important computer systems and holding them hostage, etc.), that doesn’t mean we could stop it. A chess engine seems to intuit the next move because we moved past explicitly coding its decisions, but this only makes it more able to beat us.
The other question is perhaps just as unclear. I know I’ve said AI seems to be intuiting things rather than explicitly reasoning through them, but things like competing in Math Olympiads call this into doubt for me. I think it is unclear whether the AI is thinking through these problems the way a human would, and I can’t confidently assert that it isn’t thinking in the first or second way I described in my Rubik’s cube example. I would be very interested to hear anything about how AI solves problems rather than what problems AI can solve.