This seems wrong to me in some important ways (at least as general theoretical research advice). Like, some of the advice you give seems to anti-predict important scientific advances.
Generally, unguided exploration is seldom that useful.
Following this advice, for instance, would suggest that Darwin not go on the Beagle, i.e., not spend five years exploring the globe (basically just for fun) as a naturalist. But his experiences on the Beagle were exactly what led him to the seeds of natural selection, as he began to notice subtleties like how animals changed ever so slightly as one moves up a continent. It also seems like it screens out a bunch of Faraday’s experimental work on electricity, much of which he did because it seemed interesting or fun, rather than backchaining from some predetermined goal. Like, he has an entire lecture series on candles, which was mostly just him over and over saying “And isn’t it weird that this thing happens, too?? What happens if we change this?” And they’re great, and a lot of that exploratory work laid the groundwork for Maxwell’s later work on electromagnetism.
Cutting off research avenues that are fun to think about, but ultimately not that productive.
Similarly, I think this is one of the main failure modes with modern scientific research. When I look at academia one of the things I’m most hoping for is that people follow their taste more, and that they have more fun! Because often things that are open-ended and fun to play around with hold a deeper kind of logic that you’re attracted to, but haven’t articulated yet. If you only stick to things that seem immediately productive then you (roughly) never find truly novel or cool ideas. E.g., both Babbage and Shannon tinkered around with different coding type projects when they were younger (cipher cracking and barbed wire telegraphs, respectively), and I think it’s not crazy to assume that this sort of playing around with representing information abstractly may have helped with their later, more ambitious projects (general computers, information theory). Also, many Nobel prize winners say they wouldn’t have been able to do their seminal work in the current environment because, e.g., “Today I wouldn’t get an academic job. It’s as simple as that. I don’t think I would be regarded as productive enough.” (Higgs). Certainly, some things are dead ends and it can be a bit hard to know that in advance, but if you prematurely screen off all of them you screen off the great ideas, too.
I think Altman puts it nicely, here: “Good ideas—actually, no, great ideas are fragile. Great ideas are easy to kill…. All the best ideas when I first heard them sound bad. And all of us, myself included, are much more affected by what other people think of us and our ideas than we like to admit. If you are just four people in your own door, and you have an idea that sounds bad but is great, you can keep that self-delusion going. If you’re in a coworking space, people laugh at you, and no one wants to be the kid picked last at recess. So you change your idea to something that sounds plausible but is never going to matter. It’s true that coworking spaces do kill off the very worst ideas, but a band-pass filter for startups is a terrible thing because they kill off the best ideas, too.” (Emphasis mine). Likewise, I think it is perhaps quite load-bearing the way that many great scientists spent significant portions of their thinking years alone (famously, Newton did this when he came up with Principia, but Darwin and Shannon too, etc.)
On timescales of days and weeks, you should be able to point to concrete examples that constitute "units of progress" towards your final goal.
This also feels pretty wrong to me. Certainly that would be nice and perhaps something to try to aim for, but I don’t think it’s always the case and I don’t think the lack of it is that strong of evidence in favor of “not making progress.” Again, using Darwin as an example—after he noticed that species were mutable he spent about a year and a half trying to figure out why. He had one main insight a few months in—that breeders introduced changes via artificial selection—but he didn’t put it together for some time the way that nature could act as a selector. And in that year between “artificial” and “natural” selection, I would not say that he was making obvious, concrete progress on the solution because the solution wasn’t made from obvious steps. He had the right questions, and he read a lot, wrote a lot, talked to breeders, etc., but mostly he just held onto his confusion for a long time. And then one day in a flash of insight, shortly after reading Malthus, the solution came to him in a carriage ride. Certainly not all research looks like this, but I do think it’s an illustrative example of how good theoretical work can come out of non-obvious units of progress.
I know at the beginning you mentioned that this is advice for a particular kind of research from your perspective, and I do think that it’s useful in certain domains. But I worry it’s easy to forget, at the end of a document with many high-level tips, that it’s not general advice on how to do good theoretical alignment work, period. And because I do think that some of this advice anti-predicts great scientific work—in particular the sort that I think alignment is currently most lacking, and the sort that would be the most helpful, were we to have it—I wanted to push back a bit on the idea that many people might walk away with, i.e., that this is general advice for theoretical work in alignment.
A shrug of robed shoulders. "Where do new books come from, Mr. Potter? Those who read many books sometimes become able to write them in turn. How? No one knows."
Funny enough we do now have a pretty plausible model of how this works - in the form of GPT3 and similar LLMs (which are surprisingly similar to linguistic cortex for the same reason that large deep vision models are similar to visual cortex).
Train a big neural network (ANN or BNN) on sensory stream prediction of text and it ... generates text! In the human case this is just our internal monologue, which we (or some of us, with additional training) can then additionally steer/branch/backtrack/record/edit into higher quality stories because we also have general planning capability.
I'd be interested to know how much you (or other readers) think this content carries over to other areas of research that aren't so specifically "the kind of theory ARC does". For example;
On timescales of days and weeks, you should be able to point to concrete examples/algorithms that constitute “units of progress” towards your final goal.
Is "days or weeks" the right scale here for, say, research in computational complexity? Or other alignment research?
This was really helpful and fun to read. I'm sure it was nontrivial to get to this level of articulation and clarity. Thanks for taking the time to package it for everyone else to benefit from.
"Terrance Tao" should be "Terence Tao"
"while the x OR y would be bad" should maybe be "while 'x AND y' would be bad"?
Most of the content in this document came out of extensive conversations with Paul Christiano.
(This document describes one way of thinking about how to do one particular type of research. There are other ways to productively do this kind of research, and other productive kinds of research. Consider this document peppered with phrases like “from my perspective”, “I think”, “my sense is”, etc.)
A lot of people have a vague mental picture of what empirical research looks like, which consists of exploring data, articulating hypotheses about the data, and running experiments that potentially falsify the hypothesis. I think people lack a similarly mechanistic picture of what theoretical research looks like, which results in them not knowing how to do theory, being skeptical of the possibility of theoretical progress, etc. I do think the difficulty of getting high-quality real-world feedback makes theoretical research more difficult than empirical research, but I think it’s possible to get enough real-world feedback when doing theory that you can still expect to make steady progress.
Currently, I think many people think of theory as “someone sits in a room and has a brilliant insight to solve the problem.” Instead, I think a more accurate picture is very similar to the picture one has for empirical research: the theorist explores some data, gradually builds an intuitive sense of what’s going on, articulates hypotheses that capture their intuition, and falsifies their hypotheses by testing them against data, all the while iteratively building up their understanding.
The key difference between the empirical researcher and the theoretical researcher is that while the empirical researcher can build intuition and falsify hypotheses by considering real-world data, the theorist, although ultimately grounded in the real world, must build intuition and falsify hypotheses by considering thought experiments, simple toy examples, computations, etc. Even mathematics, the purest of intellectual pursuits, roughly follows this process of iteration. Terence Tao:
How to do research
The Research methodology section of the ELK report and Paul’s My research methodology both articulate a high-level picture of the basic theoretical research loop, but lack details about what steps one takes besides “propose solutions” and “generate counterexamples”.
To lend more color to these vague descriptions and provide examples comprehensible to readers without an extensive background in number theory, I will describe what I think of as “modes” of research, articulate some key questions that get asked, and provide (stylized) historical examples related to ELK. I will approach this from the perspective of designing an algorithm, but this basic description will apply to many possible endeavors.
Suppose I’m trying to design an algorithm to accomplish task T (e.g. elicit latent knowledge) over a variety of situations S (e.g. ways my AI could look internally). My goal as a theorist is to develop a sufficiently accurate, unified, and precise intuition of how I hope to accomplish task T in every situation S such that I can just formalize the rules I’m using in my head into an algorithm and my problem has been solved. You might also say that the goal of someone trying to discover the laws of nature is to develop a precise enough model of nature in their head that they can just write it down and they have a law of nature.
This process of algorithm development roughly proceeds as follows:
I think of this process as roughly having 4 key modes:
I will describe these modes as happening in sequence, but in practice they’re all happening at the same time, just with different amounts of emphasis. It’s common to spend a day or two in a particular mode. I would begin worrying if I spent more than ~three days trying to do a particular step in isolation, without the feedback loop that comes from transitioning between steps.
Figuring out what you want to happen in real-world cases
Ultimately, your intuition for what you want your algorithm to do has to be anchored on what you actually wanted to happen in the real world. The way to develop this intuition is to roughly “solve” the task by hand in many examples, and let the examples wash over you and percolate into your intuition. The key move in this research mode is asking questions of the form “Suppose the world was in situation S, what would accomplishing task T look like?”
Questions that I find helpful to ask myself:
Your goal in this mode is to develop an extremely precise sense of exactly what you do/don’t want to happen in a handful of cases to serve as the final arbiter for whether you’ve succeeded at developing an algorithm. It’s generally okay to be extremely unsure of what you want to happen in a large number of cases/not know exactly how to handle a lot of scenarios. However, it’s often worth spending some time trying to articulate high-level hopes for various cases that you’re confused about how to handle (e.g. Indirect normativity: defining a utility function)
ELK Examples
Historically, we started by considering cases like:
A small breakthrough was when we articulated the Game of Life Example, which gave us an extremely precise sense of what we wanted to happen in at least one case.
Other cases that one might consider:
Potshot algorithms
Ideally, once you’ve developed a precise sense of what you want to happen for real-world cases, you can iterate on algorithms by checking to see if they do what you want. Unfortunately, real-world cases are often too complicated, which means that:
It’s still often worth trying to directly solve real-world cases by proposing “potshot” algorithms. Here are two reasons:
Translating what you want in real-world cases into desiderata for simple cases
Once you’re satisfied that potshot approaches are unlikely to work, the hard work of iteration can begin. Since you can’t iterate against real world cases, you must develop a sense of what you want to happen in cases simple enough that you can work through by hand. The way to do this is to try to build simple toy models of real-world situations, and then transfer your hard-won intuition about what you want to happen in those cases onto the simple toy cases.
Questions that I find helpful to ask myself:
Your goal in this mode is to develop an extremely precise sense of what you want to happen for a handful of examples that are simple enough for you to write down formally, evaluate by hand, etc. Again, it’s generally okay to be extremely unsure of what you want to happen in all but a handful of cases.
ELK Example
I’m currently considering cases like:
It seems clear that 'd' would be a good direct translator in this case, while 'x AND y' would be bad. But is d AND ¬hx AND ¬hy also an acceptable direct translator?
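One way to make the question above concrete is to enumerate where the two candidate translators actually disagree. The sketch below treats d, hx, and hy as plain booleans, which is my assumption; the function names are hypothetical and the code says nothing about which answer is the right one.

```python
from itertools import product

# Assumption: d, hx, hy are opaque boolean variables taken from the example
# in the text. The function names are hypothetical, for illustration only.
def translator_d(d, hx, hy):
    """Candidate translator that reports d directly."""
    return d

def translator_d_guarded(d, hx, hy):
    """Candidate translator that reports d AND (NOT hx) AND (NOT hy)."""
    return d and not hx and not hy

# Collect every assignment on which the two candidates give different answers.
disagreements = [
    (d, hx, hy)
    for d, hx, hy in product([False, True], repeat=3)
    if translator_d(d, hx, hy) != translator_d_guarded(d, hx, hy)
]

for d, hx, hy in disagreements:
    print(f"d={d}, hx={hx}, hy={hy}")
```

The two candidates only differ when d is true and at least one of hx or hy is true, so those are the cases where you need a precise sense of what you want to happen.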
This example can be extended in a few ways:
Articulating an algorithm for solving simple cases
Once you have a sufficiently precise sense of what you want to happen, you want to articulate a general algorithm that doesn’t special-case the cases where you know what you want, but nevertheless has the desired behavior anyway. The way to do this is to consider the reasoning that let you decide what you wanted and try to develop underlying rules or natural generalizations. Often, the process of trying to articulate a general algorithm will point out ambiguities in your sense of what you want to happen, leading to substantial revision/sharpening of your intuition.
Questions that I find helpful to ask myself:
Often, when you try to articulate the general rules behind what you’re doing to “solve” simple cases, you’ll find that you were accidentally special casing one of those simple cases, and you can’t quite see the connection between what you did in case 1 and what you did in case 2. In these situations, you can:
Generally, I think that if you can intuitively “solve” every case, then your intuition must be reliably executing some algorithm that solves every case. Your job is to just sharpen your intuition until it has unified, and extract the algorithm it’s executing.
ELK Examples
Unfortunately, describing examples in detail would require too much context for me to write down :(.
Finding cases where your algorithm doesn’t do what you want
After you have an algorithm, you want to articulate a case where it doesn’t do what you want. This is generally much easier than other parts of the process, because you have a precise sense of what you want and a precise algorithm.
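When the cases are small enough to write down formally, this step can sometimes be mechanized. Here is a minimal sketch, assuming the cases are tuples of booleans and that both the desired behavior and the candidate algorithm are ordinary functions; `find_counterexample`, `desired`, and `candidate` are hypothetical names, and the toy spec has nothing to do with ELK.

```python
from itertools import product

def find_counterexample(candidate, desired, num_inputs):
    """Return the first boolean case where candidate and desired disagree, or None."""
    for case in product([False, True], repeat=num_inputs):
        if candidate(*case) != desired(*case):
            return case
    return None

# Hypothetical toy spec: we want "exactly one of a, b is true".
def desired(a, b):
    return a != b

# Hypothetical candidate that looks plausible but mishandles one case.
def candidate(a, b):
    return a or b

counterexample = find_counterexample(candidate, desired, 2)
print(counterexample)
```

Exhaustive search like this only works for toy cases, but it mirrors the by-hand loop: hold a precise spec in one hand and a precise algorithm in the other, and look for the case where they come apart.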
Questions that I find helpful to ask myself:
ELK Examples
Again, unfortunately all the detailed examples I can think of require too much context for me to write down. Eliciting Latent Knowledge has many “worst-case” counterexamples to “potshot” algorithms that might give a general feel, but you typically want to be working more precisely than that.
Other random tips