AI Alignment is my motivating context but this could apply elsewhere too.

The nascent field of AI Alignment research is pretty happening these days. There are multiple orgs and dozens to low hundreds of full-time researchers pursuing approaches to ensure AI goes well for humanity. Many are heartened that there's at least some good research happening, at least in the opinion of some of the good researchers. This is reason for hope, I have heard.

But how do we know whether or not we have produced "good research?"

I think there are two main routes to determining that research is good, and yet only one applies in the research field of aligning superintelligent AIs. 

"It's good because it works"

The first and better way to know that your research is good is that it allows you to accomplish some goal you care about[1]. Examples:

  • My work on efficient orbital mechanics calculation is good because it successfully lets me predict the trajectory of satellites.
  • My work on the disruption of cell signaling in malign tumors is good because it helped me develop successful anti-cancer vaccines.
  • My work on solid-state physics is good because it allowed me to produce superconductors at a higher temperature and lower pressure than previously attained.[2]

In each case, there's some outcome I care about pretty inherently for itself, and if the research helps me attain that outcome it's good (or conversely if it doesn't, it's bad). The good researchers in my field are those who have produced a bunch of good research towards the aims of the field.

Sometimes it's not clear-cut. Perhaps I figured out some specific cell signaling pathways that will be useful if it turns out that cell signaling disruption in general is useful, but that depends on therapies currently in trials, and we might not know how good (i.e. useful) my research was for many more years. This actually takes us into what I think is the second meaning of "good research".

"It's good because we all agree it's good"

If our goal is successfully navigating the creation of superintelligent AI in a way such that humans are happy with the outcome, then it is too early to properly score existing research on how helpful it will be. No one has aligned a superintelligence. No one's research has contributed to the alignment of an actual superintelligence.

At this point, the best we can do is share our predictions about how useful research will turn out to be. "This is good research" = "I think this research will turn out to be helpful". "That person is a good researcher" = "That person produces much research that will turn out to be useful and/or has good models and predictions of which research will turn out to help".

To talk about the good research that's being produced is simply to say that we have a bunch of shared predictions that there exists research that will eventually help. To speak of the "good researchers" is to speak of the people whose work lots of people agree is likely helpful and whose opinions are deemed likely correct.

Even if the predictions are based on reasoning that we scrutinize and debate extensively, they are still predictions of usefulness and not observations of usefulness.

Someone might object that there's empirical research we can see yielding results, in terms of interpretability/steering or demonstrating deception-like behavior and similar. While you can observe an outcome there, it's not the outcome we really care about, namely aligning superintelligent AI, and the relevance of this work to that outcome is still just a prediction. It's like succeeding at certain kinds of cell signaling modeling before we're confident that cell signaling disruption is a useful approach.

More like "good" = "our community pagerank Eigen-evaluation of research rates this research highly"

It's a little bit interesting to unpack "agreeing that some research is good". Obviously, not everyone's opinion matters equally. Alignment research has new recruits and it has its leading figures. When leading figures evaluate research and researchers positively, others will tend to trust them. 

Yet the leading figures are only leading figures because other people agreed their work was good, including before they were leading figures with extra vote strength. But now that they're leading figures, their votes count extra.

This isn't that much of a problem though. I think the way this operates in practice is like an "Eigen" system such as Google's PageRank and the proposed ideas of Eigenmorality and Eigenkarma[3].

Imagine everyone starts out with equal voting strength in the communal research evaluation. At t1, people evaluate research and the researchers gain or lose respect. This in turn raises or lowers their vote strength in the communal assessment. With further timesteps, research-respect accrues to certain individuals who are deemed good or leading figures, and whose evaluations of other research and researchers are deemed especially trustworthy.
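
To make that picture concrete, here's a minimal toy sketch of the kind of computation I have in mind (purely illustrative; the ratings matrix, damping factor, and everything else are made up for the example, not anything the community actually runs): researchers rate each other's work, ratings are weighted by the rater's current standing, and the process is iterated until standing converges, PageRank-style.

```python
import numpy as np

def eigen_evaluation(ratings: np.ndarray, damping: float = 0.85, steps: int = 100) -> np.ndarray:
    """ratings[i][j] = how positively researcher i rates researcher j's work."""
    n = ratings.shape[0]
    # Normalize each rater's outgoing ratings so everyone starts with one "vote".
    row_sums = ratings.sum(axis=1, keepdims=True)
    transition = ratings / np.clip(row_sums, 1e-12, None)

    standing = np.full(n, 1.0 / n)  # equal voting strength at t0
    for _ in range(steps):
        # Respect flows along ratings, weighted by the current standing of the rater.
        standing = (1 - damping) / n + damping * (standing @ transition)
    return standing

# Toy example: researcher 2 is rated highly by everyone and accrues the most
# standing, which in turn makes researcher 2's own ratings count for more.
ratings = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
])
print(eigen_evaluation(ratings))
```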

Name recognition in a rapidly growing field where there isn't time for everyone to read everything likely functions to entrench leading figures and canonize their views.

In the absence of the ability to objectively evaluate research against the outcome we care about, I think this is a fine way, maybe the best way, for things to operate. But it admits a lot more room for error.

Four reasons why tracking this distinction is important

Remembering that we don't have good feedback here

Operating without feedback loops is pretty terrifying. I intend to elaborate on this in future posts, but my general feeling is that humans are generally poor at making predictions several steps out from what we can empirically test. Modern science is largely the realization that to understand the world, we have to test empirically and carefully[4]. I think it's important to not forget that's what we're doing in AI alignment research, and recognizing that "good alignment research" means "predicted to be useful" rather than "concretely evaluated as useful" is part of that.

Staying alert to degradations of the communal Eigen-evaluation of research

While in the absence of direct feedback this system makes sense, I think it works better when everyone's contributing their own judgments and starts to degrade when it becomes overwhelmingly about popularity and who defers to who. We want the field more like a prediction market and less like a fashion subculture.

There's less incentive to try very different ideas, since even if those ideas would work eventually, you won't be able to prove it. Consider how a no-name could come along and prove their ideas about heavier-than-air flight were correct just by building a contraption that clearly flies; convincing people your novel conceptual alignment ideas are any good is a much longer uphill battle.

Maintaining methods for top new work to gain recognition

Those early on the scene had the advantage that there was less stuff to read back then, so it was easier to get name recognition for your contributions. Over time, there's more competition, and I can see work of equal or greater caliber having a much harder time getting broadly noticed. Ideally, we've got curation processes in place that mean someone could, even now, become a leading figure as respected as those of yore for work of about equal goodness (as judged by the eigen-collective, of course).

Some final points of clarification

  • I think this is a useful distinction pointing at something real. Better handles for the types of research evaluation might be direct-outcome-evaluation vs communal-estimation-prediction.
  • This distinction makes more sense where there's an element of engineering towards desired outcomes vs a more purely predictive science.
  • I haven't spent much time thinking about this, but I think the distinction applies in other fields where some of the evaluation is direct-outcome and some is communal-estimation. Hard sciences lean more toward direct-outcome evaluation, while social sciences rely more on communal-estimation.
    • AI Alignment is just necessarily at an extreme end of the split between the two.
    • For fields that can evaluate empirically their final outcome at all, there's maybe a kind of "slow feedback loop" that periodically validates or invalidates the faster communal-estimation that's been happening.
  • In truth, you actually never fully escape communal evaluation, because even with concrete empirical experiments, the researcher community must evaluate and interpret the experiments within an agreed-upon paradigm (via some Eigen-evaluation process, also thanks Hume). However, the quantitative difference gets so large it is basically qualitative.
  • There are assumptions under which intermediary results (e.g. a bunch of SAE outputs) in AI Alignment are more valuable and more clearly constitute progress. However, I don't think they change the field from being fundamentally driven by communal-estimation. They can't, because belief in the value of intermediary outputs and associated assumptions is itself coming from contested communal-estimation, not something validated with reference to the outcomes.
    • I can imagine people wanting to talk about timelines and takeoff speeds here as being relevant. At the end of the day, those are also still in the communal-estimation category, and questions on which the community disagrees.
  • I think it's a debate worth having about how good vs bad the communal estimation is relative to direct-outcome evaluation. My strongest claim in this post is that this is a meaningful distinction. It's a secondary claim for me that communal-estimation is vastly more fallible, but I haven't actually argued that with particular rigor in this post.
  • I first began thinking about all of this when trying to figure out how to build better infrastructure for the Alignment research community. I still think projects along the lines of "improve how well the Eigen-evaluation process happens" are worth effort.
    • Thinking "Eigen-evaluation" caused me to update on the value of mechanism not just of people adding more ideas to the collective, but also how they critique them. For example, I've updated more in favor of the LessWrong Annual Review for improving the community's Eigen-evaluation.

 

  1. ^

    Arguably most scientific work is simply about being able to model things and make accurate predictions, regardless of whether those predictions are useful for anything else. In contrast to that, alignment research is more of an engineering discipline, and the research isn't just about predicting some event, but being able to successfully build some system. Accordingly, I'm choosing examples here that also sit at the juncture between science and engineering.

  2. ^

    Yes, I've had a very diverse and extensive research career.

  3. ^

    I also model social status as operating similarly.

  4. ^

    Raemon's recent post provides a cute illustration of this.

  5. ^

    A concrete decision that I would make differently: in a world where we are very optimistic about alignment research, we might put more effort into getting those research results put to use in frontier labs. In contrast, in pessimistic worlds where we don't think we have good solutions, overwhelmingly effort should go into pauses and moratoriums.

Comments
[-]TsviBT

I didn't read most of the post but it seems like you left out a little known but potentially important way to know whether research is good, which is something we could call "having reasons for thinking that your research will help with AGI alignment and then arguing about those reasons and seeing which reasons make sense".

[-]Ruby

Not left out! That's what "we agree"/"eigen-evaluation" consists of.

My point is that's crucially different from when we're able to directly observe the research being useful for our goal.

It's almost orthogonal to eigen-evaluation. You can arrive at consensus in lots of ways.

[-]Ruby

Can you give an example you think is a different way? My guess is I will consider it to fall under eigen-evaluation too.

A different way of arriving at consensus? I'm kind of annoyed that there's apparently a practice of not proactively thinking of examples, but ok:

  1. If ~everyone is deferring, then they'll converge on some combination of whoever isn't deferring and whatever belief-like objects emerge from the depths in that context.
  2. If ~everyone just wishes to be paid and the payers pay for X, then ~everyone will apparently believe X.
  3. If someone is going around threatening people to believe X, then people will believe X.
[-]Ruby

Deferring is straightforwardly a thing you can do with your "vote" in eigen-evaluation. As I wrote:

While in the absence of direct feedback this system makes sense, I think it works better when everyone's contributing their own judgments and starts to degrade when it becomes overwhelmingly about popularity and who defers to who.

Perhaps the word "evaluation" in there is what's misleading.

Being paid or threatened feel so degenerate (and to result in professions of belief) that I hadn't really considered them. Still, suppose there are different people paying or voting in different directions, I think how those net out in what's regarded as "good" will be via an eigen-evaluation process.

On second thought, I do think payment/coercion might be what "people believe X because it's advantageous" is equivalent to. For example, they end up favoring views/research X because that gets you more access to resources (researchers in X are better funded, etc).

[-]Ruby

Meta: I think it's good to proactively think of examples if you can, and good to provide them too.

My position is approx "whenever there's group aggregate belief, it arises from an eigen- process". (True even when you've got direct-evaluation, though so quantitatively different as to be qualitatively different.)

Predicting that whatever you say will also count as eigen-evaluation according to me makes it hard to figure out what you think isn't.

ETA: This perhaps inspires me to write a post arguing for this larger point. Like it's the same mechanism with "status", fashion, and humor too.

[-]TsviBT

If everyone calculates 67*23 in their head, they'll reach a partial consensus. People who disagree with the consensus can ask for an argument, and they'll get a convincing argument which will convince them of the correct answer; and if the argument is unconvincing, and they present a convincing argument for a different answer, that answer will become the consensus. We thus arrive at consensus with no eigening. If this isn't how things play out, it's because there's something wrong with the consensus / with the people's epistemics.

[-]Ruby

Hmm, okay, I think I've made an update (not necessarily to agree with you entirely, but still an update on my picture, so thanks).

I was thinking that if a group of people all agree on particular axioms or rules of inference, etc., then that will be where eigening is occurring, even if, given sufficiently straightforward axioms, the group members will achieve consensus without further eigening. But possibly you can get consensus on the axioms just via selection and via individuals using their inside-view to adopt them or not. That's still a degree of "we agreed", but not eigening.

[-]Ruby

Huh. Yeah, that's an interesting case which yeah, plausibly doesn't require any eigening. I think the plausibility comes from it being a case where someone can so fully do it from their personal inside view (the immediate calculation and also their belief in how the underlying mathematical operations ought to work). 

I don't think it scales to anything interesting (def not alignment research), but it is conceptually interesting for how I've been thinking about this. 

[-]TsviBT

No, it's the central example for what would work in alignment. You have to think about the actual problem. The difficulty of the problem and illegibility of intermediate results means eigening becomes dominant, but that's a failure mode.

[-]Ruby

Interesting to consider it a failure mode. Maybe it is. Or is at least somewhat.

I've got another post on eigening in the works, I think that might provide clearer terminology for talking about this, if you'll have time to read it.

I agree that eigening isn't the key concept for alignment or other scientific processes. Sure, you could describe any consensus that way, but consensuses could be either very good or just awful depending on how much valid analysis went into each step of that eigening. In a really good situation, progress toward consensus is only superficially describable as eigening. The real progress is happening by careful thinking and communicating. The eigening isn't happening by reputation but by quality of work. In a bad field, eigening is doing most of the work.

Referring to them both as eigening seems to obscure the difference between good and bad science/theory creation.

But yeah if you mean "I don't think it scales to successfully staking out territory around a grift" that seems right.

I'd guess "make sense" means something higher-bar, harder-to-achieve to you than it does to most people. Are there other ways to say "make sense" that pin down the necessary level of making sense you mean to refer to?

This is a reasonable question, but seems hard to answer satisfyingly. Maybe something with a similar spirit to "stands up to multiple rounds of cross-examination and hidden-assumption-explicitization".

Personally, I'd say that at least part of what I would categorize under 'make sense' is objective. Namely, that you have in your scientific proposal a mechanism of action and consequence which is logically coherent. As in, an evaluation of the abstract symbolic logic of the proposal. Meeting that should be a sort of 'minimum bar' for considering a scientific proposal as worth discussing. 

However, there will always be complications in the world which can't be simplified down to logical assertions, so that's only the start of the journey. Still, you should feel free to reject proposals which start out making arguments that contradict themselves.

[-]kave

I have temporarily given this post a title that seemed reasonable, as its title was broken. Hope it's OK as a placeholder, @Ruby!

[-]Ruby

Ah, hmm. I understand why what happened happened, though we should prevent it.

[-]Ruby

Thanks! Fixed.