AI for AI for Epistemics

owencb; Lukas Finnveden

We feel conscious that rapid AI progress could transform all sorts of cause areas. But we haven’t previously analysed what this means for AI for epistemics, a field close to our hearts. In this article, we attempt to rectify this oversight.

Summary

AI-powered tools and services that help people figure out what’s true (“AI for epistemics”) could matter a lot.

As R&D is increasingly automated, AI systems will play a larger role in the process of developing such AI-based epistemic tools. This has important implications. Whoever is willing to devote sufficient compute will be able to build strong versions of the tools, quickly. Eventually, the hard part won’t be building useful systems, but making sure people trust the right ones, and making sure that they are truth-tracking even in domains where that’s hard to verify.

We can do some things now to prepare. Incumbency effects mean that shaping the early versions for the better could have persistent benefits. Helping build appetite among socially motivated actors with deep pockets could enable the benefits to come online sooner, and in safer hands. And in some cases, we can identify particular things that seem likely to be bottlenecks later, and work on those directly.

Background: AI for epistemics

AI for epistemics — i.e. getting AI systems to give more truth-conducive answers, and building tools that help the epistemics of the users — seems like a big deal to us. Some past things we’ve written on the topic include:

These past articles mostly take the perspective of “how can people build AI systems which do better by these lights?”. But maybe we should be thinking much more about what changes when people can use AI tools to do increasingly large fractions of the development work!

The shift in what drives AI-for-epistemics progress

Right now, AI-for-epistemics tools are constrained by two main bottlenecks: the quality of the underlying AI systems, and whether people have invested serious development effort in building the tools to use those systems.

The balance of bottlenecks is changing. Two years ago, the quality of underlying AI systems was the central bottleneck. Today, it is much less so — many useful tools could probably work based on current LLMs. It is likely still a constraint on how good the systems can be, and will remain so for a while even as the underlying models get stronger, but it is less of a fundamental blocker. Development investment has therefore become a bigger bottleneck — there are a number of applications which we are pretty confident could be built to a high usefulness level today, and just haven’t been (yet).

But bottlenecks will continue to shift. AI is increasingly driving research and software development. As AI systems get stronger, it may become possible to turn a large compute budget into a lot of R&D. This could include product design, engineering, experiment design, direction-setting, etc. Actors with lots of compute could direct this towards building epistemic tools.

Therefore, as AI-driven R&D accelerates, other inputs to AI for epistemics are more likely to become key bottlenecks:

Compute. Automated R&D may require a lot of compute. This could be for inference (running the analogues of human researchers); for running experiments; and perhaps for training specialized AI systems. This means the actors who can build the best epistemic tools may be those with deep pockets.
Adoption and trust. Even very good tools don’t help if nobody uses them, or if the wrong people use them and the right people don’t. Adoption is partly a function of trust, and trust is partly a function of adoption — early tools shape what people come to rely on.
Ground truth evaluation. To make an epistemic tool good, you need some signal for what “good” means. This already shapes AI applications a lot — part of the reason coding agents are so good is that there’s great access to ground truth about what works.
- For some epistemic applications this is relatively straightforward (e.g. forecasting accuracy). For others it’s hard (e.g. what makes a conceptual clarification actually clarifying, rather than just satisfying?).
- Most tools can probably reach a certain degree of usefulness without running into this problem, just piggybacking on base models making generally sensible judgements.
- We can expect it to bite when you try to make them very good: if you don’t have a way of assessing quality, it could be hard to push to objectively excellent levels.
- One basic solution is to rely on human judgement: either via humans providing labels and demonstrations to train against, or via human developers exercising their judgement in other parts of the process (such as when defining scaffolds). But this becomes disproportionately more expensive as R&D becomes more automated.

These basic points are robust to whether R&D is fully automated, or “merely” represents a large uplift to human researchers. But the most important bottlenecks will vary across applications and will continue to shift over time.

What this unlocks

Automated R&D means that strong “AI for epistemics” tools could come online on a compressed timeline.

This is an exciting opportunity! Upgrading epistemics could better position us to avoid existential risk and navigate through the choice transition well.

If everything is moving fast, it may matter a lot exactly what sequence we get capabilities in. It may therefore be crucial to make serious investments in building these powerful applications (rather than wait until such time as they are trivially cheap).

Risks from rapid progress in AI for epistemics

There are also a number of ways that rapid (and significantly automated) progress in AI-for-epistemics applications could go wrong. We need to be tracking these in order to guard against them.

In our view, the two biggest risks are:

Epistemic misalignment: because of ground truth issues, powerful tools steer our thoughts in directions other than those which are truth-tracking, in ways that we fail to detect
Trust lock-in: if a lot of people buy into trusting tools or ecosystems that don’t deserve that trust, this might be self-perpetuating if these continue to recommend themselves

Epistemic misalignment

Depending on when they bite, ground truth problems as discussed above could be bottlenecks, or active sources of risk. They are bottlenecks if they prevent people from building strong versions of tools. They could become risks if the methods are good enough to allow for bootstrapping to something strong, but end up pointing in the wrong direction. This is essentially Goodhart’s law — we might get something very optimized for the wrong thing (and without even knowing how to detect that it’s subtly wrong).

In the limit, this could lead to humans or AI systems making extremely consequential decisions based on misguided epistemic foundations. For example, they might give over the universe to digital minds that are not conscious — or in the other direction, fail to treat digital minds with the dignity and moral seriousness they deserve. Wei Dai has written about this concern in terms of the importance of metaphilosophy. We agree that there is a crucial concern here.

This could come separately from or together with risks from power-seeking misaligned AI. Epistemic tools could be systematically misleading without being power-seeking. But if some AI systems are misaligned and power-seeking, there’s an additional concern where AI systems could mislead us in ways specifically designed to disempower us whenever we are unable to check their answers.

Some approaches to the ground truth problem may involve using AI systems to make judgements about things. This introduces a regress problem: how can we ensure that subtle errors in the first AI systems shrink rather than compound into worse problems as the process plays out? (We return to this in the interventions section below.)

Trust lock-in

Trust and adoption tend to reinforce each other — people adopt tools they trust, and widely-adopted tools accumulate trust. This is normally fine. It could become a problem if the tools that win early trust don’t deserve it, but incumbency effects make them hard to displace.

This could happen in several ways. An actor with a particular agenda could build something that purports to function as a neutral epistemic aid but is shaped to further their agenda by manipulating others. Or, less perniciously but perhaps more likely, an early-but-mediocre tool could accumulate trust and adoption before better alternatives exist, reinforced by commercial incentives which mean it talks itself up and rival tools down. In either case, the result could be an epistemic ecosystem that’s hard to dislodge even once better options are available.

Other risks

Those two risks are not the only concerns. We are also somewhat worried about epistemic power concentration (where whoever has the best epistemic tools leverages their information advantage into better financial or political outcomes, and continues to stay ahead epistemically), and epistemic dependency (where people relying on AI tools gradually atrophy in their critical reasoning — exacerbating other risks). There may be more that we are not tracking.

Interventions

What should people who care about epistemics be doing now, in anticipation of a world where AI-driven R&D can be directed at building epistemic tools?

Build appetite for epistemics R&D among well-resourced actors

If you need big compute budgets to build great epistemic tools, you’ll ideally want support from frontier AI companies, major philanthropic funders, or governments. But they may not currently see this as a priority. Building the case that this matters, and helping these actors develop good taste about which tools to prioritize and how to design them well, could shape what gets built when automated R&D becomes powerful enough to build it.

Anticipate future data needs

Some epistemic tools will need training data that doesn’t yet exist and may not be trivial to generate. There are three strategies here:

Collecting or creating data or training environments now for future use
- E.g. if you think you want access to a lot of human judgements about what wise decisions look like, you could go out and curate that dataset.
Establishing pipelines to collect data over time
- E.g. if you want to automate a certain type of research, you could record internal discussions from researchers working on this.
Designing processes for automated data creation
- E.g. if you could design a self-play loop where we have good reason to believe that scaling up compute will lead to genuinely truth-tracking performance, this could set the stage for later rapid improvement at the core capability.

The first two are especially great to work on now because they involve actions at human time-scales. (They may not be proportionately sped up by having more AI labor available.) The third is great to work on because there’s some chance that models will become capable of growing a lot from the right self-play loop before they become capable enough to come up with the idea themselves.

Figure out what could ground us against epistemic misalignment

If powerful epistemic tools could be subtly misaligned with truth-conduciveness in ways we can’t easily detect, we should figure out what this could look like! We expect this might benefit from a mix of theoretical work (what does it even mean for an epistemic tool to be well-calibrated in domains without clear ground truth?^[1]) and practical work (studying how current tools fail, building evaluation methods). Ultimately we don't have a clear picture of what the solutions look like, but this seems like an important topic and we are keen for it to get more attention soon.

Drive early adoption where adoption is the key bottleneck

For some applications, we might expect that the main constraint on impact will be whether anyone uses them. In these cases, getting early versions into use — even if they’re not yet very good — could build familiarity and surface real-world feedback. (This could also drive appetite for further development.)

In theory, this could be in tension with avoiding bad trust lock-in. But in practice, it’s not clear that bad trust lock-in becomes any likelier if tools in a specific area are developed earlier rather than later. Some tool is still going to get the first-mover advantage.^[2]

Support open and auditable epistemic infrastructure

To guard against trust lock-in, we want to make it easy for people to distinguish between tools which are genuinely doing the good trustworthy thing, and tools which may not be (but claim to be doing so). To that end, we want ways for people and communities to audit different systems — understanding their internal processes and measuring their behaviours. The goal is that if disputes arise about which tools are actually trustworthy, there’s an inspectable audit trail that can resolve them. In turn, this should reduce the incentives to create misleading tools in the first place.

Support development in incentive-compatible places

The incentives of whoever builds epistemic tools could matter — through thousands of small design decisions, through choices about what to optimize for, and through decisions about access and pricing. Development in organizations whose incentives are aligned with the public good (rather than with engagement, profit, or political influence) reduces the risk that tools are subtly shaped to serve the builder’s interests.

Ideally, you’d spur development among actors who are both well-resourced (as just discussed) and whose incentives are aligned with the public good. In practice, it may be difficult to find organizations that are excellent on both. A plausible compromise is for less-resourced organizations with better incentives to focus on publicly available evaluation of epistemic tools. This could be cheaper than producing them from scratch, and it could create better incentives for the larger actors.

Examples

Forecasting

Automated R&D will probably be able to improve forecasting tools without severe ground truth problems, so epistemic misalignment is less of a concern.^[3] Appetite for investment probably already exists, and adoption should be significantly helped by the ability of powerful tools to develop an impressive, legible track record.

The most useful near-term investment might be in data infrastructure. For instance, LLMs trained with strict historical knowledge cutoffs could enable much better science of forecasting by allowing methods to be tested against questions whose answers the system genuinely doesn’t know.

Misinformation tracking

Trust lock-in is the central concern. A tool that becomes widely trusted for adjudicating what’s true has enormous influence, and if that trust is misplaced it could be very hard to dislodge. Open and auditable approaches are especially important here.

Because of the trust lock-in concern, the automation of R&D may exacerbate challenges. Currently, building good misinformation-tracking tools requires editorial judgement and domain expertise — things responsible actors tend to have more of. Automation shifts the bottleneck towards compute, which is more symmetrically available. This could increase the urgency of getting started on these tools and driving adoption early.

Automating conceptual research

This is the case where epistemic misalignment is most concerning. Ground truth is extremely hard — what makes a conceptual clarification actually clarifying rather than just satisfying? Humans are poor judges of this in real time, so e.g. a training process that rewards outputs humans find helpful could easily optimize for persuasiveness rather than truth-tracking.

One plausible direction here is to research training regimes (such as self-play loops) that we have some reason to believe should ground to truth-tracking, with specific attention to how they could go wrong. Adoption could be an issue, but we’re also worried about the other direction, with adoption coming too easily before we have good ways of evaluating whether the tools are actually helping.

This article was created by Forethought. See the original on our website.

^{^}
Epistemic misalignment issues may also appear in areas where ground truth is well-defined but hard to access, such as very long-run forecasts. Theoretical work also seems valuable for such areas (because it’s unclear how to evaluate and train for good performance by default).
^{^}
In fact, it might be bad if people who are worried about bad trust lock-in select themselves out of getting that first-mover advantage.
^{^}
Although at some quality level, we have to start worrying about self-affecting prophecies. AI forecasters will have to be very trusted indeed before that becomes a serious issue, which gives us a lot of time to figure out how best to handle the issue.

Excellent post. AI for better epistemics is one ray of hope.

I wrote about this in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. Abram Demski wrote about it in Anti-Slop Interventions? and elsewhere. Steve Byrnes raised an excellent issue with this whole approach and hope in this comment thread on that post.

The problem is this: improving AI's epistemics in a general way will also improve its capabilities, perhaps just as much as you improved its epistemics. Doing good research is in large part a problem of epistemics. Discerning accurately what the problem is, what is known about it, and what can be inferred is much of the hard part of a research project.

I doubt the reality is quite as bad as a 1:1 capabilities and epistemics tradeoff by necessity.

You say that this can probably be done with current systems. I'm curious what you're thinking of. I think this is true; for instance, the Google Co-Scientist project seemed to produce amazing results from Gemini 2.0 two years ago. It used extensive scaffolding for better hypothesis generation and testing, and used a lot of run-time compute.

So I'm not sure what to do with this tension. I am definitely hopeful that improvements in epistemics will add some sanity to the AGI race, even if those improvements are just byproducts of capabilities.

I'm also somewhat optimistic that market pressures will reduce slop and improve epistemics. Wider business adoption will increase the incentives to have systems that are correct rather than pleasant to use (sycophantic).

I'm curious about your thought on those questions, since you've clearly thought about this a lot.

(I think this is a mostly serious post with some jokey nods to April Fools' Day?)

Wei Dai has written about this concern in terms of the importance of metaphilosophy. We agree that there is a crucial concern here.

I have noticed that even people who share my general concern don't seem to like to frame it the way that I do, i.e., as a need to solve metaphilosophy. For example in this post you never mention "metaphilosophy" again or talk about trying to understand the nature of philosophy. I'm pretty curious why that is.

(By "need" I mean it seems to be the only way to achieve high justified confidence that the concern has been addressed, not that the world is certainly doomed if we don't solve metaphilosophy. I can see various ways that we "get lucky" and the problem kind of solves itself.)

One way we've been historically "lucky" is that there was a fairly high correlation between philosophical competence and technological competence, on a cultural level, most prominently in the example of England both inventing analytical philosophy and starting the Industrial Revolution. (But continental Europe historically and China today offer some counter evidence, as they're technologically competitive without having a comparably competent philosophical tradition.) It seems like a key strategic question here is how high this correlation will be going forward, by default.

A couple of reasons to suspect the correlation might not hold up:

It may be that philosophy gave some one-time advantages to technological competence (via e.g. logic, philosophy of math, philosophy of science) in the past, but it won't be needed much in the future for continued technological progress. (This seems to explain why China can be technologically competitive today while making little apparent philosophical progress.)
Maybe the correlation is related to human mind architecture or constraints and AI will be different. This seems plausible as AI capabilities are already more "jagged" than humans today.

(And of course even if the correlation does hold up, it may not be strong enough or apply at the right level of organization to make a difference.)

I have noticed that even people who share my general concern don't seem to like to frame it the way that I do, i.e., as a need to solve metaphilosophy. For example in this post you never mention "metaphilosophy" again or talk about trying to understand the nature of philosophy. I'm pretty curious why that is.
(By "need" I mean it seems to be the only way to achieve high justified confidence that the concern has been addressed, not that the world is certainly doomed if we don't solve metaphilosophy. I can see various ways that we "get lucky" and the problem kind of solves itself.)

In between " 'we get lucky' and the problem kind of solves itself" and "we solve metaphilosophy and achieve high justified confidence", there's "we do a bunch of things that we think will help on the margin without leading to high confidence that the problem gets solved, and partially as a result of our interventions and partially due to luck, things turn out fine". That's more what I'm aiming at. And this doesn't require tackling or solving metaphilosophy directly (which seems really difficult!), which is probably why I don't use the term that much.

In between " 'we get lucky' and the problem kind of solves itself" and "we solve metaphilosophy and achieve high justified confidence", there's "we do a bunch of things that we think will help on the margin without leading to high confidence that the problem gets solved, and partially as a result of our interventions and partially due to luck, things turn out fine". That's more what I'm aiming at. And this doesn't require tackling or solving metaphilosophy directly (which seems really difficult!), which is probably why I don't use the term that much.

Thanks for answering this! Aren't you worried that by presenting this "in between" approach without mentioning that it would still require some amount of luck, or equivalently still incur some amount of risk (of potentially catastrophic philosophical failure/error), it can be misleading for people who might read the post without themselves specializing in this area? I'm thinking of e.g. AI company leaders who might decide to push ahead with AI development or deployment without realizing that they're incurring this kind of risk, or voters/politicians who read this and think "this doesn't seem so hard, the AI companies can probably handle this."

BTW how high do you think this remaining risk is (assuming the "in between" approach is pursued/resourced to a high degree)?

Aren't you worried that by presenting this "in between" approach without mentioning that it would still require some amount of luck, or equivalently still incur some amount of risk (of potentially catastrophic philosophical failure/error), it can be misleading for people who might read the post without themselves specializing in this area?

I don't want to mislead people, so I guess it's just a question about how people interpret posts like this. I suppose I would worry about this if I was presenting something framed like a decisive solution. But I'd think it's pretty normal for posts to talk about a problem and present some promising-seeming interventions without thereby implying that the problem would be entirely solved if the interventions got carried out. (E.g.: If I read a post about climate change that suggested some interventions, I wouldn't assume that those interventions would necessarily solve the whole problem.)

It also feels relevant that I don't have a prescription for actions anyone could take to predictably achieve high justified confidence that the problem was solved. I don't think that discovering a solution to metaphilosophy would be sufficient, because a big part of the problem is that people might not care about doing good philosophy even if a solution existed. I think that slower AI development (including a global pause) would probably be helpful on the current margin, for this risk, but I don't think that a very long pause would get the risk down to very low levels. (There's just a bunch of stuff that influence societal epistemics and its hard to know whether it's heading in a good or bad direction on the time-scale of decades, at present technology levels. And I expect societal epistemics to have a big influence on the risk here.)

I agree the factors you mention are relevant, but in my mind they don't reduce the risk of misinterpretation to be low enough to make it not worth adding a sentence or two of explicit statement or explanation of the remaining risks. I think the main difference with something like climate change is that the latter is much more well-known and anyone reading an article about a partial solution is highly likely to already have a good idea of the overall shape of the problem (thus making the analogous misinterpretation very unlikely), and this can't be said for an article on "AI for epistemics".

I don't think that discovering a solution to metaphilosophy would be sufficient, because a big part of the problem is that people might not care about doing good philosophy even if a solution existed.

Yeah I worry about this a lot too, but solving metaphilosophy can plausibly help substantially here, similar to how solving (in large part) the philosophy of math and science has (directly or indirectly) made many more people care about doing good math and science. Directly, it seems a lot easier to care about something if you actually understood what it really is, as that would likely tell you a lot about why it might be valuable. Indirectly, it would likely speed up philosophical progress and make it more prestigious, less contentious, appear less wasteful/pointless (to many), etc.

I completely agree with this notion and yet have found no particular medium to engage in the meta-philosophy - and have been pretty disappointed to find the consistency under which this discourse medium largely rejects it.

Every time I seem to engage these notions in comments or quick takes I get absolutely karma-drained, until I have to pause, comment on some social phenomena as a quick karma pump, before being allowed again to talk about what I consider really matters.

For example, I'll question the definition or goals of alignment at the SI limit, and be promptly down-voted by practitioner's and told I don't understand what alignment is - because of how its defined in the field today, practically.

This is a frustrating notion - when the gap between modern practical considerations and SI at the limit seems an epistemic gap orders of magnitude bigger than that between any modern philosophy and engineering department.

Discussions over what the goal of alignment should be, or alignment as it pertains to SI, remain in the space of meta-ethics and philosophy for now - and it seems like a form of semantic high-jack has laid claim to the SI conversation, that re-framed meta-ethical or meta-philosophical digressions as 'non instrumental' (to what exactly?) or otherwise uninformed navel gazing.

But, IMO, the highest value alignment problem one could solve is that of the unrescuability of moral internalism, but all my attempts to engage the notion are fruitless. Reddit philosophy is a bit watered down and I can't seem to find a third space here.

Any suggestions on where to go to become useful to these ends?

Um, I see a lot of people raising similar issues here and being upvoted. I wonder if it might be the way you're raising these rather than the topic. I think you're in the right place and should consult an AI about how to communicate so your points are appreciated.

But continental Europe historically and China today offer some counter evidence, as they're technologically competitive without having a comparably competent philosophical tradition.

continental europe historically seems like a clear example of high technological competence together with high philosophical competence (both measured relative to the time)