Will this journal be open to very abstract philosophy (as opposed to sticking to experiments, math, neuroscience, etc.)? In my expert opinion most of the key questions in AGI alignment are philosophical in nature--in other words, they are centered on areas that lack foundational concepts and data. A bit more specifically, would it be open to very speculative philosophy, meaning philosophy that is
I think this would be very consonant with your stated scope, though it would be especially hard for reviewers to judge.
We've been discussing scope a lot, and this is indeed a big question. Some considerations:
Incidentally, if someone wanted to help make the case for philosophy in the journal, a very useful thing would be to compile a list of papers (which could be a mix of published in traditional journals and not, and need not be strictly on alignment) to serve as exemplars of what should be included.
Thanks. Makes sense, yeah, seems tough. Good luck :)
but I think this is not the sort of philosophy you're interested in.
Yeah, definitely not, unfortunately.
Incidentally, if someone wanted to help make the case for philosophy in the journal, a very useful thing would be to compile a list of papers (which could be a mix of published in traditional journals and not, and need not be strictly on alignment) to serve as exemplars of what should be included.
Yeah someone should maybe do that. I would submit Eliezer's TDT paper, I think.
I maybe wouldn't directly submit this, because it's too speculative (unclear and unclearly explained), but I would still gesture at it or something:
and followups
(Like, these probably couldn't go in a journal, and this particular work may not be that high quality / may boil down by 5x to a good paper, but this is the general type of investigation that I would hope for there to be room for if feasible.)
I'm sure you've had lots of discussion about this; why the label "AI alignment"?
I think "alignment" refers to the somewhat specific task of aligning an AI's values to human values. But my understanding of your actual scope is more like "theoretical AI safety". A lot of foundational work is done with the intention that it will eventually help with alignment but definitely isn't about alignment, and a lot of theoretical AI safety work isn't about alignment per se at all. For example, some of my research problems are trying to understand which types of AI systems are not dangerous, not because their values are aligned with ours, but because they're not unrestrained consequentialists.
I wish you the best of skill standing up to the incentives which have enshittified the academic publishing ecosystem.
Fwiw, I'm trying to address this in a different context (economics/policy) at The Unjournal (Unjournal.org). I think I have some sense of how to make things better there, and what some of the blockers are.
Hopefully some of the insights and tools will carry over/be relevant to this context as well, and we can leverage and extend what has worked.
Yes, I'm excited to see what we can learn from David's experience, especially given the incentive designer's insight that he brings to this. We also, collectively, have some experience with the ILIAD conferences, which were a precursor experiment with alternative compensation mechanisms. See Proceedings of ILIAD: Lessons and Progress for some analysis of that project.
What happens if AI labs offer to support you, expecting that you make it more prestigious to publish capability evals?
Sounds like a lot more risk of bias (and appearance thereof) than it's worth. At the least, I figure you'd need to have a disclosure on every paper authored by an employee of the company, as well as conflict-of-interest rules making sure the action editor and reviewers were unbiased. Would be a pain, and still suspect. (Here's GPT's summary of how existing journals handle this, most commonly in medical research: https://chatgpt.com/share/69a992c3-75d8-8002-a592-a8053ee1cdbe )
An intermediate and more plausible case would be personal donations from a former or current employee of a frontier company; we expect many to be philanthropically motivated in the coming years. Imo, this is something we'd consider, but I haven't thought about it much yet. We're set for funding for the first year.
If we are successful in standing up a good and well-respected journal, I expect there will be many funders interested in supporting us. (And if we're not successful, the issue is moot.) So I'm not too worried about getting backed into a corner where our only option to keep running is money from a potentially biasing source. We'd ideally like a broad diverse base of funders, like the arXiv.
Sounds promising! Curious about whether you have plans to accept papers based on experimental setup instead of results (to reduce publication bias) and if you'll consider a "press abstract" designed to help journalists disseminate information to the broader public?
Hmm. Ultimately it would be up to the editorial board, but here's why I personally think these features are probably low priority given their nontrivial cost: (1) I presume we are talking about numerical experiments, and I expect the foundational/conceptual topics we want to publish on are less vulnerable to publication bias than, say, experimental psychology or economics. It would be more like pre-registering numerical math papers. That said, if you think the alignment literature has big problems with publication bias, I'd be interested to hear more. (2) Our primary audience is other researchers. Often, journals are motivated to provide press abstracts to induce popular coverage (by making a time-pressed journalist's life easier, as with a press release), and increasing popular coverage is not one of our goals. It can also be a corrupting influence (although there are steps we could take to reduce this). High-quality popular-science journalists will generally take the time to talk to the authors and outside researchers to get the story right.
(1) yeah this makes sense! I do think that accepting experimental work based on results rather than experimental setup is a structure that leads to publication bias, but given you're looking to be more foundational/conceptual, I don't think this will be an issue here.
(2) "increasing popular coverage is not one of our goals" fair enough! I look forward to seeing the first issue (:
(Caveat: I'm not an expert in this field.) I expect there could be some value in a 'registered reports'-style approach for these high-cost computational experiments.
In informal reporting (ACX, this forum) I recall reading some mentions of something related to the "publication bias" story in econ/social science. Perhaps more like concerns about labs reporting selectively: both researchers promoting capabilities (selectively reporting successes) and safety-minded researchers accused of cherry-picking the most alarming failures/misalignment evidence.
Yeah, I can definitely see the selective-reporting problem, which goes beyond the problem of negative results being unfairly denied publication. But to combat selective reporting, you'd really need to require preregistered experiments, which is more of a collective-action problem between journals, since if any of them allow un-preregistered experiments, the authors can just publish there. (Of course, you can try to convince the broad community to ignore all experiments that aren't preregistered, but if you can do this then you've already won; the journals will be strongly incentivized to follow suit.)
Required preregistration is just very cumbersome and difficult to do for exploratory science; it really seems feasible only for the later stages of things like medical trials, or big contentious questions requiring a decisive experiment.
This publication bias story in ML is a whole can of worms which I would love to open at some point. tl;dr: it is a problem, but the field has semi-accidentally mitigated many of the worst excesses of it. There is an IMO massively under-regarded work on this, Moritz Hardt's Machine Learning Benchmarks, which I will write a LW review of some day if I have time.
I'm curious about the timeline. E.g., when do you expect to open the first call for papers, when do you expect the first issue to be published, etc?
It depends on a few factors, but April at the earliest for initial submissions. Publication will almost certainly be on a rolling basis (no discrete issues). Our ambitious goal is to drive the submission-to-publication time down to something like a month, but it will require combining several new tricks, so it won't be that fast at the beginning.
In order to not drown in slop submissions you could require each author to stake as much money as the reviewers would be happy to be paid to reject their submission as slop.
I like this idea aesthetically. I foresee some challenges in making "staking" something that won't trigger alarms in the existing research bureaucracies that host many of our potential authors. If you have clever ideas for how to handle that I would be curious to hear.
I guess during signup you could require authors to say what existing research bureaucracy they are working for, and only if they click the "I am an independent researcher" link are they introduced to staking.
An example, for what it's worth: Quantum is a relatively new (10-year-old) physics arXiv-overlay journal that runs on volunteer effort and modest publication fees (~$700). They didn't want the fees to be a barrier to submitting, so they have a very easy process for getting them waived; you basically just have to ask. My understanding is that they still have not been overrun with slop, and whenever I am asked to review, the papers are of reasonable quality. So it does not seem they are foisting the slop handling onto reviewers; desk rejection by the editors appears to be enough.
I'm not speaking on behalf of the initiative here, but I do see some promise in author submission fees being used to cover referee compensation. If done judiciously.
I'm interested to hear more. Would it mostly be for practical reasons (financial sustainability), or to reduce the submission of bad work that wastes editor/reviewer time?
We do not yet plan to support replications of empirical work. Organisationally, there is a desire to keep our opening scope tight and theoretical, to avoid diffuse messaging at start-up.
Personally, I would make the case that replications are not as important in ML/AI research as in the physical sciences (although this depends somewhat on what we mean by "replications").
That said, I think there is a strong argument for replications generally, and maybe in this field too; if the editorial board agreed with that, then that is what we would do. I am obliged at this point to mention the connection to the Unjournal work that David has mentioned elsewhere in these comments.
This post seems written as if it's "addressed to" the lesswrong community, rather than the broader community of researchers who might want to publish in such a journal. Was this intentional?
We are trying to do both, in that we are attempting to be a bridge between LW and wider scientific communities. Where do you feel our tone might be excluding domain scientists?
I think the general sense is that this is written for a LW audience. If I'd point to specific wordings:
I think how other organizations handle this sort of thing is that they may have one post on Lesswrong for this specific audience, and a second, less detailed post for a broader community on their website. E.g., compare Anthropic's RSP update with Holden's post on the topic.
Concretely, I think it seems like your post assumes some of the worldviews and assumptions of the lesswrong-ish alignment community, and so general academics may feel like the post is not addressed to them.
Thanks, this is specific and useful. I think it's less that we're attempting to target LW and more that it's just how we tend to talk. We'll work on keeping the word choice more conventional and professional.
tl;dr: We’re incubating an academic journal for AI alignment: rapid peer review of foundational alignment research that the current publication ecosystem underserves. Key bets: paid attributed review, reviewer-written synthesis abstracts, and targeted automation. Contact us if you’re interested in participating as an author, reviewer, or editor, or if you know someone who might be.
Experimental Infrastructure for Foundational Alignment Research
This is the first in a series of “build-in-the-open” updates regarding the incubation of a new peer-reviewed journal dedicated to AI alignment. Later updates will contain much more detail, but we want to put this out soon to draw community participation early. Fill out this form to express your interest in participating as an author, reviewer, editor, developer, manager, or board member, or to recommend someone who might be interested.
The Core Bet
Peer review is a crucial public good: it applies scarce researcher time to sort new ideas for focused attention from the community, but it is undersupplied because individual reviewers are poorly incentivized. Peer review in alignment research is particularly fragmented. While some parts of the alignment research community are served by existing venues, such as journals and ML conferences, there are significant gaps. These gaps arise from a combination of factors, including the lack of appropriate reviewer pools for some kinds of work. Moreover, none of these institutions move as fast as we think they could in this era, mainly because of inertia. Various preprint servers and online forums avoid these problems, but generally at the expense of quality certification and institutional legitimacy. Furthermore, their review coverage can suffer when attention is misallocated due to trends and hype.
Our bet is that we can create a venue that provides institutional leverage (coordination, compensation) and legibility (citations, archival records, stable indexing) without the institutional friction that kills speed. Instead, we can operate at a small, agile scale that enables dedicated tooling and rapid experimentation.
Operational Design
We are designing the journal around a few specific, high-leverage hypotheses:
Our forthcoming formal description of the journal will have much more detail. Contact us to help shape it.
Scope
“AI Alignment” is a broad and often contested label. To provide a high-signal environment from day one, we are making a deliberate choice regarding our starting point:
This is just a starting point. The current team is not the final arbiter of what constitutes “alignment” for all time. While we are setting the initial direction to get the engine running, the long-term responsibility for expanding, narrowing, or shifting the scope will belong to the editorial board. Our job right now is to build a vessel sturdy enough to support those debates.
Governance
This project is in its incubation phase. As the “plumbing” of the journal grows, editorial and strategic authority will be taken up by an editorial board of respected researchers from the alignment community. The journal will be philanthropically funded, so our funders will naturally influence how the journal develops, but we are committed to building a self-sustaining, public-good institution that belongs to the field.
Advisory board
We are grateful for the advice and support from the initial members of our advisory board:
Institutional stewardship
This project could fail. Poor execution could create a status-chasing bottleneck, further pollute the signal-to-noise ratio in alignment research, or just waste researchers' time. Poor coordination with other initiatives could hinder rather than help the field.
To reduce this risk, we will engage as a good citizen with the alignment research community. We will track and publish our own performance metrics (turnaround times, reviewer load, and author satisfaction) and solicit the wider community's assessment of whether we are participating cooperatively and productively in the publication ecosystem. Continuing the journal will be contingent upon positive community feedback and the editorial board's ongoing reassessment of counterfactually positive impact. Accepted papers will remain online regardless of the ultimate fate of the project.
Next steps
Join the founding team
A journal is only as good as its community, and you could be part of it. We want participation in the Alignment Journal—as an editor, author, or reviewer—to be credibly status-accruing. This should be a justifiable use of time toward your career goals.
If you believe this infrastructure is a missing piece of the safety ecosystem, we want your help.
We’ll soon share an initial description of our design and plans for the journal with much more detail, so reach out now if you’d like to shape it.
Support us online
We welcome you to follow us on all the usual platforms:
@AlignmentJrnl
Above all, our content will be hosted at our main site, alignmentjournal.org.
Contributors to this document
We are grateful to Geoffrey Irving, Victoria Krakovna, and David Duvenaud for their support and feedback on this post. The authors do not commit, in perpetuity, to every detail of the journal strategy outlined here. This is the first stage in an ongoing consultation, and we expect to adjust our positions in the face of new evidence about best strategies. All responsibility for mistakes in content or execution resides with the current managing editors, Dan MacKinlay and Jess Riedel.
We intend to experiment with a variety of possible ratings, certifications and other quality signals. This is our starting proposal, as it is one we have some experience with.
The practical implications of the emphasis on achieving state-of-the-art results on benchmarks in machine learning research are complicated and contentious and, we argue, not yet well understood even inside the field. For an opinionated introduction, see Moritz Hardt’s book, The Emerging Science of Machine Learning Benchmarks.