I'm reminded of an episode from the ozone hole saga: the original researchers who came up with the ozone depletion theory, Rowland and Molina, discovered a caveat to their theory that would imply the effects of CFCs would be much less than they initially expected. they felt compelled by professional honor to publish these results, even though they cut against their original theory. as expected, the publication of these results (and from the original authors, no less) gave the CFC industry plenty of ammunition to say "look, see, they were wrong all along, haha". however, their commitment to publishing their best understanding also earned them a lot of respect, and many people who thought Rowland and Molina had already made up their minds to be anti-CFCs came to think more highly of them. ultimately, further evidence swayed the consensus back in the direction that CFCs were in fact bad for ozone. if Rowland and Molina had tried to cover up their tentative negative results, the ensuing distrust probably would have poisoned their results a lot (though it's hard to evaluate this counterfactual).
(I'm working on a full length piece about the whole ozone hole saga, but this was so relevant that i felt a need to mention it.)
The examples and arguments in this post are more about honesty than quality. I agree that honesty will be important and will get more important as stakes go up.
But I disagree with the "quality will matter more" heuristic. I think that in many places the best actions during the singularity will be extremely "sloppy" by usual standards:
I think you should consider renaming your post to "Honesty matters most when stakes are highest".
the reaction to covid than carefully preparing an important cure
I think there's a good chance that spending 1-2 months doing RaDVaC-style targeting of parts of the viral genome, instead of choosing the spike protein in a day, where it's easy for the virus to mutate against it, would have been a good choice.
The low quality of the vaccine did produce problems.
This is brilliant. Very well done.
There are also other things that need to be done. In a sane world, there would be multiple replications of every AI safety study (I'm working on that).
My other actionable recommendation: Orgs doing safety research should put up bounties for anyone who finds errors in your work, with specific guidelines and a third party judge. If an organization believes a result is true and genuinely moves the needle on x-risk, it would be silly not to offer 25k to anyone who finds a bug in their codebase which qualitatively changes their results. Finding such a bug would be very valuable!!
There are also other things that need to be done. In a sane world, there would be multiple replications of every AI safety study (I'm working on that).
Just got around to your comment. I'm glad you're doing this! In my spare time I've done a bunch of lower effort critiques/replications of other research work, one of which I wrote up for InkHaven (at least much lower than your 'Reevaluating "Model Organisms of Emergent Misalignment"' piece). I think this is valuable, though I worry that a lot of replication work is too credulous to serve as a bug detection mechanism. (Generally it's very junior people doing the replication, who are understandably hesitant to critique established work, and who lack the context to make some of the more incisive critiques.)
and who lack the context to make some of the more incisive critiques.
Yup! This is a core limitation.
too credulous to serve as a bug detection mechanism.
I certainly do expect us to miss plenty of bugs. The hope is that by reimplementing the paper it would be somewhat unlikely to have the exact same bug. But this could happen or we could also have bugs of our own.
I'm planning on writing up something on the limitations of our work for transparency and so readers know how to interpret our findings. But the tl;dr is that replications can help but they aren't a ground source of truth. Really happy to talk if you have further thoughts on this or anything replications related!
I certainly do expect us to miss plenty of bugs.
To be clear, I'm not critiquing your work with this! And I don't think "bugs" is the right characterization -- I totally expect even a basic fresh reimplementation to catch obvious bugs -- rather than some fundamental limitation in the research methodology.
Has anyone thought of putting a prize on the alignment problem? I imagine the application form would be swamped by cranks/LLM slop but that feels solvable (e.g. requiring a small fee for each submission to pay the evaluator, requiring >1k LW karma).
I think this is hard because there is no way to specify the problem or judge submissions. If one were able to specify what it means to "solve alignment" in a verifiable way, they would already be 95% of the way there.
There are diminishing returns from effort to project quality, perhaps quality = log(effort). But there are increasing returns from quality to impact, impact = exp(k * quality) with k > 1. In this model it follows that impact = exp(k * log(effort)) = effort^k, meaning that quality is so important that it makes the overall returns to effort increasing.
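As a quick sanity check on the algebra in this toy model, here is a minimal Python sketch. The function names and the choice k = 2 are illustrative assumptions, not part of the original comment; the only thing taken from the comment is the pair of relations quality = log(effort) and impact = exp(k * quality).

```python
import math


def quality(effort: float) -> float:
    # Diminishing returns: quality grows with the natural log of effort.
    return math.log(effort)


def impact(effort: float, k: float) -> float:
    # Increasing returns: impact grows exponentially in quality.
    return math.exp(k * quality(effort))


k = 2.0  # hypothetical value; the model only requires k > 1

for effort in (1.0, 2.0, 4.0, 8.0):
    # exp(k * log(effort)) simplifies to effort ** k.
    assert math.isclose(impact(effort, k), effort ** k)

# With k > 1, doubling effort more than doubles impact.
assert impact(8.0, k) / impact(4.0, k) > 2.0
```

Under these assumptions the two curves compose into a power law, so whether overall returns to effort are increasing depends entirely on whether k is above 1.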
Ok, but why are there increasing returns to quality on impact? When a research output is high quality, it has enough robustness checks to be convincing that its claims are correct, it demonstrates a new technique, and it's possible to immediately build on it. These factors can IMO make a high quality output more valuable than 10 dubious papers published in parallel.
Only if it also happens to be important, right? The ten papers also gave you ten shots on goal. You can replicate the best one in detail with care, or someone else can.
I think a few more parameters are needed to model quality, I have seen a lot of wasted effort...
I’m not sure I agree with this framing. If a friend told me that they were struggling with improving the quality of their outputs, I don’t think that my first suggestion would be for them to put in more effort.
This seems to be falsely equating working fast with lying about your results? Work fast sometimes when it makes sense. Be honest about the limitations and mistakes. Please retitle this to "honesty counts most when stakes are highest". The lie in your example and other fraud cases started with a mistake, but there was a decision to cover it up instead of admitting it and continuing to do honest work.
Very true.
In my own experience, the feeling of urgency actively detracts from useful AI alignment work. Instead of forward-chaining towards a better understanding of intelligence/corrigibility/how-minds-work/etc you end up back-chaining from any kind of results you imagine would maybe do something.
Which isn't at all the right frame for approaching the problem, since the effectiveness of what you get is limited by your first thought about it.
On a related note, I think the overwhelming majority of alignment work isn't helping to address the core problems. Even great researchers like John Wentworth can somewhat miss the point.
(edited: grammar fix)
On the other hand, I think much medical progress is stalled by too high safety standards. I don't want to get into a political discussion here, but am interested in what this says about the general point.
I think it's mostly just that bad treatments aren't locked in forever, and good treatments have a lot of upside; but if you destroy the world then that's that.
I think safety standards (in the sense of preventing experiments from being run on the basis of potential adverse effects) are very different from scientific research standards. The first wants to limit bad effects of the act of research, the second wants to ensure the things you learn and communicate actually approximate truth.
If you run a study with bad methodology, your conclusions are unreliable, and no amount of arguing tradeoffs will make them reliable.
This would be a good argument for trying to pause AI research, wouldn’t it? Without the pressure, you can easily afford to have the slack to do these things.
I’m curious if that is what people who have thought about a pause think, that the current plan is no plan? But granting that, funding and total work hours spent on figuring that out are 10?-50?x less than the amount spent on technical safety research, so perhaps that isn’t so surprising? And there still isn’t a consensus plan for technical safety either beyond “iterate and hope.” I don’t think that should prevent technical safety research from being done. Should it prevent pause work being done?
I agree. I'm not implying we shouldn't work toward a pause, but commenting on the direction that work should go.
Just coming up with stronger arguments is a viable direction, and "we're bound to screw up safety under this much pressure" is the strongest argument I know of. If we add to that that people are more likely to become dishonest under that much pressure, that would be another point in favor. Probably that's what you meant?
But will people take their results in the direction of optimism if they think that might mean we all die? I'm not sure.
Quality of work is definitely more relevant than ever at the moment. Duke University near me had a similar issue where one of the most published scientists, Anil Potti, had epigenetic cancer diagnostic work go all the way to the clinic before other researchers figured out a lot of data had been removed to make the models work. Science has a good write-up on the study. It seemed like a similar case where pressure and wanting to confirm promising results led to a snowball of lies, much like Elizabeth Holmes at Theranos. I think in good-cause work and competitive environments there is even more temptation to move forward before quality has been double checked.
I would push back a bit on your specific example of double checking code written by Claude by hand. Do you also expect people to read through the Python interpreter code line by line before writing Python scripts? I'm not pushing back against double checking, just that the way things are double checked needs to be scalable. I don't know the answer, but maybe it's something along the lines of handwriting specific tests for the code you are generating.
Or, the end of the world is no excuse for sloppy work
One morning when I was nine, my dad called me over to his computer. He wanted to show me this amazing Korean scientist who had managed to clone stem cells, and who was developing treatments to let people with spinal cord injuries – people like my dad – walk again on their own two legs.
I don't remember exactly what he said next, or what I said back. I have a sense that I was excited too, and that I was upset when I learned the United States had banned this kind of research.
Unfortunately, his research didn’t pan out. No such treatment arrived. My dad still walks on crutches.
Years later, I learned that the scientist, Hwang Woo-Suk, had been exposed as a fraud.
In 2004, Hwang published a paper in Science claiming that his team had cloned a human embryo and derived stem cells from it (the first time anyone had done this). A year later, in 2005, he published a second paper claiming that they managed to repeat this feat eleven more times, producing 11 patient-specific stem cell lines for patients with type 1 diabetes, congenital hypogammaglobulinemia (a rare immune disorder), and spinal cord injuries. This was the result that, if true, would have helped my dad.
None of this was real. The 2004 cell line did exist, but was not a clone; investigators concluded that it was an unfertilized egg that had spontaneously started dividing. The 2005 cell lines did not exist at all; investigators later found that the data reported for all eleven lines had been fabricated from just two samples, and the DNA in those two samples did not match the patients they had supposedly been derived from.
My dad was not the only person Hwang had given hope to. On July 31st, 2005, Hwang had appeared on a Korean TV show. The dance duo Clon had just performed; one of its members, Kang Won-rae, had been paralyzed from the waist down in a motorcycle accident five years earlier, and had performed in his wheelchair. Hwang walked onto the stage and told a national audience, with tears in his eyes, that he hoped “for a day that Kang will get up and perform magnificently as he did in the past” – a day that was coming soon. He made similar promises to other patients and their families.
I don't think Hwang was a monster who set out to commit fraud for international acclaim. I think he was a capable scientist with real results. (Some of his lab’s cloned animals were almost certainly real clones, including the world’s first cloned dog Snuppy.) But over time, he repeatedly took what he felt was his only option.
The 2004 paper may have started as a real mistake; it’s possible his team genuinely thought the parthenogenetic egg was a clone. But by 2005, with a nation watching and a Nobel on the table and a paralyzed pop star looking at him on live television, there was no version of "actually, we can't do this yet" that he could bring himself to say. So he didn't say it.
The way in which Hwang began his downward spiral is what sticks out most to me. He started out a good scientist, with good results and an important field of study. But with tens of millions of dollars of funding, thousands of adoring fans, and all the letters written to him by hopeful patients and their families, Hwang likely felt the weight of the world on his shoulders. He had to do what he had to do, in order to not let them down.
I work in AI safety. Many of the people I work with believe (and I believe) that the next decade will substantially determine whether and how humanity gets through this century. The stakes are literally astronomical and existential, and the timelines may be short.
That is the weight we carry. And I worry that when push comes to shove, our scientific standards will slip (or are slipping) in order to not let other people down.
For example, wouldn’t it be the right choice to just accept the code written by Claude, without reading it carefully? We don’t have much time left, and we need to figure out how to do interpretability, or monitoring, or how to align models with personas, and so forth.
Why investigate that note of confusion about the new result you saw? Surely with the stakes involved, it’s important to push forward, rather than question every assumption we have?
Why question your interpretability tools, when they seem to produce results that make sense, and let you steer the models to produce other results that seem to make sense? Why flag the failed eval run with somewhat suspicious results, when the deadline for model release is coming soon, and evaluation setups are famously finicky and buggy anyways? Why not simplify away some of the nuance of your paper’s results, when doing so would let it reach a much larger audience?
I worry that it’s tempting for us to take the expedient choice and let our standards slip, precisely because the stakes are so high. But it is precisely because the stakes are so high, with all the real people who will be affected by the outcome, that we need to be vigilant.
Yes, timelines may be short and we may not have time to do all the research that we want. But slipping up and producing misleading or wrong research will only hurt, not help. And if we need to say "actually, we can't do that yet", then we should say as much.