Conditioning as a Crux-Finding Device
Say you disagree with someone, e.g. they have low pdoom and you have high pdoom. You might be interested in finding cruxes with them.
You can keep imagining narrower and narrower scenarios in which your beliefs still diverge. Then you can back out properties of the final scenario to identify cruxes.
For example, you start by conditioning on AGI being achieved - both of your pdooms tick up a bit. Then you also condition on that AGI being misaligned, and again your pdooms increase a bit (if the beliefs move in opposite dire...
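To make the procedure concrete, here is a minimal sketch with made-up scenario labels and probabilities (purely illustrative, not anyone's actual numbers):

```python
# Two people's reported P(doom) under successively narrower scenarios.
# The conditions under which the gap persists (or collapses) point at cruxes.

scenarios = [
    "unconditional",
    "| AGI is achieved this century",
    "| ... and that AGI is misaligned",
    "| ... and it is deployed widely before anyone notices",
]

# Hypothetical P(doom) each person reports under each cumulative condition.
low_pdoom_person  = [0.05, 0.10, 0.40, 0.85]
high_pdoom_person = [0.70, 0.80, 0.90, 0.92]

for scenario, lo, hi in zip(scenarios, low_pdoom_person, high_pdoom_person):
    print(f"{scenario:55s} {lo:.2f} vs {hi:.2f}   gap = {abs(hi - lo):.2f}")

# If the gap stays wide through "misaligned AGI" but collapses once wide,
# undetected deployment is stipulated, then "would a misaligned AGI really
# get deployed/escape before anyone notices?" is a good crux candidate.
```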
Got it thanks!
(e.g. any o1 session which finally stumbles into the right answer can be refined to drop the dead ends and produce a clean transcript to train a more refined intuition)
Do we have evidence that this is what's going on? My understanding is that distilling from CoT is very sensitive—reordering the reasoning, or even pulling out the successful reasoning, causes the student to be unable to learn from it.
I agree o1 creates training data, but that might just be high quality pre-training data for GPT-5.
Why does it make the CoT less faithful?
Because you are training the CoT to look nice, instead of letting it look however is naturally most efficient for conveying information from past-AI-to-future-AI. The hope of Faithful CoT is that if we let it just be whatever's most efficient, it'll end up being relatively easy to interpret, such that insofar as the system is thinking problematic thoughts, they'll just be right there for us to see. By contrast, if we train the CoT to look nice, then it'll e.g. learn euphemisms and other roundabout ways of conveying the same information to its future self that don't trigger any warnings or appear problematic to humans.
Favorite post of the year so far!
My favored version of this project would involve >50% of the work going into the econ literature and models on investor incentives, with attention to
And then a smaller fraction of the work would involve looking into AI labs, specifically. I'm curious if this matches your intentions for the project or whether you think there are important lessons about the labs that will not be found in the existing econ literature.
How does the fiduciary duty of companies to investors work?
OpenAI instructs investors to view their investments "in the spirit of a donation," which might be relevant for this question.
I would really like to see a post from someone in AI policy on "Grading Possible Comprehensive AI Legislation." The post would lay out what kind of safety stipulations would earn a bill an "A-" vs a "B+", for example.
I'm imagining a situation where, in the next couple years, a big omnibus AI bill gets passed that contains some safety-relevant components. I don't want to be left wondering "did the safety lobby get everything it asked for, or did it get shafted?" and trying to construct an answer ex-post.
I don't know how I hadn't seen this post before now! A couple weeks after you published this, I put out my own post arguing against most applications of analogies in explanations of AI risk. I've added a couple references to your post in mine.
Adult brains are capable of telekinesis, if you fully believe in your ability to move objects with your mind. Adults are generally too jaded to believe such things. Children have the necessary unreserved belief, but their minds are not developed enough to exercise the ability.
File under 'noticing the start of an exponential': A.I. Helped to Find a Vast Source of the Copper That A.I. Needs to Thrive
Scott Alexander says:
Suppose I notice I am a human on Earth in America. I consider two hypotheses. One is that everything is as it seems. The other is that there is a vast conspiracy to hide the fact that America is much bigger than I think - it actually contains one trillion trillion people. It seems like SIA should prefer the conspiracy theory **(if the conspiracy is too implausible, just increase the posited number of people until it cancels out)**.
I am often confused by the kind of reasoning at play in the text I bolded. Maybe someone can help sort me out....
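For reference, here is the calculation I take the bolded reasoning to be making (the standard SIA update, with illustrative numbers):

$$P(H \mid \text{I am an observer}) \;\propto\; P(H)\, N_H,$$

where $N_H$ is the number of observers (in my epistemic situation) if $H$ is true. Even with a prior penalty of, say, $10^{-12}$ on the conspiracy, $N_{\text{conspiracy}} = 10^{24}$ versus $N_{\text{normal}} \approx 3 \times 10^{8}$ gives

$$\frac{P(\text{conspiracy} \mid \cdot)}{P(\text{normal} \mid \cdot)} \approx \frac{10^{-12} \cdot 10^{24}}{1 \cdot 3 \times 10^{8}} \approx 3 \times 10^{3},$$

so whatever prior penalty you assign can apparently be cancelled by positing enough people.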
Jacob Steinhardt on predicting emergent capabilities:
...There’s two principles I find useful for reasoning about future emergent capabilities:
- If a capability would help get lower training loss, it will likely emerge in the future, even if we don’t observe much of it now.
- As ML models get larger and are trained on more and better data, simpler heuristics will tend to get replaced by more complex heuristics... This points to one general driver of emergence: when one heuristic starts to outcompete another. Usually, a simple heuristic (e.g. answering directly) ...
I think you could also push to make government liable as part of this proposal
There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.
Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.
Open source --> more capabilities R&D --> more profitable applications --> more profit/investment --> shorter timelines
There's a countervailing effect of democratizing safety research, which one might think outweighs because it's so much more neglected than capabilities, more low-hanging fruit.
GDP is an absolute quantity. If GDP doubles, then that means something. So readers should be thinking about the distance between the curve and the x-axis.
But 1980 is arbitrary. When comparing 2020 to 2000, all that matters is that they’re 20 years apart. No one cares that “2020 is twice as far from 1980 as 2000” because time did not start in 1980.
This is the difference between a ratio scale and a cardinal scale. In a cardinal scale, the distance between points is meaningful, e.g., "The gap between 2 and 4 is twice as big as the gap between 1 and 2." In a r...
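A quick toy illustration of the difference (the numbers are made up):

```python
# GDP is measured from a true zero, so ratios are meaningful.
gdp_2000, gdp_2020 = 10.0, 20.0        # trillions (hypothetical)
print(gdp_2020 / gdp_2000)             # 2.0 -> "GDP doubled" means something

# Calendar years have no true zero, so ratios depend on an arbitrary origin.
for origin in (0, 1900, 1980):
    print((2020 - origin) / (2000 - origin))
# 1.01, 1.2, 2.0 -- the "ratio of years" changes with the origin,
# while the difference 2020 - 2000 = 20 years does not.
```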
I just came across this word from John Koenig's Dictionary of Obscure Sorrows, which nicely captures the thesis of All Debates Are Bravery Debates.
redesis n. a feeling of queasiness while offering someone advice, knowing they might well face a totally different set of constraints and capabilities, any of which might propel them to a wildly different outcome—which makes you wonder if all of your hard-earned wisdom's fundamentally nontransferable, like handing someone a gift card in your name that probably expired years ago.
(and perhaps also reversing some past value-drift due to the structure of civilization and so on)
Can you say more about why this would be desirable?
Most civilizations in the past have had "bad values" by our standards. People have been in preference falsification equilibria where they feel like they have to endorse certain values or face social censure. They probably still are falsifying preferences and our civilizational values are probably still bad. E.g. high incidence of people right now saying they're traumatized. CEV probably tends more towards the values of untraumatized than traumatized humans, even from a somewhat traumatized starting point.
The idea that civilization is "oppressive" and some ...
A lot of this piece is unique to high school debate formats. In the college context, every judge is themself a current or previous debater, so some of these tricks don't work. (There are of course still times when optimizing for competitive success distracts from truth-seeking.)
Here are some responses to Rawls from my debate files:
A2 Rawls
1. It’s pretty much a complete guide to action? Maybe there are decisions where it is silent, but that’s true of like every ethical theory like this (“but util doesn’t care about X!”). I don’t think the burden is on him to incorporate all the other concepts that we typically associate with justice. At very least not a problem for “justifying the kind of society he supports”
2. Like the two responses to this are either “Rawls tells you the true conception of the good, ignore the other ones” or “just allow for other-regarding preferences and proceed as usual”...
What % evals/demos and what % mech interp would you expect to see if there wasn't Goodharting? 1/3 and 1/5 doesn't seem that high to me, given the value of these agendas and the advantages of touching reality that Ryan named.
these are the two main ways I would expect MATS to have impact: research output during the program and future research output/career trajectories of scholars.
We expect to achieve more impact through (2) than (1); per the theory of change section above, MATS' mission is to expand the talent pipeline. Of course, we hope that scholars produce great research through the program, but often we are excited about scholars doing research (1) as a means for scholars to become better researchers (2). Other times, these goals are in tension. For example, some mentors ...
Thanks, Aaron! That's helpful to hear. I think "forgetting" is a good candidate explanation because scholars answered that question right after completing Alignment 201, which is designed for breadth. Especially given the expedited pace of the course, I wouldn't be surprised if people forgot a decent chunk of A201 material over the next couple months. Maybe for those two scholars, forgetting some A201 content outweighed the other sources of breadth they were afforded, like seminars, networking, etc.
Today I am thankful that Bayes' Rule is unintuitive.
Much ink has been spilled complaining that Bayes' Rule can yield surprising results. As anyone who has taken an introductory statistics class knows, it is difficult to solve a problem that requires an application of Bayes' Rule without plugging values into the formula, at least for a beginner. Eventually, the student of Bayes may gain an intuition for the Rule (perhaps in odds form), but at that point they can be trusted to wield their intuition responsibly because it was won through disciplined pra...
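As a reminder of just how unintuitive it is, here is the standard base-rate example worked both ways (the usual illustrative numbers, not from any particular study):

```python
# Formula form vs. odds form of Bayes' Rule on a rare-disease test.

p_disease = 0.01          # prior: 1% base rate
sensitivity = 0.90        # P(positive | disease)
false_positive = 0.09     # P(positive | no disease)

# Formula form: P(D | +) = P(+ | D) P(D) / P(+)
p_pos = sensitivity * p_disease + false_positive * (1 - p_disease)
posterior = sensitivity * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.3f}")   # ~0.092, lower than most people guess

# Odds form: posterior odds = prior odds * likelihood ratio
prior_odds = p_disease / (1 - p_disease)
likelihood_ratio = sensitivity / false_positive
posterior_odds = prior_odds * likelihood_ratio
print(f"Same answer via odds: {posterior_odds / (1 + posterior_odds):.3f}")
```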
Update: We have finalized our selection of mentors.
I have a friend Balaam who has a very hard time saying no. If I ask him, “Would it bother you if I eat the last slice of pizza?” he will say “Of course that’s fine!” even if it would be somewhat upsetting or costly to him.
I think this is a reference to Guess/Ask/Tell Culture, so I'm linking that post for anyone interested :)
This happens in chess all the time!
You recommend Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. I wonder if you would also include Why Would AI "Aim" to Defeat Humanity on this list? I know it came out after this post.
MIT Tech Review doesn't break much news. Try Techmeme.
Re "what people are talking about"
Sure, the news is biased toward topics people already think are important because you need readers to click etc etc. But you are people, so you might also think that at least some of those topics are important. Even if the overall news is mostly uncorrelated with your interests, you can filter aggressively.
Re "what they're saying about it"
I think you have in mind articles that are mostly commentary, analysis, opinion. News in the sense I mean it here tells you about som...