I think a really substantial fraction of people who are doing "AI Alignment research" are instead acting with the primary aim of "make AI Alignment seem legit". These are not the same goal. A lot of good people can tell, and this makes them feel kind of deceived; it also creates very messy dynamics within the field, where people have strong opinions about what the secondary effects of research are (because that's the primary thing they are interested in), instead of asking whether the research points towards useful, true things for actually aligning the AI.
This doesn't feel right to me. Off the top of my head, it seems like most of the field is just trying to make progress. Most of those that aren't seem pretty explicit about not trying to solve alignment, and I'm also excited about most of their projects. I'd guess 10-20% of the field is in the "make alignment seem legit" camp. My rough categorization:
Make alignment progress:
This list seems partially right, though I would basically put all of Deepmind in the "make legit" category (I think they are genuinely well-intentioned about this, but I've had long disagreements with e.g. Rohin about this in the past). As a concrete example of this, whose effects I actually quite like, consider the specification gaming list. I think the second list is missing a bunch of names and instances, in particular a lot of people in different parts of academia, and a lot of people who are less core "AINotKillEveryonism" flavored.
Like, let's take "Anthropic Capabilities" for example, which is what the majority of people at Anthropic work on. Why are they working on it?
They are working on it partially because this gives Anthropic access to state-of-the-art models to do alignment research on, but I think to an even greater extent they are doing it because this gives them a seat at the table with the other AI capabilities orgs and makes their work seem legitimate to them, which enables them both to be involved in shaping how AI develops and to have influence over these other orgs.
I think this goal isn't crazy, but I do get a sense that the overall strategy for Anthropic i...
(I realize this is straying pretty far from the intent of this post, so feel free to delete this comment)
I totally agree that a non-trivial portion of DeepMind's work (and especially my work) is in the "make legit" category, and I stand by that as a good thing to do, but putting all of it there seems pretty wild. Going off of a list I previously wrote about DeepMind work (this comment):
...We do a lot of stuff, e.g. of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:
- Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
- LLM alignment (lots of work discussed in the podcast with Geoffrey you mentioned)
- Scalable oversight (same as above)
- Mechanistic interpretability (unpublished so far)
- Externalized Reasoning Oversight (my guess is that this will be published soon) (EDIT: this paper)
- Communicating views on alignment (e.g. the post you linked, the writing that I do on this forum is in large part about communicating my views)
- Deception + inner alignment (in particular examples of goal misgeneralization)
- Understanding agency (se
I mean, I think my models here come literally from conversations with you, where I am pretty sure you have said things like (paraphrased) "basically all the work I do at Deepmind and the work of most other people I work with at Deepmind is about 'trying to demonstrate the difficulty of the problem' and 'convincing other people at Deepmind the problem is real'".
In as much as you are now claiming that is only 10%-20% of the work, that would be extremely surprising to me and I do think would really be in pretty direct contradiction with other things we have talked about.
Like, yes, of course if you want to do field-building and want to get people to think AI Alignment is real, you will also do some alignment research. But I am talking about the balance of motivations, not the total balance of work. My sense is most of the motivation for people at the Deepmind teams comes from people thinking about how to get other people at Deepmind to take AI Alignment seriously. I think that's a potentially valuable goal, but indeed it is also the kind of goal that often gets represented as someone just trying to make direct progress on the problem.
Hmm, this is surprising. Some claims I might have made that could have led to this misunderstanding, in order of plausibility:
fyi, your phrasing here is different from how I initially interpreted "make AI safety seem legit".
like there are maybe a few things someone might mean if they say "they're working on AI Alignment research"
(and of course people can be doing a mixture of the above, or a 5th option I didn't list)
I interpreted you initially as saying #4, but it sounds like you/Rohin here are talking about #3. There are versions of #3 that are secretly just #4 without much theory-of-change, but, idk, I think Rohin's stated goal here is just pretty reasonable and definitely something I want in my overall AI Alignment Field portfolio. I agree you should avoid accidentally conflating it with #1.
(i.e. this seems related to a form of research-debt, albeit focused on bridging the gap between one field and another, rather than improving intra-field research debt)
They are working on it partially because this gives Anthropic access to state-of-the-art models to do alignment research on, but I think to an even greater extent they are doing it because this gives them a seat at the table with the other AI capabilities orgs and makes their work seem legitimate to them, which enables them both to be involved in shaping how AI develops and to have influence over these other orgs.
...Am I crazy or is this discussion weirdly missing the third option of "They're doing it because they want to build a God-AI and 'beat the other orgs to the punch'"? That is completely distinct from signaling competence to other AGI orgs or getting yourself a "seat at the table" and it seems odd to categorize the majority of Anthropic's aggslr8ing as such.
It seems to me like one (often obscured) reason for the disagreement between Thomas and Habryka is that they are thinking about different groups of people when they define "the field."
To assess the % of "the field" that's doing meaningful work, we'd want to do something like [# of people doing meaningful work]/[total # of people in the field].
Who "counts" in the denominator? Should we count anyone who has received a grant from the LTFF with the word "AI safety" in it? Only the ones who have contributed object-level work? Only the ones who have contributed object-level work that passes some bar? Should we count the Anthropic capabilities folks? Just the EAs who are working there?
My guess is that Thomas was using a more narrowly defined denominator (e.g., not counting most people who got LTFF grants and went off to do PhDs without contributing object-level alignment stuff; not counting most Anthropic capabilities researchers who have never-or-minimally engaged with the AIS community), whereas Habryka was using a more broadly defined denominator.
I'm not certain about this, and even if it's true, I don't think it explains the entire effect size. But I wouldn't be surprised if roughly 10-3...
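A purely illustrative calculation of how much that denominator choice matters (these numbers are made up for the example, not estimates of the actual field): the same numerator gives a very different-looking picture depending on who counts.

$$
\frac{30 \text{ people doing meaningful work}}{150 \text{ people (narrow definition of the field)}} = 20\%
\qquad\text{vs.}\qquad
\frac{30}{600 \text{ people (broad definition)}} = 5\%
$$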
Yeah, all four of those are real things happening, and are exactly the sorts of things I think the post has in mind.
I take "make AI alignment seem legit" to refer to a bunch of actions that are optimized to push public discourse and perceptions around. Here's a list of things that come to my mind:
Each of these things seems like it has a core good thing, but according to me they've all backfired to the extent that they were optimized to avoid the thorny parts of AI x-risk, because this enables rampant goodharting. Specifically, I think the effects of avoiding the core stuff have been bad: creating weird cargo cults around alignment research, making it easier for orgs to have fake narratives about how they care about alignment, etc.
Based on my own retrospective views of how Lightcone's office went less-than-optimally, I recently gave some recommendations to someone who may be setting up another alignment research space. (Background: I've been working in the Lightcone office since shortly after it opened.) They might be of interest to people mining this post for insights on how to execute similar future spaces. Here they are, lightly edited:
I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.
Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.
I personally benefitted tremendously from the Lightcone offices, especially when I was there over the summer during SERI MATS. Being able to talk to lots of alignment researchers and other aspiring alignment researchers increased my subjective rate of alignment upskilling by >3x relative to before, when I was in an environment without other alignment people.
Thanks so much to the Lightcone team for making the office happen. I’m sad (emotionally, not making a claim here whether it was the right decision or not) to see it go, but really grateful that it existed.
"The EA and rationality communities might be incredibly net negative" is a hell of a take to be buried in a post about closing offices.
:-(
Part of the point here is that Oli, Ben and the rest of the team are still working through our thoughts/feelings on the subject, and didn't feel in a good space to write any kind of "here's Our Take™" post. i.e. the point here was not meant to do "narrative setting".
But, it seemed important to get the information about our reasoning out there. I felt it was valuable to get some version of this post shipped soon, and this was the version we all felt pretty confident about rushing out the door without angsting about exactly what to say.
(Oli may have a somewhat different frame about what happened and his motivations)
The fact that some people in EA (a huge broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective as (AFAIK) you could filter people at your discretion.
I mean, no, we were specifically trying to support the EA community; we do not get to unilaterally decide who is part of the community. People I don't personally have much respect for, but who are members of the EA community putting in the work to be considered members in good standing, definitely get to pass through. I'm not going as far as to say this was the only thing going on: I made choices about which parts of the movement seemed like they were producing good work and acting ethically, and which parts seemed pretty horrendous and to be avoided. But I would (for instance) regularly make an attempt to welcome people from an area that seemed to have poor connections in the social graph (e.g. the first EA from country X, from org Y, from area-of-work Z, etc.), even if I wasn't excited about that person or place or area, because it was part of the EA community and it seems very valuable for the community as a whole to have better interconnectedness between ...
Extremely strong upvote for Oliver's 2nd message.
Also, not as related: kudos for actually materially changing the course of your organization, something which is hard for most organizations, period.
In particular, I wonder if many people who won't read through a post about offices and logistics would notice and find compelling a standalone post with Oliver's 2nd message and Ben's "broader ecosystem" list—analogous to AGI Ruin: A List of Lethalities. I know related points have been made elsewhere, but I think 95-Theses-style lists have a certain punch.
Are there any implications for the future of LessWrong.com, the online forum? How are the morale and economic security of the people responsible for keeping this place running?
I think I might change some things but it seems very unlikely to me I will substantially reduce investment in LessWrong. Funding is scarcer post-FTX, so some things might change a bit, but I do care a lot about LessWrong continuing to get supported, and I also think it's pretty plausible I will substantially ramp up my investment into LW again.
This is going to point about 87 degrees off from the main point of the post, so I'm fine with discussing this elsewhere or in DMs or something, but I do wonder how cruxy this is:
More broadly, I think AI Alignment ideas/the EA community/the rationality community played a pretty substantial role in the founding of the three leading AGI labs (Deepmind, OpenAI, Anthropic), and man, I sure would feel better about a world where none of these would exist, though I also feel quite uncertain here. But it does sure feel like we had a quite large counterfactual effect on AI timelines.
I missed the first chunk of your conversation with Dylan at the lurkshop about this, but at the time, it sounded like you suspected "quite large" wasn't 6-48 months, but maybe more than a decade. I could have gotten the wrong impression, but I remember being confused enough that I resolved to hunt you down later to ask (which I promptly forgot to do).
I gather that this isn't the issue, but it does seem load bearing. A model that suggests alignment/EA/rationality influences sped up AGI by >10 years has some pretty heavy implications which are consistent with the other things you've mentioned. If my understanding i...
FWIW I'm very angry about what happened, and I think the speedup was around five years in expectation.
I missed the first chunk of your conversation with Dylan at the lurkshop about this, but at the time, it sounded like you suspected "quite large" wasn't 6-48 months, but maybe more than a decade.
A decade in-expectation seems quite extreme.
To be clear, I don't think AGI happening soon is particularly overdetermined, so I do think this is a thing that does actually differ quite a bit depending on details, but I do think it's very unlikely that actions that people adjacent to rationality took that seriously sped up timelines by more than a decade. I would currently give that maybe 3% probability or something.
I mean, I don't see the argument for more than that. Unless you have some argument for hardware progress stopping, my sense is that things would get cheap enough that someone is going to try the AI stuff that is happening today within a decade.
Thanks for sharing your reasoning, that was very interesting to read! I kind of agree with the worldview outlined in the quoted messages from the "Closing-Office-Reasoning" channel. Something like "unless you go to extreme lengths to cultivate integrity and your ability to reason in truth-tracking ways, you'll become a part of the incentive-gradient landscape around you, which kills all your impact."
Seems like a tough decision to have to decide whether an ecosystem has failed vs. whether it's still better than starting from scratch despite its flaws. (I could imagine that there's an instinct to just not think about it.)
Sometimes we also just get unlucky, though. (I don't think FTX was just bad luck, but e.g., with some of the ways AI stuff played out, I find it hard to tell. Of course, just because I find it hard to tell doesn't mean it's objectively hard to tell. Maybe some things really were stupid also when they happened, not just in hindsight.)
I'm curious if you think there are "good EA orgs" where you think the leadership satisfies the threshold needed to predictably be a force of good in the world (my view is yes!). If yes, do you think that this isn't necessarily enough for ...
Also see this recent podcast interview with habryka (incl. my transcript of it), which echoes some of what's written here. Unsurprisingly so, given that the Slack messages are from Jan 26th and the podcast from no later than Feb 5th.
See e.g. this section about the Rationality/AI Alignment/EA ecosystem.
As a LW veteran interested in EA I also perceive a lot of the dynamics you wrote about and they really bother me. Thank you for your hard and thoughtful work.
I greatly appreciate this post. I feel like "argh yeah it's really hard to guarantee that actions won't have huge negative consequences, and plenty of popular actions might actually be really bad, and the road to hell is paved with good intentions." With that being said, I have some comments to consider.
The offices cost $70k/month on rent [1], and around $35k/month on food and drink, and ~$5k/month on contractor time for the office. It also costs core Lightcone staff time which I'd guess at around $75k/year.
That is roughly $116k/month, or about $1.4m/year. I won...
Thank you for sharing this, I was wondering about your perspective on these topics.
I am really curious about the intended counterfactual of this move. My understanding is that the organizations that were using the office raised funds for a new office in a few weeks (from the same funding pool that funds Lightcone), so their work will continue in a similar way.
Is the main goal to have Lightcone focus more on the Rose Garden Inn? What are your plans there, do you have projects in mind for "slowing down AI progress, pivotal acts, intelligence enhancement, etc."? Anything people can help with?
Seems like a classic case of Goodharting, with lots of misaligned mesaoptimizers taking advantage.
I'm a little confused: I feel like I read this post already, but I can't find it. Was there a prior deleted version?
You did see part of it before; I posted in Open Thread a month ago with the announcement, but today Ray poked me and Oli to also publish some of the reasoning we wrote in slack.
I don't particularly like the status hierarchy and incentive landscape of the ML community, which seems quite well-optimized to cause human extinction
the incentives are indeed bad, but they look more like incompetence than like something optimized to cause extinction
Oliver's second message seems like a truly relevant consideration for our work in the alignment ecosystem. Sometimes, it really does feel like AI X-risk and related concerns created the current situation. Many of the biggest AGI advances might not have been developed counterfactually, and machine learning engineers would just be optimizing another person's clicks.
I am a big fan of "Just don't build AGI" and academic work with AI, simply because it is better at moving slowly (and thereby safely, through open discourse rather than $10 mil training runs) compared ...
I also remember someone joining the offices to collaborate on a project, who explained that in their work they were looking for "The next Eliezer Yudkowsky or Paul Christiano". When I asked what aspects of Eliezer they wanted to replicate, they said they didn't really know much about Eliezer but it was something that a colleague of theirs said a lot.
💀
I disagree with the claims by Habryka and Ben Pace that their impact on AI wasn't positive and massive, and here's why.
My disagreement largely derives from my having become way more optimistic about AI risk and AI Alignment than I used to be, which implies Habryka and Ben Pace had a much more positive impact than they thought.
Some of my reasons why I became more optimistic, such that the chance of AI doom was cut to 1-10% from a prior 80%, come down to the following:
I basically believe that deceptive alignment won't
Thank you both for writing this and sharing your thoughts on the ecosystem in general. It's always heartening for me, even just as someone who occasionally visits the Bay, to see the amount of attention and thought being put into the effects of things like this on not just the ecosystem there, but also the broader ecosystem that I mostly interact with and work in. Posts like this make me slightly more hopeful for the community's general health prospects.
A lot of people in AI Alignment I've talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.
What do you think is the mechanism behind this?
I think the biggest thing is a strong, high-stakes but still quite ambiguous status-hierarchy in the Bay Area.
I think there are lots of contributors to this, but I definitely feel a very huge sense of needing to adopt certain views, to display "good judgement", and to conform to a bunch of epistemic and moral positions in order to operate in the space. This is particularly harsh since the fall of FTX: with funding less abundant and a lot of projects more in peril, the stakes of being perceived as reasonable and competent, by a very messy and in substantial part social process, are even higher.
[...]that a lot of my work over the past few years has been bad for the world (most prominently transforming LessWrong into something that looks a lot more respectable in a way that I am worried might have shrunk the overton window of what can be discussed there by a lot, and having generally contributed to a bunch of these dynamics).
While I did not literally claim this in advance, I came close enough that I claim the right to say I Told You So.
I think weighted voting helped on average here. Indeed, of all the things that I have worked on, LessWrong is the one that feels like it has helped the most, though it's still pretty messy.
I think it should increase your trust in the voting system! Most of the rest of the internet has voting dominated by whatever new users show up whenever a thing gets popular, and this makes it extremely hard to interpret votes in different contexts. E.g. on Reddit, the most upvoted things in most subreddits often don't have that much to do with the subreddit; they are just the things that blew up to the frontpage and so got a ton of people voting on them. Weighted voting helps a lot in creating some stability in voting and making things less internet-popularity weighted (it also does some other good things, and has some additional costs, but this is I think one of the biggest ones).
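As a concrete illustration of the mechanism, here is a minimal sketch of karma-weighted vote scoring. The logarithmic weighting function and the numbers below are hypothetical, chosen only to show the general idea; this is not LessWrong's actual formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Voter:
    name: str
    karma: int  # accumulated site karma for this voter


def vote_weight(karma: int) -> int:
    """Hypothetical weighting: a brand-new account counts for 1, and weight grows
    slowly (logarithmically) with karma, so a flood of new accounts can't easily
    swamp the votes of long-time users."""
    return 1 + int(math.log10(max(karma, 1)))


def score(votes: list[tuple[Voter, int]]) -> int:
    """Sum of (direction * weight) over all votes; direction is +1 or -1."""
    return sum(direction * vote_weight(voter.karma) for voter, direction in votes)


if __name__ == "__main__":
    veteran = Voter("long-time user", karma=12_000)                  # weight 5
    newcomers = [Voter(f"new user {i}", karma=3) for i in range(4)]  # weight 1 each

    # Four drive-by upvotes from new accounts vs. one downvote from a veteran:
    votes = [(v, +1) for v in newcomers] + [(veteran, -1)]
    print(score(votes))  # 4*1 - 5 = -1: the popularity surge doesn't dominate
```

The design point is just that a sub-linear weight lets established users' votes carry more signal without letting any single account dominate outright.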
unduly exaggerating my voice is unethical
The users of the forum have collectively granted you a more powerful voice through our votes over the years. While there are ways you could use it unethically, using it as intended is a good thing.
Lightcone recently decided to close down a big project we'd been running for the last 1.5 years: an office space in Berkeley, opened in August 2021, for people working on x-risk/EA/rationalist things.
We haven't written much about why, but Ben and I had written some messages on the internal office Slack to explain some of our reasoning, which we've copy-pasted below (they are from Jan 26th). I might write a longer retrospective sometime, but these messages seemed easy to share, and it seemed good to have something I can more easily refer to publicly.
Background data
Below is a graph of weekly unique keycard-visitors to the office in 2022.
The x-axis is each week (skipping the first 3), and the y-axis is the number of unique visitors-with-keycards.
Members could bring in guests, which happened quite a bit and isn't measured in the keycard data below, so I think the total number of people who came by the offices is 30-50% higher.
The offices opened in August 2021. Including guests, parties, and all the time not shown in the graphs, I'd estimate around 200-300 more people visited, so in total around 500-600 people used the offices.
The offices cost $70k/month on rent [1], and around $35k/month on food and drink, and ~$5k/month on contractor time for the office. It also costs core Lightcone staff time which I'd guess at around $75k/year.
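For a rough sense of the total, the figures above sum to roughly the following (treating the ~$75k/year of staff time as ~$6k/month):

$$
\underbrace{\$70\text{k}}_{\text{rent}} + \underbrace{\$35\text{k}}_{\text{food/drink}} + \underbrace{\$5\text{k}}_{\text{contractors}} + \underbrace{\tfrac{\$75\text{k}}{12} \approx \$6\text{k}}_{\text{staff time}} \;\approx\; \$116\text{k/month} \;\approx\; \$1.4\text{m/year}
$$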
Ben's Announcement
Oliver's 1st message in #Closing-Office-Reasoning
(In response to a question on the Slack saying "I was hoping you could elaborate more on the idea that building the space may be net harmful.")
Oliver's 2nd message
Ben's 1st message in #Closing-Office-Reasoning
Note from Ben: I have lightly edited this because I wrote it very quickly at the time
The office rent cost about 1.5x what it needed to be. We started in a WeWork because we were prototyping whether people even wanted an office, and wanted to get started quickly (the office was up and running in 3 weeks instead of going through the slower process of signing a 12-24 month lease). Then we spent about a year figuring out where to move long-term, often wanting to preserve the flexibility of being able to move out within 2 months.