“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.
And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I'm going to give it a good shot and I'm going to have a great time along the way. I'm going to spend time with great people. I'm going to spend time with my friends. We're going to work on some really great problems. And if it doesn't work out, it doesn't work out. But hell, we're going to die with some dignity. We're going to go down swinging.”
I'm not entirely sure on the metaphysics here, but an additional possible point is that in Many Worlds or similar big universes, there is some literal payoff of "us trying hard and getting pretty close in one universe means there are more nearby universes that succeeded."
Is there a way we could get sure on the metaphysics here? It feels like it's an important thing to know if it actually happens to be true.
I like this comment, and I personally think the framing you suggest is useful. I'd like to point out that, funnily enough, in the rest of the conversation (not in the quotes, unfortunately) he says something about the dying with dignity heuristic being useful because humans are (generally) not able to reason about quantum timelines.
Edit+Disclaimer: I keep going back and forth on whether or not posting this comment was good on net. I think more people should take stabs at the alignment problem in their own idiosyncratic way, and this is a very niche criticism guarding against a hypothetical failure mode that I'm not even really sure exists. I think I'm going to settle on retracting this but leaving up because it's fundamentally criticizing someone who is doing good that I'm not doing and I don't like doing that. If you really want to read this you can figure out how to remove HTML strikethroughs with inspect element.
I know saying "don't let this bother you" doesn't actually not let something bother you, but please don't let this dissuade you from possibly making earnest attempts on the alignment solution.
Most ambitious people start their planning by seeing how best they can apply their tools to optimize their status. That they might be arbitrarily famous and rich and powerful, at least among some sufficiently large in-group, is the most important deciding factor in what they do.
On top of that very honed optimization process they have these secondary constraints, which are mostly a function of their rationality and ethics, and those constraints say that their mind has to be in this sufficiently deep "I am a good person" well, which their better selves cannot or will not argue them out of. But that the ambitious human's plan is one that moves their status upwards comes before all other considerations. Adopting a plan where they obviously remain a genuinely "average" player in the status game, or god forbid below average, means condemning them to a life of despair; it's intolerable. This is the origin story of "earning to give", the general form of which is "pursue status first and then maybe later somehow use it to do good as a tertiary objective".
Relying on the ethics-oracle to be effective enough to steer an arbitrarily ambitious human is basically playing a losing game; the architecture is working against you. It's like trying to prevent an AGI from turning the world into paperclips by making sure it has to win a debate with a "good enough" ethics professor first. All that happens in the thousand foot view is that there's some back and forth and now the ethics professor full-throatedly believes in the virtue of a paperclipped earth.
When this goes right, it's either because the ambitious human's rationality and sense of honor are so far ahead of their IQ that they stop raising their status (think: willingly chooses to put their hand on the stove), or the human miraculously possesses some particular set of useful skills such that their status-maximizing plan becomes correlated with doing good. In particular it's pretty much impossible to become rich and famous and well connected and liked by your peers without, on some level, spending most of your available resources specifically on a status race, and competing with the other people who are spending most of their available resources on the status race. Doing so "without trying" is as plausible as getting the world record in deadlifting "without trying".
And sometimes it does in fact go right! I'm not saying We The Commoners should punish the status rabbits, or, god forbid, demand motive ambiguity, when they end up doing something good like inventing PageRank or starting an obviously positive existential risk initiative for us or whatever. I am glad Eliezer wrote the sequences, whatever his motives.
But Connor, here's where the conversation gets... difficult. From my 1000 ft. perspective, it seems to me like literally everything you've done so far is compatible with this story:
There's probably a plausible case that EleutherAI was net-positive and I really hesitate on saying things like "you probably shouldn't have expected it would be" even though that's my inside view. I'm definitely not saying you, Connor Leahy, should specifically be the one guy expected to optimize for the social good instead of your own self interest. But it does concern me, that you have this reasoning process where you seem to always land on doing the thing that raises your coinage, and now you're starting an alignment org where this mode of operation is typically incompatible with the goal of not getting us all killed.
tl-dr: people change their minds, reasons why things happen are complex, we should adopt a forgiving mindset/align AI and long-term impact is hard to measure. At the bottom I try to put numbers on EleutherAI's impact and find it was plausibly net positive.
I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.
The situation is often mixed for a lot of people, and it evolves over time. The culture we need to have on here to solve AI existential risk needs to be more forgiving. Imagine there's an ML professor who has been publishing papers advancing the state of the art for 20 years who suddenly goes "Oh, actually alignment seems important, I changed my mind". Would you write a LW post condemning them and another lengthy comment about their status-seeking behavior in trying to publish papers just to become a better professor?
I recently talked to an OpenAI employee who met Connor something like three years ago, when the whole "reproducing GPT-2" thing came about. And he mostly remembered things like the model not having been benchmarked carefully enough. Sure, it did not perform nearly as well on a lot of metrics, though that's kind of missing the point of how this actually happened. As Connor explains, he did not know this would go anywhere, and spent like 2 weeks working on it, without lots of DL experience. He ended up being convinced by some MIRI people not to release it, since this would be establishing a "bad precedent".
I like to think that people can start with a wrong model of what is good and then update in the right direction. Yes, starting yet another "open-sourcing GPT-3" endeavor the next year is not evidence of having completely updated towards "let's minimize the risk of advancing capabilities research at all cost", though I do think that some fraction of people at EleutherAI truly care about alignment and just did not think that the marginal impact of "GPT-Neo/-J accelerating AI timelines" justified not publishing them at all.
My model of what happened in the EleutherAI story is mostly one of "when all you have is a hammer everything looks like a nail". Like, you've reproduced GPT-2 and you have access to lots of compute, why not try out GPT-3? And that's fine. Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"? Sure, that would have minimized the risk of accelerating timelines. Though when trying to put numbers on it below I find that it's not just "stop something clearly net negative", it's much more nuanced than that.
And after talking to one of the guys who worked on GPT-J for hours, talking to Connor for 3h, and then having to replay what he said multiple times while editing the video/audio etc., I kind of have a clearer sense of where they're coming from. I think a more productive way of making progress in the future is to look at what the positives and negatives were, and put numbers on what was plausibly net good and plausibly net bad, so we can focus on doing the good things in the future and maximize EV (not just minimize the risk of negatives!).
To be clear, I started the interview with a lot of questions about the impact of EleutherAI, and right now I have a lot more positive or mixed evidence for why it was not "certainly a net negative" (not saying it was certainly net positive). Here is my estimate of the impact of EleutherAI, where I try to give my 80% likelihood interval for the impact on aligning AI, and where the unit is "-1" for the negative impact of publishing the GPT-3 paper. E.g. (-2, -1) means: "an 80% chance that the impact was between 2x GPT-3 papers and 1x GPT-3 paper".
Mostly Negative
-- Publishing the Pile: (-0.4, -0.1) (AI labs, including top ones, use the Pile to train their models)
-- Making ML researchers more interested in scaling: (-0.1, -0.025) (GPT-3 spread the scaling meme, not EleutherAI)
-- The potential harm that might arise from the next models that might be open-sourced in the future using the current infrastructure: (-1, -0.1) (it does seem that they're open to open-sourcing more stuff, although plausibly more careful)
Mixed
-- Publishing GPT-J: (-0.4, 0.2) (easier to finetune than GPT-Neo, some people use it, though admittedly it was not SoTA when it was released. Top AI labs had supposedly better models. Interpretability / Alignment people, like at Redwood, use GPT-J / GPT-Neo models to interpret LLMs)
Mostly Positive
-- Making ML researchers more interested in alignment: (0.2, 1) (cf. the part when Connor mentions ML professors moving to alignment somewhat because of Eleuther)
-- Four of the five core people of EleutherAI changing their career to work on alignment, some of them setting up Conjecture, with tacit knowledge of how these large models work: (0.25, 1)
-- Making alignment people more interested in prosaic alignment: (0.1, 0.5)
-- Creating a space with a strong rationalist and ML culture where people can talk about scaling and where alignment is high-status and alignment people can talk about what they care about in real-time + scaling / ML people can learn about alignment: (0.35, 0.8)
Summing these up (as if you could just add confidence intervals; I know this is not how probability works), I get an 80% chance of the impact being in (-1, 3.275), so plausibly net good.
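As a quick check of the arithmetic, the interval endpoints listed above can be added up directly. This is only an illustrative sketch: `ESTIMATES` and `sum_intervals` are names made up here, the labels are shortened, and adding endpoints is not a valid way to combine 80% intervals (as the comment itself acknowledges). Exact fractions are used so that floating-point rounding does not blur the totals.

```python
from fractions import Fraction

# (shortened label, low, high), copied from the list in the comment.
# Unit: -1 is the negative impact of publishing the GPT-3 paper.
ESTIMATES = [
    ("Publishing the Pile", "-0.4", "-0.1"),
    ("ML researchers more interested in scaling", "-0.1", "-0.025"),
    ("Potential harm from future open-sourced models", "-1", "-0.1"),
    ("Publishing GPT-J", "-0.4", "0.2"),
    ("ML researchers more interested in alignment", "0.2", "1"),
    ("Core people moving to alignment work", "0.25", "1"),
    ("Alignment people more interested in prosaic alignment", "0.1", "0.5"),
    ("Space where alignment is high-status", "0.35", "0.8"),
]


def sum_intervals(estimates):
    """Naively add the lower endpoints and the upper endpoints."""
    low = sum(Fraction(lo) for _, lo, _ in estimates)
    high = sum(Fraction(hi) for _, _, hi in estimates)
    return low, high


low, high = sum_intervals(ESTIMATES)
print(float(low), float(high))  # -1.0 3.275
```

The totals match the (-1, 3.275) range quoted in the comment, which confirms the figures are a sum of endpoints rather than an average.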
Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"?
I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.
I funnily enough ended up retracting the comment around 9 minutes before you posted yours, triggered by this thread and the concerns you outlined about this sort of psychologizing being unproductive. I basically agree with your response.
I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.
Two comments:
First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.
I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.
I think it's also a misconception that it's a unilateralist intervention. Like, they've talked to other people in the community before starting it, it was not a secret.
First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.
Then I'd argue the dichotomy is vacuously true, i.e. it does not generally pertain to humans. Humans are the result of human evolution. It's likely that having a brain that (unconsciously) optimizes for status/power has been very adaptive.
Regarding the rest of your comment, this thread seems relevant.
I don't know exactly what goes into the decision between for-profit vs nonprofit, or whether Conjecture's for-profit status was the right decision, but I do want to suggest that it's not as simple as "for-profit means I plan to make money, nonprofit means I plan to benefit the world".
I used to work at a nonprofit in the military-industrial complex in the USA; there was almost no day-to-day difference between what we were doing versus what (certain units within) for-profits like Raytheon were doing. Our CEO still had a big salary, we still were under pressure to maximize revenues and minimize costs, we competed head-to-head for many of the same customers, etc.
If there’s a for-profit that has only a small set of investors/shareholders, and none of them are pressuring the firm to have a present or future profit (as I assume is the case for Conjecture), then I think there isn't really a huge philosophical difference between that versus a nonprofit; I think it just amounts to various tax and regulatory advantages and disadvantages that trade off against each other. Someone can correct me if I'm wrong.
I think this comment is getting enough vote & discussion heat for me to feel the merit in clarifying with the following statements:
I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive. This kind of "Kremlinology of the heart" is toxic and demoralizing. It's why I never ever bother to do anything motivated by altruism: because I know when I start trying to do the right thing, I'll get attacked by people who think they know what's in my heart. When I openly act in selfish self-interest, nobody has anything to say about it, but any time I do selfless things, people start questioning my motives; it's clear what I'm incentivized to do. If you really want people to do good things, don't play status games like this. Incentivize the behavior you want.
I feel unsure about the merits of this for other contexts (because it can indeed create a toxic atmosphere), but I think there are specific contexts where scrutinizing someone's decision-making algorithm seems particularly important:
Heading an alignment organization with strong information security where you have enough control so that it's unusual compared to other organizations fulfils both criteria.
So, I'd say that not discussing the topic in contexts similar to this one would be a mistake.
I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive.
Generally, I think it would be net-negative to discourage such open discussions about unilateral, high-risk interventions—within the EA/AIS communities—that involve conflicts of interest. Especially, for example, unilateral interventions to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities.
You know what, I've retracted the comment because frankly you're probably right. Even if what I said is literally true, attacking the decision-making architecture of Connor Leahy when he's basically doing something good is not two of (true, kind, necessary). It makes people sad, it's the kind of asymmetric justice thing I hate, and I don't want to normalize it. Even when it comes with disclaimers, or when you say "I'm just attacking you just cuz bro don't take it personal."
Most VC-types are easier to get a hold of than you think. They're sort of in the business of being easy to get a hold of by smart weirdos. If you think you have something to say to him that might change his mind, there's a good shot he'll read your cold email.
Just to state a personal opinion, I think if it makes you work harder on alignment, I’m fine with that being your subconscious motivation structure. There are places where it diverges, and this sort of comment can be good in that it highlights to such people that any detrimental status seeking will be noticed and punished. But if we start scaling down how much credit people should get based on purity of subconscious heart, we’re all going to die.
But if we start scaling down how much credit people should get based on purity of subconscious heart, we’re all going to die.
That's not how I interpreted lc's comment. I think lc means that people – and maybe especially "ambitious" people (i.e., people with some grandiose traits who enjoy power/influence) – are at risk of going astray in their rationality when choosing/updating their path to impact, as they're tempted to pick paths that fit their strengths and lead to recognition. He's saying "pay close attention to whether the described path to impact is indeed positive."
For instance, Connor seems gifted at ML capabilities work and willing to take action based on inner conviction. Is he in the unfortunate world where the best path to impact says "don't reap any of the benefits of your ML talents" or in the fortunate one where it says "making money with ML is step one of a sound plan?"
Everyone faces this sort of tradeoff, but since you sometimes see people believe things like "this may not be the most impactful thing I could possibly do, but it's what suits my strengths," and Connor doesn't seem to have beliefs like that, there are specific reasons to pay close attention. Of course, the same goes for carefully watching other people who claim that they know how to have a lot of impact and it happens to be something that really plays to their strengths. (I think we definitely need some people who act ambitiously on some specific vision that plays to their strengths!)
If you have to solve an actually hard problem in the actual real world, in actual physics, for real, an actual problem, that is actually hard, you can't afford to throw your epistemics out the door because you feel bad.
But I like to believe that there was a positive magnetic contagion that happened there.
These statements seem in tension, except insofar as I don't take one or the other literally.
In their announcement post they mention:
Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be:
- Locating and editing factual knowledge in a transformer language model.
- Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels to neurons or other internal circuits.
- Studying the high-level algorithms that models use to perform, e.g., in-context learning or prompt programming.
"One of the important parts of my threat model is that I think 99% of the damage from GPT-3 was done the moment the paper was published. And, as they say about the nuclear bomb, the only secret was that it was possible. And I think there's a bit of naivety that sometimes goes into these arguments, where people are, 'Well, EleutherAI accelerated things, they drew attention to the meme'. And I think there's a lot of hindsight bias there, in that people don't realize how everyone knew about this, except the alignment community. Everyone at OpenAI, Google Brain and DeepMind. People knew about this, and they figured it out fucking fast."
I also agree, in that it showed people what was possible, although my timeline for the AI-PONR is 2016-2022: from the time of Go being crushed by AI, which basically proved that we had managed to get intuition into AI, to the Chinchilla scaling paper, which gave a clear path to human-level and superhuman AI. It also threw out the old scaling laws. I'd also add Gato to the list. Despite the overhype, DeepMind plans to scale it in the next few years, and it's eerily close to solving the software for robots.
It seems to me that you're commenting in bad faith here. Connor repeatedly stressed in the podcast that Conjecture would not do capabilities research and that they would not have had plans for developing products had they not been funding constrained.
You make pretty big accusations in the parent comment too, none of them supported by an iota of evidence beyond an out-of-context quote from the podcast that you picked.
Just FYI I deleted that comment before you made the reply, which is why your comment is in some sort of Twilight Zone. I also removed the quote because it does have other interpretations, though I prefer mine.
I talked to Connor Leahy about Yudkowsky's antimemes in Death with Dignity, common misconceptions about EleutherAI and his new AI Alignment company Conjecture.
Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each of these quotes, you can find an accompanying transcript, organized in 74 sub-sections.
Understanding Eliezer Yudkowsky
Eliezer Has Been Conveying Antimemes
Why the Dying with Dignity Heuristic is Useful
EleutherAI
Why training GPT-3 Size Models made sense
EleutherAI Spread Alignment Memes in the ML World
On the Policy and Impact of EleutherAI's Open Source
Conjecture
How Conjecture Started
Where Conjecture Fits in the AI Alignment Landscape
Why Conjecture is Doing Interpretability Research
Conjecture Approach To Solving Alignment
On being non-disclosure by default
On Building Products as a For-Profit
Scaling The Alignment Field