A number of people seem to have departed OpenAI at around the same time as you. Is there a particular reason for that which you can share? Do you still think that people interested in alignment research should apply to work at OpenAI?
A number of people seem to have departed OpenAI at around the same time as you. Is there a particular reason for that which you can share?
I'm pretty hesitant to speak on behalf of other people who left. It's definitely not a complete coincidence that I left around the same time as other people (though there were multiple important coincidences), and I can talk about my own motivations:
My own departure was driven largely by my desire to work on more conceptual/theoretical issues in alignment. I've generally expected to transition back to this work eventually and I think there are a variety of reasons that OpenAI isn't the best place for it. (I would likely have moved earlier if Geoffrey Irving's departure hadn't left me managing the alignment team.)
Do you still think that people interested in alignment research should apply to work at OpenAI?
I think alignment goes a lot better if there are strong teams trying to apply best practices to align state-of-the-art models, who have been learning about what it actually takes to do that in practice and building social capital. Basically that seems good because (i) I think there's a reasonable chance that we fail not because alignment is super-hard but because we just don't do a very good job during crunch time, and I think such teams are the best intervention for doing a better job, and (ii) even if alignment is very hard and we need big new ideas, I think that such teams will be important for empirically characterizing and ultimately adopting those big new ideas. It's also an unusually unambiguously good thing.
I spent a lot of time at OpenAI largely because I wanted to help get that kind of alignment effort going. For some color see this post; that team still exists (under Jan Leike) and there are now some other similar efforts at the organization.
I'm not as in the loop as I was a few months ago and so you might want to defer to folks at OpenAI, but from the outside I still tentatively feel ...
What are the most important ideas floating around in alignment research that don't yet have a public write-up? (Or, even better, that have a public write-up but could do with a good one?)
I have a big gap between "stuff I've written up" and "stuff that I'd like to write up." Some particular ideas that come to mind: how epistemic competitiveness seems really important for alignment; how I think about questions like "aligned with whom" and why I think it's good to try to decouple alignment techniques from decisions about values / preference aggregation (this position is surprisingly controversial); updated views on the basic dichotomy in Two Kinds of Generalization and the current best hopes for avoiding the bad kind.
I think that there's a cluster of really important questions about what we can verify, how "alien" the knowledge of ML systems will be, and how realistic it's going to be to take a kind of ad hoc approach to alignment. In my experience, people with a more experimental bent tend to be more optimistic about those questions, and to have a bunch of intuitions about them that do kind of hang together (and are often approximately shared across people). This comes with some more color on the current alignment plan / what's likely to happen in practice as people try to solve the problem on their feet. I don't think that's really been written up well but it s...
I wonder how valuable you find some of the more math/theory focused research directions in AI safety. I.e., how much less impactful do you find them, compared to your favorite directions? In particular:
I'd also be interested in suggestions for other impactful research directions/areas that are more theoretical and less ML-focused (expanding on adamShimi's question, I wonder which part of mathematics and statistics you expect to be particularly useful).
I'm generally bad at communicating about this kind of thing, and it seems like a kind of sensitive topic to share half-baked thoughts on. In this AMA all of my thoughts are half-baked, and in some cases here I'm commenting on work that I'm not that familiar with. All that said I'm still going to answer but please read with a grain of salt and don't take it too seriously.
Vanessa Kosoy's learning-theoretic agenda, e.g., the recent sequence on infra-Bayesianism, or her work on traps in RL. Michael Cohen's research, e.g. the paper on imitation learning seems to go into a similar direction.
I like working on well-posed problems, and proving theorems about well-posed problems is particularly great.
I don't currently expect to be able to apply those kinds of algorithms directly to alignment for various reasons (e.g. no source of adequate reward function that doesn't go through epistemic competitiveness which would also solve other aspects of the problem, not practical to get exact imitation), so I'm mostly optimistic about learning something in the course of solving those problems that turns out to be helpful. I think that's plausible because these formal problems do engage some of the dif...
Pre-hindsight: 100 years from now, it is clear that your research has been net bad for the long-term future. What happened?
As an aside, I think that the possibility of "work doesn't matter" is typically way more important than "work was net bad," at least once you are making a serious effort to do something good rather than bad for the world (I agree that for the "average" project in the world the negative impacts are actually pretty large relative to the positive impacts).
EAs/rationalists often focus on the chance of a big downside clawing back value. I think that makes sense to think seriously about, and sometimes it's a big deal, but most of the time the quantitative estimates just don't seem to add up at all to me and I think people are making a huge quantitative error. I'm not sure exactly where we disagree, I think a lot of it is just that I'm way more skeptical about the ability to incidentally change the world a huge amount---I think that changing the world a lot usually just takes quite a bit of effort.
I guess in some sense I agree that the downside is big for normal butterfly-effect-y reasons (probably 50% of well-intentioned actions make the world worse ex post), so it's also possible that I'm just answering this question in a slightly different way.
My big caveat is that I think the numbers ...
"Even if actively trying to push the field forward full-time I'd be a small part of that effort"
I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.
I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community.
Some plausible and non-exhaustive options, in roughly descending order of plausibility:
You've written multiple outer alignment failure stories. However, you've also commented that these aren't your best predictions. If you condition on humanity going extinct because of AI, why did it happen?
I think my best guess is kind of like this story, but:
Unfortunately (fortunately?) I don't feel like I have access to any secret truths. Most idiosyncratic things I believe are pretty tentative, and I hang out with a lot of folks who are pretty open to the kinds of weird ideas that might have ended up feeling like Paul-specific secret truths if I hung with a more normal crowd.
It feels like my biggest disagreement with people around me is something like: to what extent is it likely to be possible to develop an algorithm that really looks on paper like it should just work for aligning powerful ML systems. I'm at like 50-50 and I think that the consensus estimate of people in my community is more like "Uh, sure doesn't sound like that's going to happen, but we're still excited for you to try."
Do you have any advice for junior alignment researchers? In particular, what do you think are the skills and traits that make someone an excellent alignment researcher? And what do you think someone can do early in a research career to be more likely to become an excellent alignment researcher?
Some things that seem good:
I personally feel like I got a lot of benefit out of doing some research in adjacent areas, but I'd guess that mostly it's better to focus on what you actually want to achieve and just be a ...
What are the highest priority things (by your lights) in Alignment that nobody is currently seriously working on?
It's not clear how to slice the space up into pieces so that you can talk about "is someone working on this piece?" (and the answer depends a lot on that slicing). Here are two areas in robustness that feel kind of empty for my preferred way of slicing up the problem (though for a different slicing they could be reasonably crowded). These are also necessarily areas where I'm not doing any work, so I'm really out on a limb here.
I think there should be more theoretical work on neural net verification / relaxing adversarial training. I should probably update from this to think that it's more of a dead end (and indeed practical verification work does seem to have run into a lot of trouble), but to me it looks like there's got to be more you can say at least to show that various possible approaches are dead ends. I think a big problem is that you really need to keep the application in mind in order to actually know the rules of the game. (That is, we have a predicate A, say implemented as a neural network, and we want to learn a function f such that for all x we have A(x, f(x)), but the problem is only supposed to be possible because in some sense the predicate A is "easy" to satisfy...
Do you know what sorts of people you're looking to hire? How much do you expect ARC to grow over the coming years, and what will the employees be doing? I can imagine it being a fairly small group of like 3 researchers and a few understudies; I can also imagine it growing to 30 people like MIRI. Which one of these is it closer to?
I'd like to hire a few people (maybe 2 researchers median?) in 2021. I think my default "things are going pretty well" story involves doubling something like every 1-2 years for a while. Where that caps out / slows down a lot depends on how the field shakes out and how broad our activities are. I would be surprised if I wanted to stop growing at <10 people just based on the stuff I really know I want to do.
The very first hires will probably be people who want to work on the kind of theory I do, since right now that's what I'm feeling most excited about and really want to set up a team working on. I don't really know where that will end up going.
Once I get that going I'm not sure whether the next step will be growing it further or branching out into other things, and it will probably depend on how the theory work goes. I could also imagine doing enough theory on my own to change my view about how promising it is and make initial hires in another area instead.
I'm not interested in the strongest argument from your perspective (i.e. the steelman), but I am interested how much you think you can pass the ITT for Eliezer's perspective on the alignment problem — what shape the problem is, why it's hard, and how to make progress. Can you give a sense of the parts of his ITT you think you've got?
I think I could do pretty well (it's plausible to me that I'm the favorite in any head-to-head match with someone who isn't a current MIRI employee? probably not but I'm at least close). There are definitely some places I still get surprised and don't expect to do that well, e.g. I was recently surprised by one of Eliezer's positions regarding the relative difficulty of some kinds of reasoning tasks for near-future language models (and I expect there are similar surprises in domains that are less close to near-term predictions). I don't really know how to split it into parts for the purpose of saying what I've got or not.
I don't have an easy way of slicing my work up / think that it depends on how you slice it. Broadly I think the two candidates are (i) making RL from human feedback more practical and getting people excited about it at OpenAI, (ii) the theoretical sequence from approval-directed agents and informed oversight to iterated amplification to getting a clear picture of the limits of iterated amplification and setting out on my current research project. Some steps of that were really hard for me at the time though basically all of them now feel obvious.
My favorite blog post was probably approval-directed agents, though this is very much based on judging by the standards of how-confused-Paul-started-out. I think that it set me on a way better direction for thinking about AI safety (and I think it also helped a lot of people in a similar way). Ultimately it's clear that I didn't really understand where the difficulties were, and I've learned a lot in the last 6 years, but I'm still proud of it.
How many ideas of the same size as "maybe we could use inverse reinforcement learning to learn human values" are we away from knowing how to knowably and reliably build human-level AI technology that wouldn't cause something comparably bad to human extinction?
A lot of this is going to come down to estimates of the denominator.
(I mostly just think that you might as well just ask people "Is this good?" rather than trying to use a more sophisticated form of IRL---in particular I don't think that realistic versions of IRL will successfully address the cases where people err in answering the "is it good?" question, that directly asking is more straightforward in many important ways, and that we should mostly just try to directly empower people to give better answers to such questions.)
Anyway, with that caveat and kind of using the version of your idea that I feel most enthusiastic about (and construing it quite broadly), I have a significant probability on 0, maybe a median somewhere in 10-20, significant probability at very high levels.
In this post I argued that an AI-induced point of no return would probably happen before world GDP starts to noticeably accelerate. You gave me some good pushback about the historical precedent I cited, but what is your overall view? If you can spare the time, what is your credence in each of the following PONR-before-GDP-acceleration scenarios, and why?
1. Fast takeoff
2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate the entire world economy, and/or have less deployment lag. (Maybe it takes years to build the relevant products and industries to accelerate the economy, but only months to wage a successful propaganda campaign to get people to stop listening to the AI safety community)
3. We get an "expensive AI takeoff" in which AI capabilities improve enough to cross some threshold of dangerousness, but this improvement happens in a very compute-intensive way that makes it uneconomical to automate a significant part of the economy until the threshold has been crossed.
4. Vulnerable world: Thanks to AI and other advances, a large number of human actors get the ability to make WMD's.
5. Persuasion/p...
I don't know if we ever cleared up ambiguity about the concept of PONR. It seems like it depends critically on who is returning, i.e. what is the counterfactual we are considering when asking if we "could" return. If we don't do any magical intervention, then it seems like the PONR could be well before AI since the conclusion was always inevitable. If we do a maximally magical intervention, of creating unprecedented political will, then I think it's most likely that we'd see 100%+ annual growth (even of say energy capture) before PONR. I don't think there are reasonable definitions of PONR where it's very likely to occur before significant economic acceleration.
I don't think I consider most of the scenarios you list to be necessarily-PONR-before-GDP-acceleration scenarios, though many of them could permit PONR-before-GDP if AI was broadly deployed before it started adding significant economic value.
All of these probabilities are obviously pretty unreliable and made up on the spot:
1. Fast takeoff
Defined as a 1-year doubling starting before a 4-year doubling finishes: maybe 25%?
...2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate
Not really.
I expect that many humans will continue to participate in a process of collectively clarifying what we want and how to govern the universe. I wouldn't be surprised if that involves a lot of life-kind-of-like-normal that gradually improves in a cautious way we endorse rather than some kind of table-flip (e.g. I would honestly not be surprised if post-singularity we still end up raising another generation because there's no other form of "delegation" that we feel more confident about). And of course in such a world I expect to just continue to spend a lot of time thinking, again probably under conditions that are designed to be gradually improving rather than abruptly changing. The main weird thing is that this process will now be almost completely decoupled from productive economic activity.
I think it's hard to talk about "your life" and identity is likely to be fuzzy over the long term. I don't think that most of the richness and value in the world will come from creatures who feel like "us" (and I think our selfish desires are mostly relatively satiable). That said, I do also expect that basically all of the existing humans will have a future that they feel excited abou...
What is your theory of change for the Alignment Research Center? That is, what are the concrete pathways by which you expect the work done there to systematically lead to a better future?
For the initial projects, the plan is to find algorithmic ideas (or ideally a whole algorithm) that works well in practice, can be adopted by labs today, and would put us in a way better position with respect to future alignment challenges. If we succeed in that project, then I'm reasonably optimistic about being able to demonstrate the value of our ideas and get them adopted in practice (by a combination of describing them publicly, talking with people at labs, advising people who are trying to pressure labs to take alignment seriously about what their asks should be, and consulting for labs to help implement ideas). Even if adoption or demonstrating desirability turns out to be hard, I think that the alignment community would be in a much better place if we had a proposal that we all felt good about that we were advocating for (since we'd then have a better shot at doing so, and labs that were serious about alignment would be able to figure out what to do).
Beyond that, I'm also excited about offering concrete and well-justified advice (either about what algorithms to use or about alignment-relevant deployment decisions) that can help labs who care about alignment, or can be taken as a clear indicator of best practices and so be adopted by labs who want to present as socially responsible (whether to please employees, funders, civil society, or competitors).
But I'm mostly thinking about the impact of initial activities, and for that I feel like the theory of change is relatively concrete/straightforward.
If you could magically move most of the US rationality and x-risk and EA community to a city in the US that isn't the Bay, and you had to pick somewhere, where would you move them to?
If I'm allowed to think about it first then I'd do that. If I'm not, then I'd regret never having thought about it, probably Seattle would be my best guess.
What's the most important thing that AI alignment researchers have learned in the past 10 years? Also, that question but excluding things you came up with.
"Thing" is tricky. Maybe something like the set of intuitions and arguments we have around learned optimizers, i.e. the basic argument that ML will likely produce a system that is "trying" to do something, and that it can end up performing well on the training distribution regardless of what it is "trying" to do (and this is easier the more capable and knowledgeable it is). I don't think we really know much about what's going on here, but I do think it's an important failure to be aware of and at least folks are looking for it now. So I do think that if it happens we're likely to notice it earlier than we would if taking a purely experimentally-driven approach and it's possible that at the extreme you would just totally miss the phenomenon. (This may not be fair to put in the last 10 years, but thinking about it sure seemed like a mess >10 years ago.)
(I may be overlooking something such that I really regret that answer in 5 minutes but so it goes.)
According to your internal model of the problem of AI safety, what are the main axes of disagreement researchers have?
The three that first come to mind:
How many ideas of the same size as "maybe a piecewise linear non-linearity would work better than a sigmoid for not having vanishing gradients" are we away from knowing how to build human-level AI technology?
I think it's >50% chance that ideas like ReLUs or soft attention are best thought of as multiplicative improvements on top of hardware progress (as are many other ideas like auxiliary objectives, objectives that better capture relevant tasks, infrastructure for training more efficiently, dense datasets, etc.), because the basic approach of "optimize for a task that requires cognitive competence" will eventually yield human-level competence. In that sense I think the answer is probably 0.
Maybe my median number of OOMs left before human-level intelligence, including both hardware and software progress, is 10 (pretty made-up). Of that I'd guess around half will come from hardware, so call it 5 OOMs of software progress. Don't know how big that is relative to ReLUs, maybe 5-10x? (But hard to define the counterfactual w.r.t. activation functions.)
(I think that may imply much shorter timelines than my normal view. That's mostly from thoughtlessness in this answer which was quickly composed and didn't take into account many sources of evidence, some is from legit correlations not taken into account here, some is maybe legitimate signal from an alternative estimation approach, not sure.)
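To spell out the back-of-the-envelope implied by those (made-up) numbers, with roughly 5 OOMs of software progress remaining and each ReLU-sized idea worth a 5-10x multiplier:

$$\#\text{ideas} \approx \frac{5\ \text{OOMs}}{\log_{10}(5\text{ to }10)\ \text{OOMs per idea}} \approx 5\text{ to }7.$$

All the inputs here are the rough guesses above, so the output inherits the same imprecision.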
I'm pretty comfortable working with strong axioms. But in terms of "would actually blow my mind if it turned out not to be consistent," I guess alpha-inaccessible cardinals for any concrete alpha? Beyond that I don't really know enough set theory to have my mind blown.
Favorite: Irit Dinur's PCP for constraint satisfaction. What a proof system.
If you want to be more pure, and consider the mathematical objects that are found rather than built, maybe the monster group? (As a layperson I can't appreciate the full extent of what's going on, and like most people I only really know about it second-hand, but its existence seems like a crazy and beautiful fact about the world.)
Least favorite: I don't know, maybe Chaitin's constant?
What was your biggest update about the world from living through the coronavirus pandemic?
Follow-up: does it change any of your feelings about how civilization will handle AGI?
I found our COVID response pretty "par for the course" in terms of how well we handle novel challenges. That was a significant negative update for me because I had a moderate probability on us collectively pulling out some more exceptional adaptiveness/competence when an issue was imposing massive economic costs and had a bunch of people's attention on it. I now have somewhat more probability on AI dooms that play out slowly where everyone is watching and yelling loudly about it but it's just really tough to do something that really improves the situation (and correspondingly more total probability on doom). I haven't really sat down and processed this update or reflected on exactly how big it should be.
What are the best examples of progress in AI Safety research that we think have actually reduced x-risk?
(Instead of operationalizing this explicitly, I'll note that the motivation is to understand whether doing more work toward technical AI Safety research is directly beneficial as opposed to mostly irrelevant or having second-order effects.)
You seem in the unusual position of having done excellent conceptual alignment work (eg with IDA), and excellent applied alignment work at OpenAI, which I'd expect to be pretty different skillsets. How did you end up doing both? And how useful have you found ML experience for doing good conceptual work, and vice versa?
Aw thanks :) I mostly trained as a theorist through undergrad, then when I started grad school I spent some time learning about ML and decided to do applied work at OpenAI. I feel like the methodologies are quite different but the underlying skills aren't that different. Maybe the biggest deltas are that ML involves much more management of attention and jumping between things in order to be effective in practice, while theory is a bit more loaded on focusing on one line of reasoning for a long time and having some clever idea. But while those are important skills, I don't think they are the main things that you improve at by working in either area, and they aren't really core.
I feel like in general there is a lot of transfer between doing well in different research areas, though unsurprisingly it's less than 100% and I think I would be better at either domain if I'd just focused on it more. The main exception is that I feel like I'm a lot better at grounding out theory that is about ML, since I've had more experience and have more of a sense for what kinds of assumptions are reasonable in practice. And on the flip side I do think theory is similar to a lot of algorithm design/analysis questions that come up in ML (frankly it doesn't seem like a central skill but I think there are big logistical benefits from being able to do the whole pipeline as one person).
How many hours per week should the average AI alignment researcher spend on improving their rationality? How should they spend those hours?
I probably wouldn't set aside hours for improving rationality (/ am not exactly sure what it would entail). Seems generally good to go out of your way to do things right, to reflect on lessons learned from the things you did, to be willing to do (and slightly overinvest in) things that are currently hard in order to get better, and so on. Maybe I'd say that like 5-10% of time should be explicitly set aside for activities that just don't really move you forward (like post-mortems or reflecting on how things are going in a way that's clearly not going to pay itself off for this project) and a further 10-20% on doing things in ways that aren't the very optimal way right now but useful for getting better at doing them in the future (e.g. using unfamiliar tools, getting more advice from people than would make sense if the world ended next week, being more methodical about how you approach problems).
I guess the other aspect of this is separating some kind of general improvement from more domain specific improvement (i.e. are the numbers above about improving rationality or just getting better at doing stuff?). I think stuff that feels vaguely like "rationality" in the sense of being abou...
What are the main ways you've become stronger and smarter over the past 5 years? This isn't a question about new object-level beliefs so much as ways-of-thinking or approaches to the world that have changed for you.
Did you get much from reading the sequences? What was one of the things you found most interesting or valuable personally in them?
I enjoyed Leave a Line of Retreat. It's a very concrete and simple procedure that I actually still use pretty often, and I've benefited a lot just from knowing about it. Other than that I think I found a bunch of the posts interesting and entertaining. (Looking back now, the post is a bit bombastic, as I suspect all the sequences are, but I don't really mind.)
Copying my question from your post about your new research center (because I'm really interested in the answer): which part (if any) of theoretical computer science do you expect to be particularly useful for alignment?
Going to start now. I vaguely hope to write something for all of the questions that have been asked so far but we'll see (80 questions is quite a few).
I think that by count across all the possible worlds (and the impossible ones) the vast majority of observers like us are in simulations. And probably by count in our universe the vast majority of observers like us are in simulations, except that everything is infinite and so counting observers is pretty meaningless (which just helps to see that it was never the thing you should care about).
I'm not sure "we're in a simulation" is the kind of thing it's meaningful to talk about credences in, but it's definitely coherent to talk about betting odds (i.e. how much would I be willing to have copies of me in a simulation sacrifice for copies of me outside of a simulation to benefit?). You don't want to talk about those using $ since $ are obviously radically more valuable outside of the simulation and that will dominate the calculation of betting odds. But we can measure in terms of experiences (how would I trade off welfare between the group inside and outside the simulation). I'd perhaps take a 2:1 rate, i.e. implying I think there's a 2/3 "chance" that we're in a simulation? But pretty unstable and complicated.
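To spell out the odds-to-credence conversion in that last sentence (just standard arithmetic, nothing beyond what's already stated above):

$$P(\text{simulation}) = \frac{2}{2+1} = \frac{2}{3} \approx 67\%.$$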
Are there any research questions you're excited about people working on, for making AI go (existentially) well, that are not related to technical AI alignment or safety? If so, what? (I'm especially interested in AI strategy/governance questions)
Not sure if you want "totally unrelated to technical AI safety" or just "not basically the same as technical AI safety." Going for somewhere in between.
Should marginal CHAI PhD graduates who are dispositionally indifferent between the two options try to become a professor or do research outside of universities?
Not sure. If you don't want to train students, seems to me like you should be outside of a university. If you do want to train students it's less clear and maybe depends on what you want to do (and given that students vary in what they are looking for, this is probably locally self-correcting if too many people go one way or the other). I'd certainly lean away from university for the kinds of work that I want to do, or for the kinds of things that involve aligning large ML systems (which benefit from some connection to customers and resources).
And on an absolute level, is the world much more or less prepared for AGI than it was 15 years ago?
Follow-up: How much did the broader x-risk community change it at all?
What are your thoughts / advice on working as an individual vs joining an existing team / company when it comes to safety research? (For yourself and for others)
1. What credence would you assign to "+12 OOMs of compute would be enough for us to achieve AGI / TAI / AI-induced Point of No Return within five years or so." (This is basically the same, though not identical, with this poll question)
2. Can you say a bit about where your number comes from? E.g. maybe 25% chance of scaling laws not continuing such that OmegaStar, Amp(GPT-7), etc. don't work, 25% chance that they happen but don't count as AGI / TAI / AI-PONR, for total of about 60%? The more you say the better, this is my biggest crux! Thanks!
I'd say 70% for TAI in 5 years if you gave +12 OOM.
I think the single biggest uncertainty is about whether we will be able to adapt sufficiently quickly to the new larger compute budgets (i.e. how much do we need to change algorithms to scale reasonably? it's a very unusual situation and it's hard to scale up fast and depends on exactly how far that goes). Maybe I think that there's a 90% chance that TAI is in some sense possible (maybe: if you'd gotten to that much compute while remaining as well-adapted as we are now to our current levels of compute) and conditioned on that an 80% chance that we'll actually do it vs running into problems?
(Didn't think about it too much, don't hold me to it too much. Also I'm not exactly sure what your counterfactual is and didn't read the original post in detail, I was just assuming that all existing and future hardware got 12 OOMs faster. If I gave numbers somewhere else that imply much less than that probability with +12 OOMs, then you should be skeptical of both.)
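To make the arithmetic behind that headline number explicit (both factors are the made-up conditionals above, so don't read precision into it):

$$P(\text{TAI within 5 years} \mid +12\text{ OOMs}) \approx 0.9 \times 0.8 = 0.72 \approx 70\%.$$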
Natural language has both noise (that you can never model) and signal (that you could model if you were just smart enough). GPT-3 is in the regime where it's mostly signal (as evidenced by the fact that the loss keeps going down smoothly rather than approaching an asymptote). But it will soon get to the regime where there is a lot of noise, and by the time the model is 9 OOMs bigger I would guess (based on theory) that it will be overwhelmingly noise and training will be very expensive.
So it may or may not work in the sense of meeting some absolute performance threshold, but it will certainly be a very bad way to get there and we'll do something smarter instead.
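One standard way to formalize the signal/noise picture above (this is the usual scaling-law parameterization from the literature, not something stated in the answer itself) is to write the loss as an irreducible entropy term plus a reducible power law in model size $N$:

$$L(N) \approx L_{\infty} + \left(\frac{N_c}{N}\right)^{\alpha}$$

Here $L_{\infty}$ is the entropy of the text itself (the noise you can never model), and the power-law term is the signal. "Mostly signal" corresponds to the reducible term still dominating, while "overwhelmingly noise" means the curve has nearly flattened out at $L_{\infty}$ and further scaling buys very little.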
What research in the past 5 years has felt like the most significant progress on the alignment problem? Has any of it made you more or less optimistic about how easy the alignment problem will be?
Why did nobody in the world run challenge trials for the covid vaccine and save us a year of economic damage?
Wild speculation, not an expert. I'd love to hear from anyone who actually knows what's going on.
I think it's overoptimistic to say that human challenge trials would have saved a year, though it does seem like they could plausibly have saved weeks or months if done in the most effective form. (And in combination with other human trials and moderate additional spending I'd definitely believe 6-12 months of acceleration was possible.)
In terms of why so few human experiments have happened in general, I think it's largely because of strong norms designed to protect experiment participants (and taken quite seriously by doctors I've talked to), together with limited upside for the experimenters, an overriding desire for vaccine manufacturers to avoid association with a trial that ends up looking bad (this doesn't apply to other kinds of trial but the upside is often lower and there's no real stakeholder), a lack of understanding for a long time of how big a problem this would be, the difficulty of quickly shifting time/attention from other problems to this one, and the general difficulty of running experiments.
What do you do to keep up with AI Safety / ML / theoretical CS research, to the extent that you do? And how much time do you spend on this? For example, do you browse arXiv, Twitter, ...?
A broader question I'd also be interested in (if you're willing to share) is how you allocate your working hours in general.
Are there any good examples of useful or interesting sub-problems in AI Alignment that can actually be considered "solved"?
Given growth in both AI research and alignment research over the past 5 years, how do the rates of progress compare? Maybe separating absolute change, first and second derivatives.
Should more AI alignment research be communicated in book form? Relatedly, what medium of research communication is most under-utilized by the AI alignment community?
I think it would be good to get more arguments and ideas pinned down, explained carefully, collected in one place. I think books may be a reasonable format for that, though man they take a long time to write.
I don't know what medium is most under-utilized.
What mechanisms could effective altruists adopt to improve the way AI alignment research is funded?
Long run I'd prefer something like altruistic equity / certificates of impact. But frankly I don't think we have hard enough funding coordination problems that it's going to be worth figuring that kind of thing out.
(And like every other community we are free-riders---I think that most of the value of experimenting with such systems would accrue to other people who can copy you if successful, and we are just too focused on helping with AI alignment to contribute to that kind of altruistic public good. If only someone would be willing to purchase the impact certificate from us if it worked out...)
What is the main mistake you've made in your research, that you were wrong about?
Positive framing: what's been the biggest learning moment in the course of your work?
Basically every time I've shied away from a solution because it feels like cheating, or like it doesn't count / address the real spirit of the problem, I've regretted it. Often it turns out it really doesn't count, but knowing exactly why (and working on the problem with no holds barred) has been really important for me.
The most important case was dismissing imitation learning back in 2012-2014, together with basically giving up outright on all ML approaches, which I only recognized as a problem when I was writing up more carefully why those approaches were doomed and why imitation learning was a non-solution.
Any thoughts on the Neural Tangent Kernel/Gaussian Process line of research? Or attempts to understand neural network training at a theoretical level more generally?
Overall I haven't thought about it that much but it seems interesting. (I thought your NTK summary was good.)
With respect to alignment, the main lesson I've taken away is to be careful about intuitions that come from "building up structure slowly," you should at least check that all of your methods work fine in the local linear regime where in some sense everything is in there at the start and you are just perturbing weights a tiny bit. I think this has been useful for perspective. In some sense it's something you think about automatically when focusing on the worst case, but it's still nice to know which parts of the worst case are actually real and I think I used to overlook some of these issues more.
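For reference, the "local linear regime" here is the standard NTK/lazy-training linearization around the initial weights (a textbook formula, not something specific to this answer):

$$f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0)$$

In this regime training never leaves the span of the features $\nabla_\theta f(\cdot;\theta_0)$ fixed at initialization, which is the precise sense in which "everything is in there at the start and you are just perturbing weights a tiny bit."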
In practice it seems like the number of datapoints is large relative to the width, and in fact it's quite valuable to take multiple gradient descent steps even if your initialization is quite careful. So it doesn't seem like you can actually make the NTK simplification, i.e. you still have to deal with the additional challenges posed by long optimization paths. I'd want to think about this much more if there was a proposal that appeared to apply for the NTK but not for general neural...
What's your take on "AI Ethics", as it appears in large tech companies such as Google or Facebook? Is it helping or hurting the general AI safety movement?
I think "AI ethics" is pretty broad and I have different feelings about different parts. I'm generally supportive of work that makes AI better for humanity or non-human animals, even when it's not focused on the long-term. Sometimes I'm afraid about work in AI ethics that doesn't seem to pass any reasonable cost-benefit analysis, and that it will annoy people in AI and make it harder to get traction with pro-social policies that are better-motivated (I'm also sometimes concerned about this for work in AI safety). I don't have a strong view about the net effect of work in AI ethics on AI safety, but it generally seems good for the two communities to try to get along (at least as well as either of them gets along with AI more broadly, rather than viewing each other as competitors for some limited amount of socially-responsible oxygen).
Curated. I don't think we've curated an AMA before, and I'm not sure if I have a principled opinion on doing that, but this post seems chock full of small useful insights, and fragments of ideas that seem like they might otherwise take a while to get written up more comprehensively, which I think is good.
If you believe AGI will be created, what would be the median year you think it will be created?
E.g., 2046, 2074, etc.
2065
That's an estimate for TAI (i.e. world doubling every 4 years), not sure what "AGI" means exactly.
Broad distribution in both directions, reasonably good chance by 2040 (maybe 25%)?
Don't hold me to that. I think it's literally not the same as the last time someone asked in this AMA, inconsistencies preserved to give a sense for stability.
Which rationalist virtue do you identify with the strongest currently? Which one would you like to get stronger at?
There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you've done some great work rectifying this!). Why is this? And what are the most compelling threat models that don't have good public write-ups? In particular, are there under-appreciated threat models that would lead to very different research priorities within Alignment?
What sort of epistemic infrastructure do you think is importantly missing for the alignment research community?
I mostly found myself more agreeing with Robin, in that e.g. I believe previous technical change is mostly a good reference class, and that Eliezer's AI-specific arguments are mostly kind of weak. (I liked the image, I think from that debate, of a blacksmith emerging into the town square with his mighty industry and making all bow before them.)
That said, I think Robin's quantitative estimates/forecasts are pretty off and usually not very justified, and I think he puts too much stock in an outside view extrapolation from past transitions rather than looking at the inside view for existing technologies (the extrapolation seems helpful in the absence of anything else, but it's just not that much evidence given the shortness and noisiness of the time series and the shakiness of the underlying regularities). I don't remember exactly what kinds of estimates he gives in that debate.
(This is more obvious for his timeline estimates, which I think have an almost comically flimsy justification given how seriously he takes them.)
Overall I think that it would be more interesting to have a Carl vs Robin FOOM debate; I expect the outcome would be Robin saying "do you really call that a FOOM?" and Carl saying "well it is pretty fast and would have crazy disruptive geopolitical consequences and generally doesn't fit that well with your implied forecasts about the world even if not contradicting that many of the things you actually commit to" and we could all kind of agree and leave it at that modulo a smaller amount of quantitative uncertainty.
In the interview with AI Impacts, you said:
...examples of things that I’m optimistic about that they [people at MIRI] are super pessimistic about are like, stuff that looks more like verification...
Are you still optimistic? What do you consider the most promising recent work?
I don't think my view has changed too much (I don't work in the area so don't pay as much attention or think about it as often as I might like).
The main updates have been:
What's the optimal ratio of researchers to support staff in an AI alignment research organization?
What works of fiction / literature have had the strongest impact on you? Or perhaps, that are responsible for the biggest difference in your vector relative to everyone else's vector?
(e.g. lots of people were substantially impacted by the Lord of the Rings, but perhaps something else had a big impact on you that led you in a different direction from all those people)
(that said, LotR is a fine answer)
You gave a great talk on the AI Alignment Landscape 2 years ago. What would you change if giving the same talk today?
Do you think progress has been made on the question of "which AIs are good successors?" Is this still your best guess for the highest impact question in moral philosophy right now? Which other moral philosophy questions, if any, would you put in the bucket of questions that are of comparable importance?
Philosophical Zombies: inconceivable, conceivable but not metaphysically possible, or metaphysically possible?
Other than by doing your own research, from where or whom do you tend to get valuable research insights?
What would you advise a college student to do if the student is unusually good at math and wants to contribute to creating an aligned AGI? Beyond a computer science major/multivariable calculus/linear algebra/statistics what courses should this student take?
How will we know when it's not worth getting more people to work on reducing existential risk from AI?
You've appeared on the 80,000 Hours podcast two times. To the extent that you remember what you said in 2018-19, are there any views you communicated then which you no longer hold now? Another way of asking this question is—do you still consider those episodes to be accurate reflections of your views?
What kind of relationships to 'utility functions' do you think are most plausible in the first transformative AI?
How does the answer change conditioned on 'we did it, all alignment desiderata got sufficiently resolved' (whatever that means) and on 'we failed, this is the point of no return'?
I'm curious about the extent to which you expect the future to be awesome-by-default as long as we avoid all clear catastrophes along the way; vs to what extent you think we just have a decent chance of getting a non-negligible fraction of all potential value (and working to avoid catastrophes is one of the most tractable ways of improving the expected value).
Proposed tentative operationalisation:
I would guess GCRs are generally less impactful than pressures that lead our collective preferences to evolve in a way that we wouldn't like on reflection. Such failures are unrecoverable catastrophes in the sense that we have no desire to recover, but in a pluralistic society they would not necessarily or even typically be global. You could view alignment failures as an example of values drifting, given that the main thing at stake is our preferences about the universe's future rather than the destruction of earth-originating intelligent life.
I expect this is the kind of thing I would be working on if I thought that alignment risk was less severe. My best guess about what to do is probably just futurism---understanding what is likely to happen and giving us more time to think about that seems great. Maybe eventually that leads to a different priority.
I'd be interested in your thoughts on human motivation in HCH and amplification schemes.
Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?
Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn't relate to the question/task.
[I don't think you've addressed this at all recently - I've on...
What do you think of a successor AI that collects data on one’s wellbeing (‘height’ via visual analog scale and ‘depth’ by assessing one’s understanding of the rationale for their situation), impact (thinking and actions toward others), and connections (to verify impact based on network analysis and wellbeing data and to predict populations’ welfare), motivates decreases of suffering groups’ future generations, rewards individuals with impact that is increasing or above a certain level, and withdraws benefits/decreases wellbeing of individuals whose impact is decreasing and below a certain level?
I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1).
If you want to ask something just post a top-level comment; I'll spend at least a day answering questions.
You can find some background about me here.