All of AdamGleave's Comments + Replies

AdamGleaveΩ11163

Whether a PhD is something someone will enjoy is so dependent on individual personality, advisor fit, etc that I don't feel I can offer good generalized advice. Generally I'd suggest people trying to gauge fit try doing some research in an academic environment (e.g. undergrad/MS thesis, or a brief RA stint after graduating) and talk to PhD students in their target schools. If after that you think you wouldn't enjoy a PhD then you're probably right!

Personally I enjoyed my PhD. I had smart & interesting colleagues, an advisor who wanted me to do high-qua... (read more)

AdamGleaveΩ13219

I'm sympathetic to a lot of this critique. I agree that prospective students should strive to find an advisor that is "good at producing clear, honest and high-quality research while acting in high-integrity ways around their colleagues". There are enough of these you should be able to find one, and it doesn't seem worth compromising.

Concretely, I'd definitely recommend digging into an advisor's research and asking their students hard questions prior to taking any particular PhD offer. There absolutely are labs that prioritize publishing above all els... (read more)

AdamGleaveΩ66-2

Thanks for the post Ryan -- I agree that, given the difficulty of making models meaningfully robust, the best near-term solution to misuse is going to be a defence-in-depth approach consisting of filtering the pre-training data, input filtering, output filtering, automatic and manual monitoring, KYC checks, etc.
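For concreteness, here's a minimal sketch of what stacking those layers might look like; all function names, keyword lists and thresholds are hypothetical placeholders, purely to illustrate the layering rather than any actual deployment.

```python
# Hypothetical sketch of a defence-in-depth serving pipeline: each layer is
# imperfect on its own, but a request must pass all of them to succeed.
from dataclasses import dataclass

@dataclass
class Request:
    user_id: str
    prompt: str

def log_for_manual_review(request: Request) -> None:
    # Stand-in for automatic + manual monitoring.
    print(f"flagged for review: {request.user_id!r}")

def kyc_verified(user_id: str) -> bool:
    # Placeholder: look up the user in a know-your-customer database.
    return user_id in {"verified-user-1", "verified-user-2"}

def input_filter(prompt: str) -> bool:
    # Placeholder keyword screen; a real system would use a trained classifier.
    banned = ["synthesize nerve agent", "build a bioweapon"]
    return not any(phrase in prompt.lower() for phrase in banned)

def generate(prompt: str) -> str:
    # Stand-in for the model call (the model itself trained on filtered data).
    return f"[model response to: {prompt}]"

def output_filter(completion: str) -> bool:
    # Placeholder classifier over the model's output.
    return "step-by-step synthesis route" not in completion.lower()

def handle(request: Request) -> str:
    if not kyc_verified(request.user_id):
        return "refused: unverified user"
    if not input_filter(request.prompt):
        log_for_manual_review(request)
        return "refused: input flagged"
    completion = generate(request.prompt)
    if not output_filter(completion):
        log_for_manual_review(request)
        return "refused: output flagged"
    return completion
```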

At some point though we'll need to grapple with what to do about models that are superhuman in some domains related to WMD development, cybercrime or other potential misuses. There are glimmers of this already, e.g. my impressi... (read more)

4ryan_greenblatt
This seems right to me, but it's worth noting that this point might occur after the world is already radically transformed by AI (e.g. all human labor is obsolete). So, it might not be a problem that humans really need to deal with. The main case I can imagine where this happens prior to the world being radically transformed is the case where automatic drug/virus/protein outputting AIs (like you mentioned) can do a massive amount of the work end-to-end. I'd hope that for this case, the application is sufficiently narrow that there are additional precautions we can use, e.g. just have a human screen every request to the model. But this seems pretty scary overall.

It can definitely be worth spending money when there's a clear case for it improving employee productivity. I will note there are a range of both norms and physical layouts compatible with open-plan, ranging from "everyone screaming at each other and in line of sight" trading floor to "no talking library vibes, desks facing walls with blinders". We've tried to make different open plan spaces zoned with different norms and this has been fairly successful, although I'm sure some people will still be disturbed by even library-style areas and be more productive in a private office.

2epistemic meristem
I'd thought it was a law of nature that quiet norms for open plans don't actually work; it sounds like you've found a way to have your cake and eat it too!
AdamGleaveΩ14316

When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people'... (read more)

AdamGleaveΩ220

Thanks for flagging this disagreement Ryan. I enjoyed our earlier conversation (on LessWrong and in-person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. Would love to catch up again sometime and see if we can converge more on this. I'll try and summarize my current take and our key disagreements for the benefit of other readers.

I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifi... (read more)

AdamGleaveΩ242

Thanks, that's a good link. In our case our assets significantly exceed the FDIC $250k insurance limit and there are operational costs to splitting assets across a large number of banks. But a high-interest checking account could be a good option for many small orgs.

3Chipmonk
This isn't strictly true: Some "banks" partner with many (real) banks to get greater FDIC coverage. E.g.: Wealthfront has up to $5M FDIC (so I imagine they partner with up to 20 accredited banks).
AdamGleaveΩ151

Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu's anti-circle training sounds a lot like what you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezin

... (read more)
8gwern
I wouldn't really expect larger convolutions to fix it, aside from perhaps making the necessary 'circles' larger and/or harder to find or create longer cycles in the finetuning as there's more room to squish the attack around the balloon. It could be related to problems like the other parameters of the kernel like stride or padding. (For example, I recall the nasty 'checkerboard' artifacts in generative upscaling were due to the convolution stride/padding, and don't seem to ever come up in Transformer/MLP-based generative models but also simply making the CNN kernels larger didn't fix it, IIRC - you had to fix the stride/padding settings.)

I personally would find it interesting but I don't know how important it is. It seems likely that you might find a completely different-looking adversarial attack, but would that be conclusive? There would be so many things that change between a CNN KataGo and a from-scratch ViT KataGo. Especially if you are right that Timbers et al find a completely different adversarial attack in their AlphaZero which AFAIK still uses CNNs. Maybe you could find many different attacks if you change up enough hyperparameters or initializations.

On the gripping hand, now that I look at this earlier version, their description of it as a weird glitch in AZ's evaluation of pass moves at the end of the game sounds an awful lot like your first Tromp-Taylor pass exploit, i.e. it could probably be easily fixed with some finetuning. And in that case, perhaps Timbers et al would have found the 'circle' exploit in AZ after all if they had gotten past the first easy end-game pass-related exploit? (This also suggests a weakness in the search procedures: it really ought to produce more than one exploit, preferably a whole list of distinct exploits. Some sort of PBT or novelty search approach perhaps...)

Maybe a mechanistic interpretability approach would be better: if you could figure out where in KataGo it screws up the value estimate so badly, and what edits a
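As an aside for readers unfamiliar with the checkerboard issue gwern mentions: it's easy to reproduce that it comes from the stride/kernel interaction rather than kernel size. A quick standalone PyTorch illustration (not from the post, just the standard demonstration):

```python
import torch
import torch.nn as nn

# Transposed conv with kernel size not divisible by stride: even a constant
# input produces a periodic, uneven ("checkerboard") output, because output
# pixels receive contributions from different numbers of kernel taps.
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, bias=False)
nn.init.constant_(deconv.weight, 1.0)

x = torch.ones(1, 1, 4, 4)
with torch.no_grad():
    print(deconv(x)[0, 0])  # alternating values (1s, 2s and 4s)

# Making the kernel larger doesn't remove the pattern; matching kernel size
# to stride does.
deconv2 = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
nn.init.constant_(deconv2.weight, 1.0)
with torch.no_grad():
    print(deconv2(x)[0, 0])  # uniform output
```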
AdamGleaveΩ21499

This matches my impression. FAR could definitely use more funding. Although I'd still at the margin rather hire someone above our bar than e.g. have them earn-to-give and donate to us, the math is getting a lot closer than it used to be, to the point where those with excellent earning potential and limited fit for AI safety might well have more impact pursuing a philanthropic pathway.

I'd also highlight there's a serious lack of diversity in funding. As others in the thread have mentioned, the majority of people's funding comes (directly or indirectly) from... (read more)

1epistemic meristem
Re open plan offices: many people find them distracting. I doubt they're a worthwhile cost-saving measure for research-focused orgs; better to have fewer researchers in an environment conducive to deep focus. I could maybe see a business case for them in large orgs where it might be worth sacrificing individual contributors' focus in exchange for more legibility to management, or where management doesn't trust workers to stay on task when no one is hovering over their shoulder, but I hope no alignment org is like that. For many people open plan offices are just great, of course, and I think it can be hard for them to grok how distracting they can be for people on the autism spectrum, to pick a not-so-random example. :) But I like the idea of looking for ways to increase efficiency!
9Steven Byrnes
It’s easier than that—there are high-interest-rate free FDIC-eligible checking accounts. MaxMyInterest.com has a good list, although you might need to be a member to view it. As of this moment (2023-07-20), the top of their leaderboard is: Customers Bank (5.20% APY), BankProv (5.15%), BrioDirect (5.06%), UFB Direct (5.06%).
AdamGleaveΩ330

I still don't understand which of (1), (2), or (3) you're most worried about.

Sample efficiency isn't the main way I think about this topic so it's a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it'd be (3).

I tend to view ML training as a model taking a path through a space of possible programs. There are some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don't do anything parti... (read more)

AdamGleaveΩ111

Oh, we're using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model's learning process). I don't especially care about what definitions we use here, but do wonder if this means we're speaking past each other in other areas as well.

AdamGleaveΩ110

Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It's always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the "environment" is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)

I'll need to take a closer look at the paper, but it looks like they derive the DPO objective by ... (read more)
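For reference, my understanding is that the DPO objective from the paper, on a preference pair $(x, y_w, y_l)$, is roughly

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

i.e. a logistic loss on the difference of reference-adjusted log-likelihoods, with $\beta$ playing the role the KL coefficient plays in RLHF, so the reward model and the RL loop get folded into a single supervised objective.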

AdamGleaveΩ110

Thanks for the follow-up, this helps me understand your view!

At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.

Sure, this feels basically right to me. My reframing of this would be that we could in principle do RL directly with feedback provided b... (read more)

1Joseph Miller
It's unclear to me if this is true because modelling the humans that generated the training data sufficiently well probably requires the model to be smarter than the humans it is modelling. So I expect the current regime where RLHF just elicits a particular persona from the model rather than teaching any new abilities to be sufficient to reach superhuman capabilities.
1ryan_greenblatt
(Here's a possibly arcane remark. It's worth noting that I think always correct reward models are sufficient for high stakes alignment via runtime filtering (technically you just need to never give a very bad output decent reward). So, always correct reward models would be great if you could get them. Note that always correct could look pretty different than current adversarial robustness; in particular, you also need to worry about collusion and stuff like seizing physical control of the reward channel. I also think that avoiding high stakes failure via adversarial training and/or other robustness style tech seems generally good. I've been assuming we're talking about the low stakes alignment problem (aka outer alignment))
1ryan_greenblatt
I think this trend extrapolation argument seems fine in the absence of a specific defeater. In the case of AI takeover, there is a clear defeater to 'models have been getting more aligned over time'. I was trying to get at which specific defeater you thought overcame the trend extrapolation argument. Here's an attempt at an exhaustive list of defeaters:

1. Thinking sample efficiency would get worse in the future, breaking the current trend
2. Thinking that the current 'trend extrapolated' sample efficiency would be insufficient or otherwise good to improve on the margin
3. Thinking that there would be important negative consequences of reward model non-robustness which aren't well described by sample efficiency (e.g., teaching the model to lie via first practicing on the reward model, or having an easier time exploring into bad behavior or something)

It's also worth noting that if I was centrally interested in (2) I would push on that directly. (But you have other applications of robustness in mind, so this might not be that interesting.) I still don't understand which of (1), (2), or (3) you're most worried about. (Maybe (3) based on some argument I don't yet understand? I also don't see why hacking the reward model is dangerous, which is maybe an important crux here.)

Aside on alignment trend extrapolation: I'm also not really sure how to measure 'aligned' in a way that makes sense given that models have been also getting smarter. It seems plausible that alignment has been notably improving over time? Beyond this, the more natural trend extrapolation might be takeover risk. My guess is that the ex-ante takeover risk from GPT4 should have been like 0.1% and then the future/ongoing risk is more like 0.001% or something. And, the trend extrapolation doesn't look good here : ).
1ryan_greenblatt
Are you assuming that we can't collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.
1ryan_greenblatt
I don't really see why (b) leads to dangerous failures. It seems like failures should be totally benign and just result in somewhat lower production? Beyond this, it seems like this failure should happen early as it doesn't require clever models to occur, so by default there will be strong commercial incentives to resolve this. I agree it's an alignment failure in some sense which could be addressed by alignment technology. I just think it isn't very important to reduce from an X-risk/AI takeover perspective.
1ryan_greenblatt
Minor point, feel free to ignore. FWIW, I typically use 'reward hacking' to refer to just (a) here. I'd just call (b) 'poor reward model sample efficiency'. That said, I more centrally use 'reward hacking' to describe hacking a reward process based on outcomes via stuff like 'sensor tampering', but this is still a subset of RLHF: the subset where humans look at outcomes and then assess reward taking this into account.
1ryan_greenblatt
I originally linked to the wrong paper! : ( Here is the actual Direct Preference Optimization paper. (I guess I just googled something like 'DPO RL' and then didn't actually check that it was the right paper) Yikes, sorry for wasting your time.
AdamGleaveΩ440

To check my understanding, is your view something like:

  1. If the reward model isn't adversarially robust, then the RL component of RLHF will exploit it.

  2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.

  3. The reward model will stop being vulnerable to those adversarial examples, although may still be vulnerable to other adversarial examples.

  4. If we repeat this iterative process enough times,... (read more)
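To make sure we're picturing the same loop as steps 1-4 above, here's a toy numerical illustration of that dynamic. This is not anything from the thread, and all values are made up; the point is just that the policy finds the reward model's errors, and human labels on exactly those samples patch them.

```python
# Toy iterated-RLHF loop: the policy exploits reward-model errors, humans label
# what the policy actually produces, and the reward model gets patched there.

outputs = ["helpful answer", "mediocre answer", "exploit A", "exploit B"]
true_utility = {"helpful answer": 1.0, "mediocre answer": 0.3,
                "exploit A": -1.0, "exploit B": -1.0}

# Initial reward model: trained only on ordinary outputs, so it wildly
# overestimates the exploits (an adversarial weakness).
reward_model = {"helpful answer": 0.9, "mediocre answer": 0.3,
                "exploit A": 2.0, "exploit B": 1.5}

for rnd in range(4):
    # 1. "RL" step: policy moves to whatever the reward model scores highest.
    policy = max(outputs, key=lambda y: reward_model[y])

    # 2. Human feedback on a sample from the *current* policy,
    #    so any exploit shows up in the labeled data.
    human_label = true_utility[policy]

    # 3. Patch the reward model on the sampled output.
    reward_model[policy] = human_label
    print(f"round {rnd}: policy={policy!r}, human label={human_label}")

# 4. After a couple of rounds the exploits are down-weighted and the policy
#    settles on the genuinely preferred output.
assert max(outputs, key=lambda y: reward_model[y]) == "helpful answer"
```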

1ryan_greenblatt
This is pretty close to my understanding, with one important objection. Thanks for responding and trying to engage with my perspective.

Objection

I don't claim we'll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward model which captured our local preferences about the actual current policy distribution sufficiently well. At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want. KL penalty is presumably an important part of the picture. My overall claim is just that normal RLHF is pretty fault tolerant and degrades pretty gracefully with lack of robustness.

Sample efficiency seems fine now and progress continues

Beyond this, my view is considerably informed by 'sample efficiency seems fine in practice now and is getting better, not worse, with more powerful models (from my limited understanding)'. It's possible this trend could reverse with sufficiently big models, but I'd find this surprising and don't see any particular reason to expect this. Technical advancement in RLHF is ongoing and will improve sample efficiency further. It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can't compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (re

Yes, thanks for spotting my typo! ($2.75 psf isn't crazy for Berkeley after negotiation, but is not something I've ever seen as a list price.)

To compare this to other costs, renting two floors of the WeWork, which we did for most of the summer last year, cost around $1.2M/yr for 14,000 sq. ft. of office space. The Rose Garden has 20,000 sq. ft. of floor space and 20,000 additional sq. ft. of usable outdoor space for less implied annual cost than that.

I'm sympathetic to the high-level claim that owning property usually beats renting if you're committing for a long time period. But the comparison with WeWork seems odd: WeWork specializes in providing short-term, serviced office space and does so a... (read more)

5Linch
Very small nitpick, but did you mean $3.25-$3.75? (This was the smallest diff I could think of to make your calculation clear). 
2habryka
Yep, if you commit for longer time periods you can definitely get better deals, and there are definitely other ways to save on office space costs. I didn't mean to imply this was the minimum you could rent office space for. The $1.2M/yr estimate was roughly what we were paying at the time, and as such was the central comparison point we had. Comparing it to something more like $800k-$900k a year also seems reasonable to me, though I have less experience with the exact tradeoffs faced by doing that. One reason for that comparison is that the price estimate I did in the comment above included utilities and servicing the space, and I don't have a ton of experience with how much cost that adds to an unserviced office lease, though I still expect it to be a bunch lower than the WeWork prices.
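For readers following the numbers, here is the back-of-envelope comparison being made, using only the figures quoted in this thread (rounding is mine):

```python
# Implied $/sq ft/month for the WeWork arrangement quoted above.
wework_annual = 1_200_000      # $/yr for ~14,000 sq ft (serviced, short-term)
wework_sqft = 14_000
print(wework_annual / wework_sqft / 12)   # ~ $7.1 per sq ft per month

# A conventional long-term lease at the $3.25-$3.75/sq ft list prices mentioned
# above, applied to a Rose-Garden-sized 20,000 sq ft space.
for psf in (3.25, 3.75):
    print(psf * 20_000 * 12)   # ~ $780k-$900k per year, before servicing costs
```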
AdamGleaveΩ110

This is a good point, adversarial examples in what I called in the post the "main" ML system can be desirable even though we typically don't want them in the "helper" ML systems used to align the main system.

One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that ... (read more)

AdamGleaveΩ110

Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. It's not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing a bunch of cameras would be punished by that loss (not something humans do) and changing human behavior wouldn't help as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most of the oth... (read more)

AdamGleaveΩ110

Thanks, I'd missed that!

Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training but I could imagine this just being used for evals.

1Ethan Caballero
The setting was adversarial training and adversarial evaluation. During training, a PGD attacker with 30 iterations is used to construct the adversarial examples used for training. During testing, the evaluation set is an adversarial test set constructed via a PGD attacker with 20 iterations. Experimental data for the y-axis is obtained from Table 7 of https://arxiv.org/abs/1906.03787; experimental data for the x-axis is obtained from Figure 7 of https://arxiv.org/abs/1906.03787.
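For readers unfamiliar with the setup described here, below is a minimal PyTorch sketch of L-infinity PGD adversarial training. This is a generic sketch, not the exact recipe from the paper; the epsilon, step size and step counts are placeholders.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=30):
    """Construct L-infinity PGD adversarial examples (e.g. 30 steps for
    training, 20 for evaluation, as in the setup described above)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # Train on adversarial examples generated against the current model.
    x_adv = pgd_attack(model, x, y, steps=30)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```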
AdamGleaveΩ8101

Rachel did the bulk of the work on this post (well done!), I just provided some advice on the project and feedback on earlier manuscripts.

I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.

We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL has been proposed as a solution to this. Indeed Dylan Hadfield-Menell et al show that CIRL is provably corrigible in a simple setting, the off-switch game.

Provably corrigible sounds great, but where there's a proof there... (read more)
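To spell out the result being referenced: in the off-switch game, the robot proposes an action whose utility it is uncertain about, and the human can either allow it or switch the robot off. Deferring is (weakly) optimal for the robot precisely because of its uncertainty. A tiny numerical sketch with my own toy numbers, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Robot's belief over the utility U_a of its proposed action (unknown to it).
belief_samples = rng.normal(loc=0.2, scale=1.0, size=100_000)

act_directly = belief_samples.mean()                    # E[U_a]
switch_self_off = 0.0
# Defer to a rational human: they allow the action iff U_a > 0, else switch off.
defer_to_human = np.maximum(belief_samples, 0).mean()   # E[max(U_a, 0)]

print(f"act directly:   {act_directly:.3f}")
print(f"switch off:     {switch_self_off:.3f}")
print(f"defer to human: {defer_to_human:.3f}")
# E[max(U_a, 0)] >= max(E[U_a], 0), with equality only when the robot is
# certain about the sign of U_a -- uncertainty is what makes deferring optimal.
```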

1Closed Limelike Curves
The standard approaches to dealing with this are nonparametric models, safe Bayes, and including many different models in your space of all possible models.
4Rohin Shah
(Indeed, this was because I didn't see shutdown corrigibility as a difference between assistance games and reward learning -- optimal policies for both would tend to avoid shutdown.)

I'm excited by many of the interventions you describe but largely for reasons other than buying time. I'd expect buying time to be quite hard, in so far as it requires coordinating many actors to stop doing something they're incentivized to do. Whereas since the alignment research community is small, doubling it is relatively easy. That said, it's ultimately a point in favor of the interventions that they look promising under multiple worldviews, though it might lead me to prioritize within them differently to you.

One area I would push back on is t... (read more)

AdamGleaveΩ220

I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the... (read more)

1Johannes Treutlein
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape. E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.
AdamGleaveΩ5110

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can takeover. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. Bu... (read more)

3Erik Jenner
Interesting points, I agree that our response to part C doesn't address this well.

AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It's harder to make an airtight x-risk argument using fast takeoff, since I don't think we have airtight arguments for p(fast takeoff) being close to 1, but still important to consider if we're figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for.)

A more robust worry (and what I'd probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don't take over the world in obvious, visible ways. They usually don't break laws in ways we'd notice, they don't kill humans, etc. On paper, humans "own" the entire economy, but in practice, they have an increasingly hard time achieving outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively "take over the world" (in the sense that humans don't actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and if AIs aren't great at collaborating against humanity.

Addressing a few specific points: True to some extent, but I'd expect AI progress to be much faster than hu
AdamGleaveΩ450

Thanks for the quick reply! I definitely don't feel confident in the 20W number, I could believe 13W is true for more energy efficient (small) humans, in which case I agree your claim ends up being true some of the time (but as you say, there's little wiggle room). Changing it to 1000x seems like a good solution though which gives you plenty of margin for error.

AdamGleaveΩ8122

This is a nitpick, but I don't think this claim is quite right (emphasis added)

 If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!

First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named Appraising the brain's energy bu... (read more)

3Steven Byrnes
Thanks! Prior to your comment, the calculation in my head was 12 W × 10,000 × 10¢/kWh < $14.25/hr. The biggest difference from you is that I had heard 12 watts for brain energy consumption somewhere, and neglected to check it. I don’t recall where I had heard that, but for example, 12 W is in this article. They used the 20% figure, but for resting metabolic rate they cite this which says 1740 kcal/day (→16.9W) in men, 1348 kcal/day (→13.1W) in women, and the article turns 13.1W into 12W by sketchy rounding. That still presupposes that the 20% is valid in both genders.

I traced the “20%” back to here which cites papers from 1957 & 1960 (and 1997 but that’s another secondary source). I downloaded the 1957 source (Kety, “The general metabolism of the brain in vivo”. In: Metabolism of the nervous system (Richter D, ed), pp 221–237), and it did cite studies of both men and women, and suggested that it scales with brain mass. I don’t understand everything that goes into the calculation, but they do say 20 W directly, so I certainly feel best about that number, but AFAICT it remains likely that the power would be lower for smaller-than-average people including most women. I’m still confused about the discrepancy with earlier in this paragraph, but I don’t want to spend more time on it. ¯\_(ツ)_/¯

My intended meaning was that the “power consumption” of “a silicon-chip AGI server” was all-in power consumption including HVAC, but I can see how a reader could reasonably interpret my words as excluding HVAC. I specifically said “my local minimum wage” because I happen to live in a state (Massachusetts) with a high minimum wage of $14.25/hr. (The cost to the employer is of course a bit higher, thanks to legally-mandated employer taxes, sick days, sick-family days, etc.) Granted, we have unusually expensive electricity here in Massachusetts too, but people normally put servers where electricity is cheaper and talk to them over the internet. Anyway, I clearly messed up especially b
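For anyone wanting to check the crux of the disagreement, the arithmetic is short (using the 10¢/kWh electricity price and $14.25/hr wage quoted above):

```python
# 10,000x the brain's power draw, priced at $0.10/kWh, versus a $14.25/hr wage.
for brain_watts in (12, 20):
    server_kw = brain_watts * 10_000 / 1000       # kW
    cost_per_hour = server_kw * 0.10              # $/hr at 10 cents per kWh
    print(f"{brain_watts} W brain -> {server_kw:.0f} kW server -> ${cost_per_hour:.2f}/hr")
# 12 W -> 120 kW -> $12.00/hr  (below the $14.25 minimum wage)
# 20 W -> 200 kW -> $20.00/hr  (above it, before even counting HVAC)
```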
AdamGleaveΩ221

"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations are harder so I'd have expected scale to increase not decrease EV answers.) Follow-up questions... (read more)

3Rohin Shah
I think the inverse scaling here is going from "random answer" to "win/loss detection" rather than "EV calculation" to "win/loss detection".
AdamGleaveΩ141

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.


A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for later formalizing it and studying it more rigo... (read more)

3David Scott Krueger (formerly: capybaralet)
I think part of this has to do with growing pains in the LW/AF community... When it was smaller it was more like an ongoing discussion with a few people and signal-to-noise wasn't as important, etc. 
AdamGleaveΩ7167

Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn't; this is both because it can be picked up by the ML community, and because it's much harder to tell if you are making meaningful progress if your work doesn't meet these standards of rigor.

Strong agreement with this! I'm frequently told by people that you "cannot publish" on a certain area, but in my experience this is rarely true... (read more)

Joe CollmanΩ5157

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptua... (read more)

3David Scott Krueger (formerly: capybaralet)
Agree RE systemic blindspots, although the "algorithmic contribution" thing is sort of a known issue that a lot of senior people disagree with, IME.

I liked this post and think it'll serve as a useful reference point, I'll definitely send it to people who are new to the alignment field.

But I think it needs a major caveat added. As a survey of alignment researchers who regularly post on LessWrong or interact closely with that community, it does a fine job. But as capybaralet already pointed out, it misses many academic groups. And even some major industry groups are de-emphasized. For example, DeepMind alignment is 20+ people, and has been around for many years. But it's got if anything a slightly less ... (read more)

2Thomas Larsen
Strong upvoted and I quite like this antidote, I will work on adding my guess of the scale of these orgs into the table. 

One omission from the list is the Fund for Alignment Research (FAR), which I'm a board member of. That's fair enough: FAR is fairly young, and doesn't have a research agenda per se, so it'd be hard to summarize their work from the outside! But I thought it might be of interest to readers so I figured I'd give a quick summary here.

FAR's theory of change is to incubate new, scalable alignment research agendas. Right now I see a small range of agendas being pursued at scale (largely RLHF and interpretability), then a long tail of very diverse agendas being p... (read more)

1Thomas Larsen
Hi Adam, thank you so much for writing this informative comment. We've added your summary of FAR to the main post (and linked this comment). 
AdamGleaveΩ250

A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.

The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
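If anyone wants to construct that "perfect correlation" variant, the filtering step is a few lines over the Waterbirds metadata. This is a hypothetical sketch: the column names ("y" for bird class, "place" for background, "split") follow my recollection of the standard metadata file, so check them against the release you actually use.

```python
import pandas as pd

# Keep only training examples where the bird label matches the background,
# making the spurious correlation perfect on the training set.
meta = pd.read_csv("waterbirds/metadata.csv")

train = meta[meta["split"] == 0]
perfectly_confounded_train = train[train["y"] == train["place"]]

# Keep the original val/test splits so worst-group accuracy is still measurable.
filtered = pd.concat([perfectly_confounded_train, meta[meta["split"] != 0]])
filtered.to_csv("waterbirds/metadata_perfect_spurious.csv", index=False)
```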

1Charbel-Raphaël
Very Interesting, thank you.

A lot of this argument seems to rest on the training-inference gap, allowing a very large population of AIs to exist at the same cost as training. In that way they can be a formidable group even if the individual AIs are only human-level. I was suspicious of this at first, but I found myself largely coming round to it after sanity checking it using a slightly different method than biological anchors. However, if I understand correctly the biological anchors framework implies the gap between training and inference grows with capabilities. My projection i... (read more)
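One way to make the training-inference gap concrete, under standard but contestable assumptions (training cost ≈ 6·N·D FLOP, inference cost ≈ 2·N FLOP per token, Chinchilla-style D ≈ 20·N; the model sizes below are arbitrary illustrations):

```python
# Rough sketch: how many tokens of inference does one training run "buy"?
for n_params in (7e10, 7e11, 7e12):           # hypothetical model sizes
    d_tokens = 20 * n_params                   # Chinchilla-style data budget
    train_flop = 6 * n_params * d_tokens
    inference_flop_per_token = 2 * n_params
    tokens_per_training_budget = train_flop / inference_flop_per_token  # = 3*D
    print(f"N={n_params:.0e}: ~{tokens_per_training_budget:.1e} tokens of "
          f"inference for the cost of training")
# The ratio equals 3*D, which grows with scale under these assumptions -- one
# sense in which the training-inference gap widens with capabilities.
```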

1Roger Scott
Isn't multi-epoch training most likely to lead to overfitting, making the models less useful/powerful? If it were possible to write an algorithm to generate this synthetic training data how would the resulting training data have any more information content than the algorithm that  produced it? Sure, you'd get an enormous increase in training text volume, but large volumes of training data containing small amounts of information seems counterproductive for training purposes -- it will just bias the model disproportionately toward that small amount of information.

I did actually mean 45, in "all-things-considered" I was including uncertainty in whether my toy model was accurate. Since it's a right-tailed distribution, my model can underestimate the true amount a lot more than it can overestimate it.

For what it's worth, my all-things-considered view for Delta is now more like 30, as I've not really seen anything all that compelling for long COVID being much worse than in the model. I'm not sure about Omicron; it seems to be less virulent, but also to have more vaccine escape. Somewhere in the 15-90 day range sounds right to me, I've not thought enough to pin it down precisely.

AdamGleaveΩ330

My sense is that Stuart's assumption of an initially-specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.

2Rohin Shah
Yeah, I agree with that. (I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to)

It could be net-negative if receiving a booster shot caused stronger imprinting, making future immune response less adaptive. I don't have a good sense of whether this original antigenic sin effect has already saturated after receiving two doses (or even a single dose), or whether it continues to become stronger.

My sense is this is an open question. From Petras et al (2021):

As suggested by a recent observation in naturally immunized individuals receiving two doses of the Pfizer COVID-19 (Comirnaty) vaccine, original antigenic sin may pose a problem in fu

... (read more)
2DirectedEvolution
One possible strategy in a world of sane and effective governance might be to reserve one or more protein targets for a truly global mass-vaccination campaign. Really drill in the idea that we have to wipe out Covid or else live in a world that's long-term deadlier than it was before. Produce enough vaccine and infrastructure to get the planet vaccinated in a short period of time. Then deliver it all at once. This could be going on in the background while we maintain our present efforts, building consensus and establishing infrastructure.

I largely agree with this analysis. One major possible "side-effect" of a third booster is original antigenic sin. Effectively, the immune system may become imprinted on the ancestral variant of the spike protein, preventing adaptation to new variants (whether via direct exposure or via future boosters targeting new variants). This would be the main way I could see a third booster being seriously net-negative, although I don't have a good sense of the probability. Still, if antibody levels are low, the benefit of a booster is greater and I'd guess (caveat: not a... (read more)

3DirectedEvolution
From the article: This was the first time I've encountered this concept. It actually made it seem like a booster shot would just be ineffective, rather than "seriously net-negative." Immunological memory would be optimized for the early variant the vaccine was designed for, and would be unable (or less able?) to update for the new variant. Pfizer and Moderna vaccines target the spike protein. However, there are several other potential protein targets. Maybe the booster could be designed to target these instead. That said, my read on the original antigenic sin article makes it seem like a plausible cause of breakthrough cases of Covid. If so, then I'd predict that antibody levels wouldn't be a good predictor of susceptibility to infection. But it would be much better to base this on empirical data, and I don't know if that exists.

Thanks for sharing this! I did notice a weird non-monotonicity: if I go from 90 minutes exposure to 120 minutes, the "Percent of Population w/ Sunburn Degree 1 at Time Exposed" drops from 96.8% to 72.7%. There is a warning in both cases that it's outside normal range, but it still seems odd that more exposure gives lower risk.

1Josh Jacobson
Indeed, the results for which warnings are thrown should be disregarded; the non-monotonicity of out-of-bounds results is a situation I noticed as well. The authors were quite clear about the equation only being useful in certain conditions, and it does seem to act reliably in those conditions, so I think this is just an out-of-bounds quirk that can be disregarded.

Just to flag I messed up the original calculation and underestimated everything by a factor of 2x, I've added an errata.

I'd also recommend Matt Bell's recent analysis, who estimates 200 days of life lost. This is much higher than the analysis in my comment and the OP. I found the assumptions and sources somewhat pessimistic but ultimately plausible.

The main things driving the difference from my comment were:

  • Uses data from the UK's Office of National Statistics that I'd missed, which has a very high number of 55% of people reporting symptoms after 5 weeks
... (read more)

I should probably argue with Matt directly, but my brief take is that this is just entirely incompatible with what we see on the ground. The friends of mine who got COVID aren't reporting 45% chance of their life being 20% worse. That's... an incredibly massive effect that we would definitely see. Would anyone realistically bet on that?

7Owain_Evans
Bell mentions this paper in Nature Medicine that finds only 2.3% of people having symptoms after 12 weeks. (The UK ONS study that is Bell's main source estimates 13%.) It seems better to take a mean of these estimates than to just drop one of them, as the studies are fairly similar in approach. (Both rely on self-report. The sample size for the Nature paper is >4000.)

Note that the 13% figure in the ONS study drops to 1% if you restrict to subjects who had symptoms every week. (The study allows for people to go a week without any symptoms while still counting as a Long Covid case.) I realize people report Long Covid as varying over time, but it's clearly worse to have a condition that causes some fatigue or tiredness at least once a week rather than at least once every two weeks.

This is a good point, the demographics here are very skewed. I'm not too worried about it overstating risk, simply because the risk ended up looking not that high (at least after adjusting for hospitalization). I think at this point most of us have incurred more than 5 days of costs from COVID restrictions, so if that was really all the cost from COVID, I'd be pretty relaxed.

The gender skew could be an issue, e.g. chronic fatigue syndrome seems to occur at twice the rate in women than men.

This is an accurate summary, thanks! I'll add my calculation was only for long-term sequelae. Including ~10 days cost from acute effects, my all-things-considered view would be mean of ~40 days, corresponding to 1041 uCOVIDs per hour.

This is per actual hour of (quality-adjusted) life expectancy. But given we spend ~1/3rd of our time sleeping, you probably want to value a waking-hour at 1.5x a life-hour (assuming being asleep has neutral valence). If you work a 40 hour work week and only value your productive time (I do not endorse this, by the way), then y... (read more)
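The conversion behind those numbers, for anyone who wants to plug in their own estimate of days lost:

```python
# Convert "quality-adjusted days lost per COVID case" into microCOVIDs per hour.
days_lost_per_case = 40                                 # all-things-considered mean above
hours_lost_per_case = days_lost_per_case * 24           # 960 life-hours
hours_lost_per_ucovid = hours_lost_per_case * 1e-6      # 1 uCOVID = 1-in-a-million case
ucovids_per_life_hour = 1 / hours_lost_per_ucovid
print(round(ucovids_per_life_hour))       # ~1042, consistent with the ~1041 figure above

# Valuing only waking hours (16 of 24): each waking hour is worth 1.5 life-hours,
# so you can "spend" 1.5x as many microCOVIDs per waking hour.
print(round(ucovids_per_life_hour * 24 / 16))   # ~1562
```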

Errata: My original calculation underestimated the risk by a factor of about 2x. I neglected two key considerations, which fortunately somewhat canceled each other out. My new estimate from the calculation is 3.0 to 11.7 quality-adjusted days lost to long-term sequelae, with my all-things-considered mean at 45. 

The two key things I missed:

  - I estimated the risk of a non-hospitalized case is about 10x less than a hospitalized case, and so divided the estimates of disease burden by 10x. The first part is correct, but the second part would only ma... (read more)

2DanielFilan
I'm assuming the all-things-considered mean should be 4.5?
1MichaelStJules
Here's another BOTEC, by Matt Bell:   I think the main differences are using studies with higher excess burdens and using a lower reduction factor to translate to lifelong risk. On the latter:
5AdamGleave
Just to flag I messed up the original calculation and underestimated everything by a factor of 2x, I've added an errata. I'd also recommend Matt Bell's recent analysis, who estimates 200 days of life lost. This is much higher than the analysis in my comment and the OP. I found the assumptions and sources somewhat pessimistic but ultimately plausible. The main things driving the difference from my comment were: * Uses data from the UK's Office of National Statistics that I'd missed, which has a very high number of 55% of people reporting symptoms after 5 weeks, with fairly slow rates of recovery all the way out to 120 days post-infection. Given this is significantly higher than most other studies I've seen, I think Matt is being pessimistic by only down-adjusting to 45%, but I should emphasize these numbers are credible and the ONS study is honestly better than most out there. * Long COVID making your life 20% worse is on the pessimistic end. I put most mild symptoms at 5% worse. Ultimately subjective and highly dependent on what symptoms you get. * I think the difference in hospitalized vs non-hospitalized risk is closer to 10x (based on Al-Aly figure) not Matt's estimate of 2x, that means we should multiply by a factor of ~60% not ~97%.
3Owain_Evans
I quickly skimmed the Al-Aly et al paper. It does look much better than some of the other studies. One concern is the demographics of the patients. Only 25% of people with Covid are younger than 48. Only 12% are female. I'd guess the veterans under 35 are significantly less affluent than LW readers. (Would more affluent veterans use private health care?) At a glance, I can't see results of any regressions on age but it might be worth contacting the authors about this.

How to adjust for this? One thing is just to look at hospitalization risk (see AdamGleave's adjustment point (1)). However, it seems plausible that younger and healthier people would also recover better from less acute cases (and be less likely to have lingering symptoms). OTOH, there's anecdata and data (of less high quality IMO) suggesting that Long Covid doesn't fit the general pattern of exponential increases in badness of Covid (and other similar diseases) with age. Overall, I'd still be inclined to adjust risk down if you are under 35 and healthy.

Demographic info about patients in Al-Aly et al.
6Ben Pace
(I'd personally appreciate you saying how many microcovids you think is equivalent to an hour's time; that's the main number I've been using to figure out whether various costs are worth it.)
AdamGleaveΩ230

I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)

Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You'... (read more)

AdamGleaveΩ230

Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.

Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.

One class of model-based RL approaches is based on [model-predictive control](ht... (read more)
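For concreteness, the simplest member of that class is random-shooting MPC, where the planner repeatedly queries the reward function on predicted future states. A generic sketch, not any particular paper's algorithm:

```python
import numpy as np

def random_shooting_mpc(state, dynamics_model, reward_fn, action_space,
                        horizon=10, n_candidates=1000, rng=None):
    """Pick an action by sampling candidate action sequences, rolling them out
    through the learned dynamics model, and scoring them with the reward
    function evaluated on the *predicted* (possibly never-visited) states."""
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = [rng.choice(action_space) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = dynamics_model(s, a)    # predicted next state
            total += reward_fn(s, a)    # reward queried on a hypothetical state
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```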

5Steven Byrnes
Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can't—it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I'm not intimately familiar with the deep RL literature, I wouldn't know what's typical and I'll take your word for it, but it does seem that both possibilities are out there.

Anyway, I don't think the neocortex can evaluate the true reward function in arbitrary states, because it's not a neat mathematical function, it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards—and that's the whole thing with TD learning and dopamine.

I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you.

Well, the rats are trying to do the rewarding thing after zero samples, so I don't think "sample-efficiency" is quite the right framing.

In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here. Well, I guess you could say that this is still a "normal MDP", but where "having thoughts" and "having ideas" etc. are part of the state / action space.
AdamGleaveΩ8110

I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:

  - Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.

  - Planner concludes that any sequence of actions involving pressing lever wi... (read more)
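Here's a minimal sketch of the toy version described above. It's toy code, obviously not a claim about what brains do; the point is just that a planner able to query the reward function on hypothetical states presses the lever zero-shot when first salt-deprived.

```python
# State: (salt_deprived, salt_water_delivered); actions: press lever or no-op.
ACTIONS = ("press", "noop")

def transition(state, action):
    deprived, _ = state
    return (deprived, action == "press")   # lever => spray of salt water

def reward(state):
    deprived, salt_water = state
    if salt_water:
        return 1.0 if deprived else -1.0   # salt water is great iff deprived
    return 0.0

def plan(state):
    # One-step planner that can query reward() on *hypothetical* next states,
    # including (deprived=True, salt_water=True), which was never experienced.
    return max(ACTIONS, key=lambda a: reward(transition(state, a)))

# Training phase: never salt-deprived, so pressing is always a bad experience.
assert plan((False, False)) == "noop"

# Test phase: first time ever salt-deprived; the planner immediately presses.
assert plan((True, False)) == "press"
print("salt-deprived agent presses the lever zero-shot")
```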

Steven Byrnes*Ω4100

Good question! Sorry I didn't really explain. The missing piece is "the planner will conclude this has positive reward". The planner has no basis for coming up with this conclusion, that I can see.

In typical RL as I understand it, regardless of whether it's model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it doe... (read more)

AdamGleaveΩ8160

Thanks for the post, this is my favourite formalisation of optimisation so far!

One concern I haven't seen raised so far is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed si... (read more)
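To spell out the construction (my notation: $X$ is the configuration space with dynamics $\phi$, $T$ the target set, $B$ its basin of attraction):

$$X' = X \times \{0,1\}^k, \qquad \phi'(x, z) = \big(\phi(x),\, 0^k\big), \qquad T' = T \times \{0^k\}, \qquad B' = B \times \{0,1\}^k$$

so the target set is unchanged in size, while the basin is $2^k$ times larger under the product (counting) measure, and $k$ can be chosen arbitrarily without changing the underlying dynamics on $X$.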

AdamGleaveΩ5150

I feel like there are three facets to "norms" v.s. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me t... (read more)

2Rohin Shah
Yeah, agreed with all of that, thanks for the comment. You could definitely try to figure out each of these things individually, eg. learning constraints that can be used with Constrained Policy Optimization is along the "what not to do" axis, and a lot of the multiagent RL work is looking at how we can get some norms to show up with decentralized training. But I feel a lot more optimistic about research that is trying to do all three things at once, because I think the three aspects do interact with each other. At least, the first two feel very tightly linked, though they probably can be separated from the multiagent setting.

Thanks for the informative post as usual.

Full disclosure: I'm a researcher at UC Berkeley financially supported by CHAI, one of the organisations reviewed in this post. However, this comment is just my personal opinion.

Re: location, I certainly agree that an organization does not need to be in the Bay Area to do great work, but I do think location is important. In particular, there's a significant advantage to working in or near a major AI hub. The Bay Area is one such place (Berkeley, Stanford, Google Brain, OpenAI, FAIR) but not the only one; e... (read more)

2Larks
I definitely agree being near AI hubs is helpful, and I'd be interested in supporting any credible new groups that started in other hubs. Thanks for that extra info on CHAI staff. In general my objections to the bay area are partly about the EA/LW culture there, and partly about the broader culture. I did end up donating to CHAI despite this!

Description of CHAI is pretty accurate. I think it's a particularly good opportunity for people who are considering grad school as a long-term option: we're in an excellent position to help people get into top programs, and you'll also get a sense of what academic research culture is like.

We'd like to hire more than one engineer, and are currently trialling several hires. We have a mixture of work, some of which is more ML oriented and some of which is more infrastructure oriented. So we'd be willing to consider applicants with lim... (read more)

+1 on the last para; it has repeatedly been my experience that the best-qualified candidates for a job were not sure that they were and thought this meant they shouldn't apply, which is quite an unfortunate default decision.