I'm sympathetic to a lot of this critique. I agree that prospective students should strive to find an advisor who is "good at producing clear, honest and high-quality research while acting in high-integrity ways around their colleagues". There are enough of these that you should be able to find one, and it doesn't seem worth compromising.
Concretely, I'd definitely recommend digging into an advisor's research and asking their students hard questions prior to taking any particular PhD offer. There absolutely are labs that prioritize publishing above all els...
Thanks for the post, Ryan -- I agree that, given the difficulty of making models actually meaningfully robust, the best near-term solution to misuse is going to be a defence-in-depth approach consisting of filtering the pre-training data, input filtering, output filtering, automatic and manual monitoring, KYC checks, etc.
At some point, though, we'll need to grapple with what to do about models that are superhuman in some domains related to WMD development, cybercrime or other potential misuses. There are glimmers of this already here, e.g. my impressi...
It can definitely be worth spending money when there's a clear case for it improving employee productivity. I will note there are a range of both norms and physical layouts compatible with open-plan offices, ranging from the "everyone screaming at each other and in line of sight" trading floor to "no-talking library vibes, desks facing walls with blinders". We've tried zoning different open-plan spaces with different norms, and this has been fairly successful, although I'm sure some people will still be disturbed by even library-style areas and be more productive in a private office.
When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people'...
Thanks for flagging this disagreement Ryan. I enjoyed our earlier conversation (on LessWrong and in-person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. Would love to catch up again sometime and see if we can converge more on this. I'll try and summarize my current take and our key disagreements for the benefit of other readers.
I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifi...
Thanks, that's a good link. In our case our assets significantly exceed the FDIC $250k insurance limit and there are operational costs to splitting assets across a large number of banks. But a high-interest checking account could be a good option for many small orgs.
...Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu's anti-circle training sounds a lot like you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezin
This matches my impression. FAR could definitely use more funding. Although I'd still at the margin rather hire someone above our bar than e.g. have them earn-to-give and donate to us, the math is getting a lot closer than it used to be, to the point where those with excellent earning potential and limited fit for AI safety might well have more impact pursuing a philanthropic pathway.
I'd also highlight there's a serious lack of diversity in funding. As others in the thread have mentioned, the majority of people's funding comes (directly or indirectly) from...
I still don't understand which of (1), (2), or (3) you're most worried about.
Sample efficiency isn't the main way I think about this topic so it's a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it'd be (3).
I tend to view ML training as a model taking a path through a space of possible programs. There are some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don't do anything parti...
Oh, we're using terminology quite differently then. I would not call (a) reward hacking, as I view the reward model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into the reward model's learning process). I don't especially care about what definitions we use here, but do wonder if this means we're speaking past each other in other areas as well.
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It's always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the "environment" is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
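(For concreteness, the objective I have in mind in the fine-tuning case is the usual KL-regularized one, with $\pi_{\mathrm{ref}}$ the pre-trained model and $r$ the learned reward model; if I recall correctly this is also the starting point for the DPO derivation:

$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[\, r(x, y) \;-\; \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \,\right]$$

)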
I'll need to take a closer look at the paper, but it looks like they derive the DPO objective by ...
Thanks for the follow-up, this helps me understand your view!
At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.
Sure, this feels basically right to me. My reframing of this would be that we could in principle do RL directly with feedback provided b...
To check my understanding, is your view something like:
1. If the reward model isn't adversarially robust, then the RL component of RLHF will exploit it.
2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, the human feedback will provide a corrective signal to the reward model.
3. The reward model will stop being vulnerable to those adversarial examples, although it may still be vulnerable to other adversarial examples.
4. If we repeat this iterative process enough times,...
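In caricature, I'm imagining something like the following (toy code, every detail here is made up purely for illustration):

```python
# Toy caricature of steps 1-4: "generations" are the integers 0..9,
# and the human robustly prefers generation 5.
TRUE_BEST = 5

def human_label(y):
    # The human is assumed adversarially robust: their label is always accurate.
    return -abs(y - TRUE_BEST)

reward_data = {}  # labels the reward model has been trained on so far

def reward_model(y):
    # Unlabelled generations get an over-optimistic score; these play the role of
    # the adversarial examples that the RL step will exploit.
    return reward_data.get(y, 1.0)

for step in range(12):
    exploit = max(range(10), key=reward_model)   # 1. RL exploits the reward model
    reward_data[exploit] = human_label(exploit)  # 2-3. human feedback corrects it
    print(step, exploit, reward_data[exploit])
# 4. Once every exploit has been corrected, the argmax settles on the generation
#    the human actually prefers (5).
```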
Yes, thanks for spotting my typo! ($2.75 psf isn't crazy for Berkeley after negotiation, but is not something I've ever seen as a list price.)
To compare this to other costs, renting two floors of the WeWork, which we did for most of the summer last year, cost around $1.2M/yr for 14,000 sq. ft. of office space. The Rose Garden has 20,000 sq. ft. of floor space and 20,000 additional sq. ft. of usable outdoor space for less implied annual cost than that.
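(Back-of-the-envelope, for the per-square-foot comparison, treating the $2.75 figure above as a monthly rate:

$$\$1.2\text{M/yr} \div 14{,}000\ \text{sq. ft.} \approx \$86/\text{sq. ft./yr} \approx \$7.1/\text{sq. ft./month},$$

i.e. roughly 2.6x the ~\$2.75 psf figure discussed above, before counting the outdoor space.)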
I'm sympathetic to the high-level claim that owning property usually beats renting if you're committing for a long time period. But the comparison with WeWork seems odd: WeWork specializes in providing short-term, serviced office space and does so a...
This is a good point: adversarial examples in what I called the "main" ML system in the post can be desirable, even though we typically don't want them in the "helper" ML systems used to align the main system.
One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that ...
Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. It's not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing a bunch of cameras would be punished by that loss (not something humans do) and changing human behavior wouldn't help, as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most of the oth...
Thanks, I'd missed that!
Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?
Also could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training but I could imagine this just being used for evals.
Rachel did the bulk of the work on this post (well done!); I just provided some advice on the project and feedback on earlier manuscripts.
I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.
We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL has been proposed as a solution to this. Indeed, Dylan Hadfield-Menell et al. show that CIRL is provably corrigible in a simple setting, the off-switch game.
Provably corrigible sounds great, but where there's a proof there...
I'm excited by many of the interventions you describe, but largely for reasons other than buying time. I'd expect buying time to be quite hard, insofar as it requires coordinating to get many actors to stop doing something they're incentivized to do; whereas the alignment research community is small, so doubling it is relatively easy. Ultimately, it's a point in favor of the interventions that they look promising under multiple worldviews, though it might lead me to prioritize within them differently to you.
One area I would push back on is t...
I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say) but quite worried that society may be slow enough to adapt that even years of gradual progress, with a clear sign that transformative AI is on the horizon, may be insufficient.
In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the...
Thanks for this response, I'm glad to see more public debate on this!
The part of Katja's part C that I found most compelling was the argument that it might be in a given AI system's best interests to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. Bu...
Thanks for the quick reply! I definitely don't feel confident in the 20W number, I could believe 13W is true for more energy efficient (small) humans, in which case I agree your claim ends up being true some of the time (but as you say, there's little wiggle room). Changing it to 1000x seems like a good solution though which gives you plenty of margin for error.
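(Spelling out the arithmetic with rough assumed numbers, electricity at \$0.10-\$0.15/kWh and a ~\$15/hour minimum wage:

$$10{,}000 \times 20\,\mathrm{W} = 200\,\mathrm{kW} \;\Rightarrow\; \$20\text{-}\$30/\text{hour}, \qquad 1{,}000 \times 20\,\mathrm{W} = 20\,\mathrm{kW} \;\Rightarrow\; \$2\text{-}\$3/\text{hour},$$

so the 10,000x version sits right around or above minimum wage, while the 1,000x version clears it comfortably.)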
This is a nitpick, but I don't think this claim is quite right (emphasis added):
If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!
First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named Appraising the brain's energy bu...
"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations are harder so I'd have expected scale to increase not decrease EV answers.) Follow-up questions...
It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.
A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for then later formalizing it and studying it more rigo...
Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn't; this is both because it can be picked up by the ML community, and because it's much harder to tell if you are making meaningful progress if your work doesn't meet these standards of rigor.
Strong agreement with this! I'm frequently told by people that you "cannot publish" on a certain area, but in my experience this is rarely true...
Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.
It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.
I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptua...
I liked this post and think it'll serve as a useful reference point; I'll definitely send it to people who are new to the alignment field.
But I think it needs a major caveat added. As a survey of alignment researchers who regularly post on LessWrong or interact closely with that community, it does a fine job. But as capybaralet already pointed out, it misses many academic groups. And even some major industry groups are de-emphasized. For example, DeepMind alignment is 20+ people, and has been around for many years. But it's got if anything a slightly less ...
One omission from the list is the Fund for Alignment Research (FAR), which I'm a board member of. That's fair enough: FAR is fairly young, and doesn't have a research agenda per se, so it'd be hard to summarize their work from the outside! But I thought it might be of interest to readers, so I figured I'd give a quick summary here.
FAR's theory of change is to incubate new, scalable alignment research agendas. Right now I see a small range of agendas being pursued at scale (largely RLHF and interpretability), then a long tail of very diverse agendas being p...
A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.
The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
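A minimal sketch of that filtering, assuming the metadata layout from the group DRO release of Waterbirds (a metadata.csv with y for bird type, place for background, and split for train/val/test); the column names and path here are assumptions to adapt to your copy:

```python
import pandas as pd

# Keep only training examples where bird type matches background, making the
# spurious correlation perfect (as in HappyFaces) while leaving val/test intact.
meta = pd.read_csv("waterbirds/metadata.csv")   # assumed path

train = meta[meta["split"] == 0]                # 0 = train in the standard metadata
perfect = train[train["y"] == train["place"]]   # waterbird-on-water, landbird-on-land

filtered = pd.concat([perfect, meta[meta["split"] != 0]])
filtered.to_csv("waterbirds/metadata_perfect_correlation.csv", index=False)
```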
A lot of this argument seems to rest on the training-inference gap, allowing a very large population of AIs to exist at the same cost as training. In that way they can be a formidable group even if the individual AIs are only human-level. I was suspicious of this at first, but I found myself largely coming round to it after sanity-checking it using a slightly different method than biological anchors. However, if I understand correctly, the biological anchors framework implies the gap between training and inference grows with capabilities. My projection i...
I did actually mean 45: in "all-things-considered" I was including uncertainty about whether my toy model was accurate. Since it's a right-tailed distribution, my model can underestimate the true amount a lot more than it can overestimate it.
For what it's worth, my all-things-considered view for Delta is now more like 30, as I've not really seen anything all that compelling for long COVID being much worse than in the model. I'm not sure about Omicron; it seems to be less virulent, but also to have more vaccine escape. Somewhere in the 15-90 day range sounds right to me; I've not thought enough to pin it down precisely.
My sense is that Stuart's assumption of an initially specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing to a reward function learned from other sources of human feedback, such as preference comparison.
IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.
It could be net-negative if receiving a booster shot caused stronger imprinting, making future immune response less adaptive. I don't have a good sense of whether this original antigenic sin effect has already saturated after receiving two doses (or even a single dose), or whether it continues to become stronger.
My sense is this is an open question. From Petras et al (2021):
...As suggested by a recent observation in naturally immunized individuals receiving two doses of the Pfizer COVID-19 (Comirnaty) vaccine, original antigenic sin may pose a problem in fu
I largely agree with this analysis. One major possible "side-effect" of a third booster is original antigenic sin. Effectively, the immune system may become imprinted on the ancestral variant of the spike protein, preventing adaptation to new variants (whether via direct exposure or via future boosters targeting new variants). This would be the main way I could see a third booster being seriously net-negative, although I don't have a good sense of the probability. Still, if antibody levels are low, the benefit of a booster is greater and I'd guess (caveat: not a...
Thanks for sharing this! I did notice a weird non-monotonicity: if I go from 90 minutes exposure to 120 minutes, the "Percent of Population w/ Sunburn Degree 1 at Time Exposed" drops from 96.8% to 72.7%. There is a warning in both cases that it's outside normal range, but it still seems odd that more exposure gives lower risk.
Just to flag I messed up the original calculation and underestimated everything by a factor of 2x, I've added an errata.
I'd also recommend Matt Bell's recent analysis, which estimates 200 days of life lost. This is much higher than the analysis in my comment and the OP. I found the assumptions and sources somewhat pessimistic but ultimately plausible.
The main things driving the difference from my comment were:
I should probably argue with Matt directly, but my brief take is that this is just entirely incompatible with what we see on the ground. The friends of mine who got COVID aren't reporting 45% chance of their life being 20% worse. That's... an incredibly massive effect that we would definitely see. Would anyone realistically bet on that?
This is a good point, the demographics here are very skewed. I'm not too worried about it overstating risk, simply because the risk ended up looking not that high (at least after adjusting for hospitalization). I think at this point most of us have incurred more than 5 days of costs from COVID restrictions, so if that was really all the cost from COVID, I'd be pretty relaxed.
The gender skew could be an issue; e.g. chronic fatigue syndrome seems to occur at twice the rate in women as in men.
This is an accurate summary, thanks! I'll add my calculation was only for long-term sequelae. Including ~10 days cost from acute effects, my all-things-considered view would be mean of ~40 days, corresponding to 1041 uCOVIDs per hour.
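(To spell out where the 1041 comes from:

$$40\ \text{days} \times 24\ \text{h/day} = 960\ \text{h}, \qquad \frac{1\ \text{h}}{960\ \text{h} \times 10^{-6}} \approx 1041,$$

i.e. one microCOVID costs $960 \times 10^{-6}$ hours in expectation, so roughly 1041 of them cost one hour of life expectancy.)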
This is per actual hour of (quality-adjusted) life expectancy. But given we spend ~1/3rd of our time sleeping, you probably want to value a waking-hour at 1.5x a life-hour (assuming being asleep has neutral valence). If you work a 40 hour work week and only value your productive time (I do not endorse this, by the way), then y...
Errata: My original calculation underestimated the risk by a factor of about 2x. I neglected two key considerations, which fortunately somewhat canceled each other out. My new estimate from the calculation is 3.0 to 11.7 quality-adjusted days lost to long-term sequelae, with my all-things-considered mean at 45.
The two key things I missed:
- I estimated the risk of a non-hospitalized case is about 10x less than a hospitalized case, and so divided the estimates of disease burden by 10x. The first part is correct, but the second part would only ma...
I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)
Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You'...
Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.
Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.
One class of model-based RL approaches is based on [model-predictive control](ht...
I'm a bit confused by the intro saying that RL can't do this, especially since you later say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt-deprived or not; salt water present or not) and two actions (press lever or no-op). The idea would be as follows (rough code sketch below):
- Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.
- Planner concludes that any sequence of actions involving pressing lever wi...
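A rough code sketch of the toy setup I have in mind (all details hypothetical; the key assumption is that the planner can evaluate the reward function in the novel salt-deprived state):

```python
# Toy salt-deprivation task: state = (salt_deprived, salt_water_present).
import itertools

ACTIONS = ["press_lever", "noop"]

def transition(state, action):
    salt_deprived, _ = state
    # Learned dynamics: pressing the lever produces a spray of salt water.
    return (salt_deprived, 1 if action == "press_lever" else 0)

def reward(state):
    salt_deprived, salt_water = state
    # Salt water is rewarding when salt-deprived, aversive otherwise.
    if salt_water:
        return 1.0 if salt_deprived else -1.0
    return 0.0

def plan(state, horizon=2):
    # Brute-force planner: pick the action sequence with the highest predicted
    # return, assuming it can query reward() in states it has never been rewarded in.
    best_seq, best_ret = None, float("-inf")
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, ret = state, 0.0
        for a in seq:
            s = transition(s, a)
            ret += reward(s)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq, best_ret

# Once salt-deprived, the planner immediately chooses to press the lever,
# even though it has only ever pressed it while sated.
print(plan((1, 0)))  # state = (salt_deprived=1, salt_water_present=0)
```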
Good question! Sorry I didn't really explain. The missing piece is "the planner will conclude this has positive reward". The planner has no basis for coming up with this conclusion, that I can see.
In typical RL as I understand it, regardless of whether it's model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it doe...
Thanks for the post, this is my favourite formalisation of optimisation so far!
One concern I haven't seen raised so far is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed si...
I feel like there are three facets to "norms" vs. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights; indeed, most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me t...
Thanks for the informative post as usual.
Full-disclosure: I'm a researcher at UC Berkeley financially supported by CHAI, one of the organisations reviewed in this post. However, this comment is just my personal opinion.
Re: location, I certainly agree that an organization does not need to be in the Bay Area to do great work, but I do think location is important. In particular, there's a significant advantage to working in or near a major AI hub. The Bay Area is one such place (Berkeley, Stanford, Google Brain, OpenAI, FAIR) but not the only one; e...
The description of CHAI is pretty accurate. I think it's a particularly good opportunity for people who are considering grad school as a long-term option: we're in an excellent position to help people get into top programs, and you'll also get a sense of what academic research culture is like.
We'd like to hire more than one engineer, and are currently trialling several hires. We have a mixture of work, some of which is more ML oriented and some of which is more infrastructure oriented. So we'd be willing to consider applicants with lim...
+1 on the last para: it has repeatedly been my experience that the best-qualified candidates for a job were not sure that they were, and thought this meant they shouldn't apply, which is quite an unfortunate default decision.
Whether a PhD is something someone will enjoy is so dependent on individual personality, advisor fit, etc. that I don't feel I can offer good generalized advice. Generally, I'd suggest that people trying to gauge fit do some research in an academic environment (e.g. an undergrad/MS thesis, or a brief RA stint after graduating) and talk to PhD students at their target schools. If after that you think you wouldn't enjoy a PhD, then you're probably right!
Personally I enjoyed my PhD. I had smart & interesting colleagues, an advisor who wanted me to do high-qua...