All of abergal's Comments + Replies

abergal30

Thanks for writing this up-- speaking for myself, I agree with the majority of it, and it articulates some important parts of how I live my life that I hadn't previously made explicit.

Answer by abergal10

I think your first point basically covers why-- people are worried about alignment difficulties in superhuman systems, in particular (because those are the dangerous systems which can cause existential failures). I think a lot of current RLHF work is focused on providing reward signals to current systems in ways that don't directly address the problem of "how do we reward systems for behaviors whose consequences are too complicated for humans to understand".

Chris Olah wrote this topic prompt (with some feedback from me (Asya) and Nick Beckstead). We didn’t want to commit him to being responsible for this post or responding to comments on it, so we submitted this on his behalf. (I've changed the by-line to be more explicit about this.)

2ESRogs
Ah, got it. Thanks!
abergalΩ230

Thanks for writing this! Would "fine-tune on some downstream task and measure the accuracy on that task before and after fine-tuning" count as measuring misalignment as you're imagining it? My sense is that there might be a bunch of existing work like that.

1William_S
I don't think all work of that form would measure misalignment, but some work of that form might. Here's a description of some stuff in that space that would count as measuring misalignment. Let A be some task (e.g. add 1-digit numbers), B be a task that is downstream of A (to do B, you need to be able to do A, e.g. add 3-digit numbers), M be the original model, and M1 be the model after fine-tuning. If the training on the downstream task was minimal, so we think it's revealing what the model knew before fine-tuning rather than adding new knowledge, then better performance of M1 than M on A would demonstrate misalignment (I don't have a precise definition of what would make fine-tuning minimal in this way; it would be good to have clearer criteria for that). If M1 does better on B after fine-tuning in a way that implicitly demonstrates better knowledge of A, but does not do better on A when asked to do it explicitly, that would demonstrate that the fine-tuned M1 is misaligned. (I think we might expect some version of this to happen by default, though, since M1 might overfit to only doing tasks of type B. Maybe if you have a training procedure where M1 generally doesn't get worse at any tasks, then I might hope that it would get better on A and be disappointed if it doesn't.)
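A minimal pseudocode-style sketch of the comparison described above (illustrative framing only; `evaluate` and `finetune` are assumed helpers, not a real API):

```python
# Sketch of the before/after fine-tuning comparison described above.
# `evaluate(model, task)` returns accuracy; `finetune(model, task)` returns a
# new model after minimal fine-tuning. Both are assumed helpers, not a real API.
def misalignment_evidence(M, task_A, task_B, evaluate, finetune):
    acc_A_before = evaluate(M, task_A)
    acc_B_before = evaluate(M, task_B)

    M1 = finetune(M, task_B)  # fine-tune only on the downstream task B

    acc_A_after = evaluate(M1, task_A)
    acc_B_after = evaluate(M1, task_B)

    # Case 1: minimal fine-tuning on B improves explicit performance on A,
    # suggesting M already "knew" A but wasn't displaying it.
    evidence_M_misaligned = acc_A_after > acc_A_before

    # Case 2: M1 gets better at B (which requires A) but not at A when asked
    # explicitly, suggesting the fine-tuned M1 is misaligned.
    evidence_M1_misaligned = (acc_B_after > acc_B_before) and not evidence_M_misaligned

    return evidence_M_misaligned, evidence_M1_misaligned
```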
abergalΩ230

This RFP is an experiment for us, and we don't yet know if we'll be doing more of them in the future. I think we'd be open to including research directions that we think are promising and that apply equally well to both DL and non-DL systems-- I'd be interested in hearing any particular suggestions you have.

(We'd also be happy to fund particular proposals in the research directions we've already listed that apply to both DL and non-DL systems, though we will be evaluating them on how well they address the DL-focused challenges we've presented.)

2RyanCarey
I imagine you could catch useful work with i) models of AI safety, or ii) analysis of failure modes, or something, though I'm obviously biased here.
abergalΩ120

Getting feedback in the next week would be ideal; September 15th will probably be too late.

Different request for proposals!

abergal*Ω13210

Thank you so much for writing this! I've been confused about this terminology for a while and I really like your reframing.

An additional terminological point that I think it would be good to solidify is what people mean when they refer to "inner alignment" failures. As you allude to, my impression is that some people use it to refer to objective robustness failures, broadly, whereas others (e.g. Evan) use it to refer to failures that involve mesa optimization. There is then additional confusion around whether we should think "inner alignment" failures that ... (read more)

abergal650

I feel pretty bad about both of your current top two choices (Bellingham or Peekskill) because they seem too far from major cities. I worry this distance will seriously hamper your ability to hire good people, which is arguably the most important thing MIRI needs to be able to do. [Speaking personally, not on behalf of Open Philanthropy.]

RyanCarey*130

I think moving to the country could possibly be justified despite harms to recruitment and the rationality community, but in the official MIRI explanations, the downsides are quite underdiscussed.

To expand on this a bit, I think that people with working partners would be the group most likely to be deterred from working at MIRI if it was in either Bellingham or Peekskill. The two-body problem can be a serious constraint, and large metro areas tend to be much easier to find two jobs in. That may be getting better with the rise of remote work, but I do think it's worth keeping in mind. 

Announcement: "How much hardware will we need to create AGI?" was actually inspired by a conversation I had with Ronny Fernandez and forgot about, credit goes to him for the original idea of using 'the weights of random objects' as a reference class.

https://i.imgflip.com/1xvnfi.jpg

I think it would be kind of cool if LessWrong had built-in support for newsletters. I would love to see more newsletters about various tech developments, etc. from LessWrongers.

2Yoav Ravid
A first step could be an option to subscribe to sequences
abergal*Ω340

Planned summary for the Alignment Newsletter:

This post describes the author’s insights from extrapolating the performance of GPT on the benchmarks presented in the <@GPT-3 paper@>(@Language Models are Few-Shot Learners@). The author compares cross-entropy loss (which measures how good a model is at predicting the next token) with benchmark performance normalized to the difference between random performance and the maximum possible performance. Since <@previous work@>(@Scaling Laws for Neural Language Models@) has shown that cross-entropy loss s

... (read more)
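As a rough formalization of the normalization mentioned in the summary above (my notation, not from the newsletter or the post): if a benchmark has raw score $s$, random-chance score $s_{\text{random}}$, and maximum possible score $s_{\text{max}}$, then

$$\text{normalized performance} = \frac{s - s_{\text{random}}}{s_{\text{max}} - s_{\text{random}}},$$

so random performance maps to 0 and maximum performance maps to 1.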
abergalΩ11250

AI Impacts now has a 2020 review page so it's easier to tell what we've done this year-- this should be more complete / representative than the posts listed above. (I appreciate how annoying the continuously updating wiki model is.)

2Larks
Thanks, added.
abergal*Ω6100

From Part 4 of the report:

Nonetheless, this cursory examination makes me believe that it’s fairly unlikely that my current estimates are off by several orders of magnitude. If the amount of computation required to train a transformative model were (say) ~10 OOM larger than my estimates, that would imply that current ML models should be nowhere near the abilities of even small insects such as fruit flies (whose brains are 100 times smaller than bee brains). On the other hand, if the amount of computation required to train a transformative model were
... (read more)
1Ajeya Cotra
Yes, it's assuming the scaling behavior follows the probability distributions laid out in Part 2, and then asking whether conditional on that the model size requirements could be off by a large amount. 
abergal*Ω9160

So exciting that this is finally out!!!

I haven't gotten a chance to play with the models yet, but thought it might be worth noting the ways I would change the inputs (though I haven't thought about it very carefully):

  • I think I have a lot more uncertainty about neural net inference FLOP/s vs. brain FLOP/s, especially given that the brain is significantly more interconnected than the average 2020 neural net-- probably closer to 3 - 5 OOM standard deviation.
  • I think I also have a bunch of uncertainty about algorithmic efficiency progress-- I could im
... (read more)
1Ajeya Cotra
Thanks! I definitely agree that the proper modeling technique would involve introducing uncertainty on algorithmic progress, and that this uncertainty would be pretty wide; this is one of the most important few directions of future research (the others being better understanding effective horizon length and better narrowing model size). In terms of uncertainty in model size, I personally find it somewhat easier to think about what the final spread should be in the training FLOP requirements distribution, since there's a fair amount of arbitrariness in how the uncertainty is apportioned between model size and scaling behavior. There's also semantic uncertainty about what it means to "condition on the hypothesis that X is the best anchor." If we're living in the world of "brain FLOP/s anchor + normal scaling behavior", then assigning a lot of weight to really small model sizes would wind up "in the territory" of the Lifetime Anchor hypothesis, and assigning a lot of weight to really large model sizes would wind up "in the territory" of the Evolution Anchor hypothesis, or go beyond the Evolution Anchor hypothesis.  I was roughly aiming for +- 5 OOM uncertainty in training FLOP requirements on top of the anchor distribution, and then apportioned uncertainty between model size and scaling behavior based on which one seemed more uncertain.
abergalΩ5110

I'm a bit confused about this as a piece of evidence-- naively, it seems to me like not carrying the 1 would be a mistake that you would make if you had memorized the pattern for single-digit arithmetic and were just repeating it across the number. I'm not sure if this counts as "memorizing a table" or not.
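To make this concrete, here is a hypothetical illustration (not an example from the paper) of what "applying the memorized single-digit pattern column by column and forgetting the carry" would produce:

```python
# Hypothetical illustration (not from the GPT-3 paper): digit-wise addition
# using only the memorized single-digit table, dropping every carry.
def add_without_carry(a: int, b: int) -> int:
    x, y = str(a), str(b)
    width = max(len(x), len(y))
    x, y = x.zfill(width), y.zfill(width)
    # Keep only the last digit of each column sum, i.e. forget the carried 1.
    return int("".join(str((int(d1) + int(d2)) % 10) for d1, d2 in zip(x, y)))

print(add_without_carry(27, 45))  # prints 62, whereas the true sum is 72
```

An error pattern like 27 + 45 = 62 (rather than a random wrong answer) is the signature I'm pointing at.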

2Daniel Kokotajlo
Excellent point! Well, they do get the answer right some of the time... it would be interesting to see how often they "remember" to carry the one vs. how often they "forget." It looks like the biggest model got basically 100% correct on 2-digit addition, so it seems that they mostly "remember."
Answer by abergal110

This recent post by OpenAI is trying to shed some light on this question: https://openai.com/blog/ai-and-efficiency/

2ESRogs
LW linkpost here.

I really like this post.

Self-driving cars are currently illegal, I assume largely because of these unresolved tail risks. But even setting aside legality, I'm not sure their economic value is zero-- I could imagine cases where people would use self-driving cars if they wouldn't be caught doing it. Does this seem right to people?

Intuitively it doesn't seem like economic value tails and risk tails should necessarily go together, which makes me concerned about cases similar to self-driving cars that are harder to regulate legally.

2johnswentworth
Rather than people straight-up ignoring the risks, I imagine things like cruise control or automatic emergency braking; these are example self-driving use-cases which don't require solving all the tail risks. The economic value of marginal improvements is not zero, although it's nowhere near the value of giving every worker in the country an extra hour every weekday (roughly the average commute time). Totally agree with this. I do think that when we know some area has lots of tail risk, we tend to set up regulation/liability, which turns the risk tail into an economic value tail. That's largely the point of (idealized) liability law: to turn risks directly into (negative) value for someone capable of mitigating the risks. But there's plenty of cases where risk tails and value tails won't go together:
  • Cases where there's a positive value tail without any particular risks involved.
  • Cases where we don't know there's a risk tail.
  • Cases where liability law sucks. (Insert punchline here.)
I don't think self-driving cars are actually a hard case here, they're just a case which has to be handled by liability law (i.e. lawsuits post-facto) rather than regulatory law (i.e. banning things entirely).

What's the corresponding story here for trading bots? Are they designed in a sufficiently high-assurance way that new tail problems don't come up, or do they not operate in the tails?

4johnswentworth
Great question. Let's talk about Knight Capital. Ten years ago, Knight Capital was the largest high-frequency trader in US equities. On August 1 2012, somebody deployed a bug. Knight's testing platform included a component which generated random orders and sent them to a simulated market; somebody accidentally hooked that up to the real market. It's exactly the sort of error testing won't catch, because it was a change outside of the things-which-are-tested; it was partly an error in deployment, and partly code which did not handle partial deployment. The problem was fixed about 45 minutes later. That was the end of Knight Capital. So yes, trading bots definitely operate in the tails. When the Knight bug happened, I was interning at the largest high-frequency trading company in US options. Even before that, the company was more religious about thorough testing than any other I've worked at. Everybody knew that one bug could end us, Knight was just a reminder (specifically a reminder to handle partial deployment properly).

I rewrote the question-- I think I meant 'counterfactual' in that this isn't a super promising idea if in fact we are just taking medical supplies from one group of people and transferring them to another.

I don't know anything about maintenance/cleaning; I was thinking it would be particularly useful if we straight up run out of ICU space-- i.e., there is no option of going to an ICU. (Maybe this is a super unlikely class of scenarios?)

You're totally not obligated to do this, but I think it might be cool if you generated a 3D picture of hills representing your loss function-- I think it would make the intuition for what's going on clearer.
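In case it's useful, here's a minimal sketch of the kind of plot I have in mind (a toy two-parameter loss surface stands in for the actual loss function from the post):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in for the post's loss function; replace with the real one.
def toy_loss(w1, w2):
    return np.sin(3 * w1) * np.cos(3 * w2) + w1**2 + w2**2

# Evaluate the toy loss on a grid and render it as 3D "hills".
w1, w2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(w1, w2, toy_loss(w1, w2), cmap="viridis")
ax.set_xlabel("parameter 1")
ax.set_ylabel("parameter 2")
ax.set_zlabel("loss")
plt.show()
```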

We're not going to do this because we weren't planning on making these public when we conducted the conversations, so we want to give people a chance to make edits to transcripts before we send them out (which we can't do with audio).

5MichaelA
That makes sense. Though if AI Impacts does more conversations like these in future, I’d be very interested in listening to them via a podcast app.

The claim is a personal impression that I have from conversations, largely with people concerned about AI risk in the Bay Area. (I also don't like information cascades, and may edit the post to reflect this qualification.) I'd be interested in data on this.