OpenAI did this too, with GPT-4 pre-release. It was a small program, though — I think just 5-10 researchers.
I'd be surprised if this was employee-level access. I'm aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
This is so interesting. I had no idea that this was a thing! I would have assumed that test-writers wrote all of the answers out, then used a (pseudo-)randomizer to order them. But if that really is a pattern in multiple choice tests, it makes absolute sense that Llama would pick up on it.
I suspect that if you ask the model to reconsider its answer, it would double down even on the incorrect (B-biased) responses. LLMs really like being self-consistent. We haven’t run this experiment, but if you do, let us know the result!
If I understand correctly, your proposed fix is something like supervised finetuning on adversarial examples that trigger the B-bias. We can access the output logits directly (replacing step 1) and the ground-truth answer is provided in the dataset (removing the need for step 2), so this seems relatively doable.
The main cha...
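To make that concrete, here's a rough sketch of the kind of finetuning step I have in mind (purely illustrative: the model name, the A/B/C/D prompt format, and the hyperparameters are placeholder assumptions, not what was actually used). It reads the next-token logits for each answer letter directly (standing in for step 1) and applies cross-entropy against the dataset's ground-truth letter (standing in for step 2), training only on the adversarial examples that trigger the B-bias.

```python
# Rough sketch, not actual code: read answer-option logits straight from the
# model and finetune with cross-entropy against the dataset's ground-truth letter.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder model
OPTIONS = ["A", "B", "C", "D"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# Token id of each answer letter (assuming the prompt ends right before the answer).
option_ids = [tokenizer(" " + o, add_special_tokens=False).input_ids[-1] for o in OPTIONS]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def debias_step(prompt: str, correct_idx: int) -> float:
    """One finetuning step on an adversarial example that triggers the B-bias."""
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]    # logits, no sampling needed
    option_logits = next_token_logits[option_ids]        # scores for A/B/C/D only
    loss = F.cross_entropy(option_logits.unsqueeze(0),   # ground truth from the dataset
                           torch.tensor([correct_idx]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

(Whether training only on the B-biased examples hurts accuracy on ordinary questions is exactly the kind of regression you'd want to check.)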
I’d be interested to see this as well!
Thank you for such a detailed and thorough answer! This resolves a lot of my confusion.
Based on conversations around closing the WeWork Lightcone office, I had assumed that you didn't want to continue hosting office space, and so hadn't considered that counterfactual cost. But the Inn expenses you mention seem more reasonable if the alternative is continuing to rent WeWork space.

The FTX context also makes a lot of sense. I was confused about how the purchase fit into your current strategy and funding situation, but I understand that both of those were quite diff...
These all sound like major benefits to owning the venue yourself!
To be clear, I don't doubt at all that using the Inn for events is much better than non-purpose-built space. However, the Inn also has costs that renting existing spaces wouldn't: I assume that purchasing and renovating it costs more than renting hotel spaces as-needed for events (though please correct me if I'm wrong!), and my impression is that it's taken the Lightcone team a lot of time and effort over the past year+ to purchase and renovate, which naturally has opportunity costs.
I'm askin...
Will much of that $3-6M go into renovating and managing the Rose Garden Inn, or to cover work that could have been covered by existing funding if the Inn wasn't purchased?
If so, I'm curious to hear more about the strategy behind buying and renovating the space, since it seems like a substantial capital investment, and a divergence from Lightcone Infrastructure's previous work and areas of expertise. I'm aware that several (primarily social?) events were held there over the past year, and I see from an earlier comment that you're planning to host SERI MATS ...
Will much of that $3-6M go into renovating and managing the Rose Garden Inn, or to cover work that could have been covered by existing funding if the Inn wasn't purchased?
Thinking about the exact financing of the Inn is a bit messy, especially if we compare it to doing something like running the Lightcone Offices, because of stuff like property appreciation, rental income from people hosting events here, and the hard-to-quantify costs of tying up capital in real estate as opposed to more liquid assets like stocks.
If you assume something like 5% property ap...
Here are some things I like about owning this space:
I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach this. One is to do cognitive science research to build increasingly accurate human models, or to try to just learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or doing something other than Bayesianism that doesn't rely on a likelihood model. I’m more sympathetic to the l...
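As a toy illustration of what "maintaining uncertainty over possible human models" could look like (my own sketch, not any particular paper's method): model the human as Boltzmann-rational with an unknown rationality coefficient beta, keep a joint posterior over (reward function, beta), and marginalize beta out rather than committing to a single human model.

```python
# Toy illustration: infer the human's reward while staying uncertain about *how*
# the human chooses (the Boltzmann rationality coefficient beta).
import numpy as np

# Two candidate reward functions over two actions (a0, a1).
rewards = {"R1": np.array([1.0, 0.0]), "R2": np.array([0.0, 1.0])}
betas = np.array([0.1, 1.0, 10.0])  # candidate human models, from very noisy to near-optimal

# Joint belief over (reward function, human model), starting uniform.
posterior = np.ones((len(rewards), len(betas))) / (len(rewards) * len(betas))

def choice_prob(action, reward_values, beta):
    """P(human picks `action`) under a Boltzmann-rational choice model."""
    logits = beta * reward_values
    probs = np.exp(logits - logits.max())
    return (probs / probs.sum())[action]

# The human mostly picks a0, with occasional "noise".
for action in [0, 0, 1, 0]:
    for i, reward_values in enumerate(rewards.values()):
        for j, beta in enumerate(betas):
            posterior[i, j] *= choice_prob(action, reward_values, beta)
    posterior /= posterior.sum()

# Marginalize out the human model: a belief over rewards that never committed to one beta.
print(dict(zip(rewards, posterior.sum(axis=1))))
```

This doesn't address the case where the true human lies outside the model class entirely, which is the misspecification problem you're pointing at; it just avoids betting everything on a single human model.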
Thanks for the clarification! From OpenAI's announcement, it looks like this ranking only occurs during the finetuning portion of training (Step 2). But the user doesn't have the opportunity to provide this feedback after deployment. So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users? I'm asking because one of the key benefits of CIRL games (also called "assistance games") is that they allow the AI to continuously upd...
Where does the reward in step 1 come from? Is it assigned by H? Is it determined by an outside observer? Is the reward function somehow hardcoded into the context?
I think that the significant distinction is whether an AI system has a utility function that it is attempting to optimize at test time. An LLM does have a utility function, in that there is an objective function written in its training code that it uses to calculate gradients and update its parameters during training. However, once it is deployed, its parameters are frozen and its score on this objective function can no longer impact its behavior. In that sense, I don't think it makes sense to think of an LLM as "trying to" optimize this objective after deployment. However, this answer could change in response to changes in model training strategy, which is why this distinction is significant.
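Here's a minimal sketch of that distinction in generic PyTorch terms (a toy stand-in model, not any real LLM's training code): the objective appears only in the training loop, where it drives gradient updates; at deployment the parameters are frozen and the objective is never evaluated again.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "language model": an embedding plus a linear head over a tiny vocabulary.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# --- Training: the objective function actively shapes the parameters. ---
model.train()
batch = torch.randint(0, vocab_size, (4, 16))        # stand-in for real token ids
logits = model(batch)                                # (batch, seq, vocab)
loss = F.cross_entropy(                              # the "utility function" lives here
    logits[:, :-1].reshape(-1, vocab_size),
    batch[:, 1:].reshape(-1),
)
loss.backward()                                      # the objective changes the weights
optimizer.step()

# --- Deployment: parameters frozen; the objective is never computed again. ---
model.eval()
model.requires_grad_(False)
prompt = torch.randint(0, vocab_size, (1, 8))
with torch.no_grad():
    next_token = model(prompt)[:, -1].argmax(dim=-1)  # behavior no longer feeds back into weights
```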
Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There's extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related -- if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignmen...
Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical vs. empirical alignment research for a long time, and this post and the discussion have been very helpful. I'll probably write that up as a separate post later, but for now I have a few questions:
As an AI researcher, my favourite way to introduce other technical people to AI Alignment is Brian Christian’s book “The Alignment Problem” (particularly section 3). I like that it discusses specific pieces of work, with citations to the relevant papers, so that technical people can evaluate things for themselves if they’re interested. It also doesn’t assume any prior AI safety familiarity from the reader (and brings you into it slowly, starting with mainstream bias concerns in modern-day AI).
I work on AI safety via learning from human feedback. In response to your three ideas:
Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback/behavior/interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise i
In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as that behavior which maximizes the expected reward under that distribution. This is mathematically equivalent to optimizing a single reward function equal to the expectation of the distribution. So, this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, this doesn’t help with the problems of optimizing a single fixed reward function.
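To spell out that equivalence (standard linearity of expectation, in my own notation): if the AI's belief is a distribution $p(R)$ over reward functions and it scores behavior $\tau$ by expected reward, then

$$\mathbb{E}_{R \sim p}\left[R(\tau)\right] = \bar{R}(\tau), \qquad \text{where } \bar{R}(x) := \mathbb{E}_{R \sim p}\left[R(x)\right],$$

so $\operatorname{argmax}_{\tau} \mathbb{E}_{R \sim p}\left[R(\tau)\right] = \operatorname{argmax}_{\tau} \bar{R}(\tau)$. Acting optimally under the whole distribution is therefore the same as optimizing the single fixed reward function $\bar{R}$, with all the usual problems that entails.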
Consciousness, intelligence and human-value-alignment are probably mostly orthogonal, so I don’t think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that’s highly speculative.)
However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.
Short answer: Yep, probably.
Medium answer: If AGI has components that look like our most capable modern deep learning models (which I think is quite likely if it arrives in the next decade or two), it will probably be very resource-intensive to run, and orders of magnitude more expensive to train. This is relevant because it impacts who has the resources to develop AGI (large companies and governments; likely not individual actors), secrecy (it’s more difficult to secretly acquire a massive amount of compute than it is to secretly boot up an AGI on your la...
What can I read/look at to skill up with "alignment"?
A good place to start is the "AGI Safety Fundamentals" course reading list, which includes materials from a diverse set of AI safety research agendas. Reading this can help you figure out who in this space is doing what, and which of that you think is useful. You can also join an official iteration of the course if you want to discuss the materials with a cohort and a facilitator (you can register interest for that here). You can also join the AI Alignment Slack to discuss these and other material...
DeepMind and OpenAI both already employ teams of existential-risk-focused AI safety researchers. While I don't personally work on any of these teams, I get the impression from speaking to them that they are much more talent-constrained than resource-constrained.
I'm not sure how to alleviate this problem in the short term. My best guess would be free bootcamp-style training for value-aligned people who are promising researchers but lack specific relevant skills. For example, ML engineering training or formal mathematics education for junior AIS researcher...
It was a secretive program: it wasn’t advertised anywhere, and we had to sign an NDA about its existence (which we have since been released from). I got the impression that this was because OpenAI really wanted to keep the existence of GPT-4 under wraps. Anyway, that means I don’t have any proof beyond my word.