OpenAI did this too, with GPT-4 pre-release. It was a small program, though — I think just 5-10 researchers.
I'd be surprised if this was employee-level access. I'm aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
This is so interesting. I had no idea that this was a thing! I would have assumed that test-writers wrote all of the answers out, then used a (pseudo-)randomizer to order them. But if that really is a pattern in multiple choice tests, it makes absolute sense that Llama would pick up on it.
I suspect that if you ask the model to reconsider its answer, it would double down even on the incorrect (B-biased) responses. LLMs really like being self-consistent. We haven’t run this experiment, but if you do, let us know the result!
If I understand correctly, your proposed fix is something like supervised finetuning on adversarial examples that trigger the B-bias. We can access the output logits directly (replacing step 1) and the ground-truth answer is provided in the dataset (removing the need for step 2), so this seems relatively doable.
The main cha...
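To make that concrete, here's a rough sketch of the kind of finetuning step I have in mind (purely illustrative: the model name, the A/B/C/D prompt format, and the hyperparameters are placeholder assumptions, not what was actually used). It reads the next-token logits for each answer letter directly (standing in for step 1) and applies cross-entropy against the dataset's ground-truth letter (standing in for step 2), training only on the adversarial examples that trigger the B-bias.

```python
# Rough sketch, not actual code: read answer-option logits straight from the
# model and finetune with cross-entropy against the dataset's ground-truth letter.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder model
OPTIONS = ["A", "B", "C", "D"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# Token id of each answer letter (assuming the prompt ends right before the answer).
option_ids = [tokenizer(" " + o, add_special_tokens=False).input_ids[-1] for o in OPTIONS]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def debias_step(prompt: str, correct_idx: int) -> float:
    """One finetuning step on an adversarial example that triggers the B-bias."""
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]    # logits, no sampling needed
    option_logits = next_token_logits[option_ids]        # scores for A/B/C/D only
    loss = F.cross_entropy(option_logits.unsqueeze(0),   # ground truth from the dataset
                           torch.tensor([correct_idx]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

(Whether training only on the B-biased examples hurts accuracy on ordinary questions is exactly the kind of regression you'd want to check.)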
I’d be interested to see this as well!
Thank you for such a detailed and thorough answer! This resolves a lot of my confusion.
Based on conversations around closing the WeWork Lightcone office, I had assumed that you didn't want to continue hosting office space, and so hadn't considered that counterfactual cost. But the Inn expenses you mention seem more reasonable if the alternative is continuing to rent WeWork space.

The FTX context also makes a lot of sense. I was confused about how the purchase fit into your current strategy and funding situation, but I understand that both of those were quite diff...
These all sound like major benefits to owning the venue yourself!
To be clear, I don't doubt at all that using the Inn for events is much better than non-purpose-built space. However, the Inn also has costs that renting existing spaces wouldn't: I assume that purchasing and renovating it costs more than renting hotel spaces as-needed for events (though please correct me if I'm wrong!), and my impression is that it's taken the Lightcone team a lot of time and effort over the past year+ to purchase and renovate, which naturally has opportunity costs.
I'm askin...
Will much of that $3-6M go into renovating and managing the Rose Garden Inn, or to cover work that could have been covered by existing funding if the Inn wasn't purchased?
If so, I'm curious to hear more about the strategy behind buying and renovating the space, since it seems like a substantial capital investment, and a divergence from Lightcone Infrastructure's previous work and areas of expertise. I'm aware that several (primarily social?) events were held there over the past year, and I see from an earlier comment that you're planning to host SERI MATS ...
Will much of that $3-6M go into renovating and managing the Rose Garden Inn, or to cover work that could have been covered by existing funding if the Inn wasn't purchased?
Thinking about the exact financing of the Inn is a bit messy, especially if we compare it to doing something like running the Lightcone Offices, because of stuff like property appreciation, rental income from people hosting events here, and the hard-to-quantify costs of tying up capital in real estate as opposed to more liquid assets like stocks.
If you assume something like 5% property ap...
Here are some things I like about owning this space:
I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach this. One is to do cognitive science research to build increasingly accurate human models, or to try to just learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or doing something other than Bayesianism that doesn't rely on a likelihood model. I’m more sympathetic to the l...
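As a toy illustration of what "maintaining uncertainty over possible human models" could look like (my own sketch, not any particular paper's method): model the human as Boltzmann-rational with an unknown rationality coefficient beta, keep a joint posterior over (reward function, beta), and marginalize beta out rather than committing to a single human model.

```python
# Toy illustration: infer the human's reward while staying uncertain about *how*
# the human chooses (the Boltzmann rationality coefficient beta).
import numpy as np

# Two candidate reward functions over two actions (a0, a1).
rewards = {"R1": np.array([1.0, 0.0]), "R2": np.array([0.0, 1.0])}
betas = np.array([0.1, 1.0, 10.0])  # candidate human models, from very noisy to near-optimal

# Joint belief over (reward function, human model), starting uniform.
posterior = np.ones((len(rewards), len(betas))) / (len(rewards) * len(betas))

def choice_prob(action, reward_values, beta):
    """P(human picks `action`) under a Boltzmann-rational choice model."""
    logits = beta * reward_values
    probs = np.exp(logits - logits.max())
    return (probs / probs.sum())[action]

# The human mostly picks a0, with occasional "noise".
for action in [0, 0, 1, 0]:
    for i, reward_values in enumerate(rewards.values()):
        for j, beta in enumerate(betas):
            posterior[i, j] *= choice_prob(action, reward_values, beta)
    posterior /= posterior.sum()

# Marginalize out the human model: a belief over rewards that never committed to one beta.
print(dict(zip(rewards, posterior.sum(axis=1))))
```

This doesn't address the case where the true human lies outside the model class entirely, which is the misspecification problem you're pointing at; it just avoids betting everything on a single human model.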
Thanks for the clarification! From OpenAI's announcement, it looks like this ranking only occurs during the finetuning portion of training (Step 2). But the user doesn't have the opportunity to provide this feedback after deployment. So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users? I'm asking because one of the key benefits of CIRL games (also called "assistance games") is that they allow the AI to continuously upd...
Where does the reward in step 1 come from? Is it assigned by H? Is it determined by an outside observer? Is the reward function somehow hardcoded into the context?
I think that the significant distinction is whether an AI system has a utility function that it is attempting to optimize at test time. An LLM does have a utility function, in that there is an objective function written in its training code that it uses to calculate gradients and update its parameters during training. However, once it is deployed, its parameters are frozen and its score on this objective function can no longer impact its behavior. In that sense, I don't think it makes sense to think of an LLM as "trying to" optimize this objective after deployment. However, this answer could change in response to changes in model training strategy, which is why this distinction is significant.
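Here's a minimal sketch of that distinction in generic PyTorch terms (a toy stand-in model, not any real LLM's training code): the objective appears only in the training loop, where it drives gradient updates; at deployment the parameters are frozen and the objective is never evaluated again.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "language model": an embedding plus a linear head over a tiny vocabulary.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# --- Training: the objective function actively shapes the parameters. ---
model.train()
batch = torch.randint(0, vocab_size, (4, 16))        # stand-in for real token ids
logits = model(batch)                                # (batch, seq, vocab)
loss = F.cross_entropy(                              # the "utility function" lives here
    logits[:, :-1].reshape(-1, vocab_size),
    batch[:, 1:].reshape(-1),
)
loss.backward()                                      # the objective changes the weights
optimizer.step()

# --- Deployment: parameters frozen; the objective is never computed again. ---
model.eval()
model.requires_grad_(False)
prompt = torch.randint(0, vocab_size, (1, 8))
with torch.no_grad():
    next_token = model(prompt)[:, -1].argmax(dim=-1)  # behavior no longer feeds back into weights
```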
Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There's extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related -- if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignmen...
Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical vs. empirical alignment research for a long time, and this post and the discussion have been very helpful. I'll probably write that up as a separate post later, but for now I have a few questions:
As an AI researcher, my favourite way to introduce other technical people to AI Alignment is Brian Christian’s book “The Alignment Problem” (particularly section 3). I like that it discusses specific pieces of work, with citations to the relevant papers, so that technical people can evaluate things for themselves if they’re interested. It also doesn’t assume any prior AI safety familiarity from the reader (and brings you into it slowly, starting with mainstream bias concerns in modern-day AI).
I work on AI safety via learning from human feedback. In response to your three ideas:
Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback/behavior/interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise i
In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as that behavior which maximizes the expected reward under that distribution. This is mathematically equivalent to optimizing a single reward function equal to the expectation of the distribution. So, this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, this doesn’t help with the problems of optimizing a single fixed reward function.
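To spell out that equivalence (standard linearity of expectation, in my own notation): if the AI's belief is a distribution $p(R)$ over reward functions and it scores behavior $\tau$ by expected reward, then

$$\mathbb{E}_{R \sim p}\left[R(\tau)\right] = \bar{R}(\tau), \qquad \text{where } \bar{R}(x) := \mathbb{E}_{R \sim p}\left[R(x)\right],$$

so $\operatorname{argmax}_{\tau} \mathbb{E}_{R \sim p}\left[R(\tau)\right] = \operatorname{argmax}_{\tau} \bar{R}(\tau)$. Acting optimally under the whole distribution is therefore the same as optimizing the single fixed reward function $\bar{R}$, with all the usual problems that entails.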
Consciousness, intelligence and human-value-alignment are probably mostly orthogonal, so I don’t think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that’s highly speculative.)
However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.
Short answer: Yep, probably.
Medium answer: If AGI has components that look like our most capable modern deep learning models (which I think is quite likely if it arrives in the next decade or two), it will probably be very resource-intensive to run, and orders of magnitude more expensive to train. This is relevant because it impacts who has the resources to develop AGI (large companies and governments; likely not individual actors), secrecy (it’s more difficult to secretly acquire a massive amount of compute than it is to secretly boot up an AGI on your la...
What can I read/look at to skill up with "alignment"?
A good place to start is the "AGI Safety Fundamentals" course reading list, which includes materials from a diverse set of AI safety research agendas. Reading this can help you figure out who in this space is doing what, and which of that you think is useful. You can also join an official iteration of the course if you want to discuss the materials with a cohort and a facilitator (you can register interest for that here). You can also join the AI Alignment Slack to discuss these and other material...
DeepMind and OpenAI both already employ teams of existential-risk-focused AI safety researchers. While I don't personally work on any of these teams, I get the impression from speaking to them that they are much more talent-constrained than resource-constrained.
I'm not sure how to alleviate this problem in the short term. My best guess would be free bootcamp-style training for value-aligned people who are promising researchers but lack specific relevant skills. For example, ML engineering training or formal mathematics education for junior AIS researcher...
It was a secretive program: it wasn’t advertised anywhere, and we had to sign an NDA about its existence (which we have since been released from). I got the impression that this was because OpenAI really wanted to keep the existence of GPT-4 under wraps. Anyway, that means I don’t have any proof beyond my word.