Stephen McAleese

Computer science master's student interested in AI and AI safety.

Comments

I agree. I don't see a clear distinction between what's in the model's predictive model and what's in the model's preferences. Here is a line from the paper "Learning to summarize from human feedback":

"To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x."

Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
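As a rough illustration of the quoted setup, here is a minimal sketch of a scalar linear head and a pairwise comparison loss. The logistic form of the loss is how I understand the paper's objective; the names `scalar_reward`, `pairwise_preference_loss`, and the `features` vector (standing in for the language model's final hidden state) are hypothetical and only there for illustration.

```python
import math
from typing import Sequence


def scalar_reward(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear head: maps a feature vector to a single scalar reward.

    `features` is a hypothetical stand-in for the language model's final hidden state.
    """
    return sum(f * w for f, w in zip(features, weights)) + bias


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-probability that the human-preferred summary wins.

    Computes -log(sigmoid(reward_preferred - reward_rejected)) in a
    numerically stable form (softplus of the negated margin).
    """
    margin = reward_preferred - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss is small when the head scores the human-preferred summary higher and large when it scores it lower, so gradient descent on it pushes the shared pretrained weights and the new head toward the human judgments.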

I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.

Answer by Stephen McAleese

I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".

Best-of-n or rejection sampling is an alternative to RLHF which involves generating n responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because it's searching for LLM outputs that achieve the highest reward from the reward model.
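To make the "selection of a best element" framing concrete, here is a minimal sketch of best-of-n. The `generate` and `reward` callables are hypothetical placeholders for an LLM sampler and a reward model; nothing here is a real library API.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one response from an LLM
    reward: Callable[[str, str], float],  # hypothetical: reward model score for (prompt, response)
    n: int,
) -> Tuple[str, float]:
    """Sample n responses and return the one with the highest reward, plus its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, candidate) for candidate in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]
```

The set of alternatives is the n sampled responses and the criterion is the reward model score, which is why this fits the Wikipedia definition even though no weights are updated.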

I'd also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, "At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$ so as to maximize action value plus a bonus" and the formula is: $a_t = \operatorname{argmax}_a \big(Q(s_t, a) + u(s_t, a)\big)$, where $u(s_t, a)$ is an exploration bonus.
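Here is a minimal sketch of that selection rule: pick the action that maximizes action value plus a bonus. I'm assuming a PUCT-style bonus that grows with the policy prior and shrinks with the visit count, which is my reading of the paper's methods section. The dictionaries and the default `c_puct` value are illustrative assumptions, not the paper's actual data structures.

```python
import math
from typing import Dict


def select_action(
    q: Dict[str, float],       # mean action value Q(s, a) per action
    prior: Dict[str, float],   # policy prior P(s, a) per action
    visits: Dict[str, int],    # visit count N(s, a) per action
    c_puct: float = 1.0,       # exploration constant (illustrative value)
) -> str:
    """Return the action maximizing Q(s, a) + u(s, a)."""
    total_visits = sum(visits.values())

    def score(action: str) -> float:
        # Assumed bonus: large for high-prior, rarely visited actions.
        bonus = c_puct * prior[action] * math.sqrt(total_visits) / (1 + visits[action])
        return q[action] + bonus

    return max(q, key=score)
```

With `c_puct = 0` this reduces to a pure argmax over Q, so the search is greedy with respect to estimated value and the bonus only adds exploration on top.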

Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or -1 depending on who wins).

The value function predicts the return (expected sum of future reward) from a position, whereas the random rollouts calculate the actual average reward by simulating future moves until the end of the game, when the reward function returns +1 or -1.
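The combination of predicted and actual reward can be sketched as follows. I'm assuming a simple weighted average of the value estimate and the rollout outcome, with an equal weighting as the default; the function names and the `mixing` parameter are hypothetical.

```python
from typing import Sequence


def leaf_value(value_estimate: float, rollout_outcome: float, mixing: float = 0.5) -> float:
    """Blend the value function's prediction with a rollout's final outcome (+1 or -1).

    mixing = 0 uses only the value function; mixing = 1 uses only the rollout.
    """
    return (1.0 - mixing) * value_estimate + mixing * rollout_outcome


def action_value(leaf_values: Sequence[float]) -> float:
    """Q for an action: the mean of the leaf evaluations in the subtree below it."""
    return sum(leaf_values) / len(leaf_values)
```

For example, a leaf with a value estimate of 0.2 whose rollout ends in a win (+1) gets a blended value of about 0.6, and Q for the action is the average of such values over all simulations that passed through it.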

So I think AlphaGo is optimizing for a combination of predicted reward (from the value function) and actual reward, which is calculated using multiple rollouts until the end of the game.

SummaryBot summary from the EA Forum:

Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.

Key points:

  1. Neural networks, initially unpopular, became dominant in AI due to increased computational power and data availability.
  2. Hinton argues that large language models (LLMs) truly understand language, similar to how the human brain processes information.
  3. Digital neural networks have advantages over biological ones, including easier information sharing and potentially superior learning algorithms.
  4. Hinton believes there's a 50% chance AI will surpass human intelligence within 20 years, with a 10-20% risk of causing human extinction.
  5. To mitigate risks, Hinton suggests government-mandated AI safety research and international cooperation.
  6. Two possible future scenarios: AI takeover leading to human extinction, or humans successfully coexisting with superintelligent AI assistants.

Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.

The parent and child outcome seems a bit worse than that because usually a small child is completely dependent on their parent and all their resources are controlled by the parent unless they have pocket money or something like that.

It's an original LessWrong post by me. Though all the quotes and references are from external sources.

There's a rule of thumb called the "1% rule" on the internet that 1% of users contribute to a forum and 99% only read the forum.

Thank you for the insightful comment.

On the graph of alignment difficulty and cost, I think the shape depends on two things: the inherent increase in alignment cost and the degree of automation we can expect. This is similar to the idea of the offence-defence balance.

In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty and then maybe automation would lower it to a linear increase.

In the best case, automation covers all of the costs associated with increasing alignment difficulty and the graph is flat in terms of human effort and more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.

The rate of progress on the MATH dataset is incredible and faster than I expected.

The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.

The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025 and ~80% accuracy by 2028.

But recently (September 2024), OpenAI released their new o1 model which achieved ~95% on the MATH dataset.

So it seems like we're getting 2028 performance on the MATH dataset already in 2024.

Quote from the blog post:

"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."

Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.

The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.

Note that although implementing better alignment solutions would probably be more costly, advancements in AI capabilities could flatten the cost curve by automating some of the work. For example, constitutional AI seems significantly more complex than regular RLHF, but it might not be much harder for organizations to implement due to partial automation (e.g. RLAIF). So even if future alignment techniques are much more complex than today, they might not be significantly harder to implement (in terms of human effort) due to increased automation and AI involvement.
