Stephen McAleese

Computer science master's student interested in AI and AI safety.

Wiki Contributions

Comments

Currently, groups of LLM agents can collaborate using frameworks such as ChatDev, which simulates a virtual software company using LLM agents with different roles. Though I think human organizations are still more effective for now.  For example, corporations such as Microsoft have over 200,000 employees and can work on multi-year projects. But it's conceivable that in the future there could be virtual companies composed of millions of AIs that can coordinate effectively and can work continuously at superhuman speed for long periods of time.

I think I might create a new post using information from this post which covers the new AI alignment landscape.

I think this section of the post is slightly overstating the opportunity cost of doing a PhD. PhD students typically spend most of their time on research so ideally, they should be doing AI safety research during the PhD (e.g. like Stephen Casper). If the PhD is in an unrelated field or for the sake of upskilling then there is a more significant opportunity cost relative to working directly for an AI safety organization.

Thank you for explaining PPO. In the context of AI alignment, it may be worth understanding in detail because it's the core algorithm at the heart of RLHF. I wonder if any of the specific implementation details of PPO or how it's different from other RL algorithms have implications for AI alignment. To learn more about PPO and RLHF, I recommend reading this paper: Secrets of RLHF in Large Language Models Part I: PPO.

From reading the codebase, it seems to be a LangChain chatbot powered by the default LangChain OpenAI model which is gpt-3.5-turbo-instruct. The announcement blog post also says it's based on gpt-3.5-turbo.

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from https://aisafety.info. There also seem to be plans to fine-tune it on a dataset of alignment articles.

OP says that this post is focused on RL policy gradient algorithms (e.g. PPO) where the RL signal is used by gradient descent to update the policy.

But what about Q-learning which is another popular RL algorithm? My understanding of Q-learning is that the policy network takes an observation as input, calculates the value (expected return) of each possible action in the state  and then chooses the action with the highest value.

Does this mean that reward is not the optimization target for policy gradient algorithms but is for Q-learning algorithms?

I agree. GPT-4 is an AGI for the kinds of tasks I care about such as programming and writing. ChatGPT4 in its current form (with the ability to write and execute code) seems to be at the expert human level in many technical and quantitative subjects such as statistics and programming.

For example, last year I was amazed when I gave ChatGPT4 one of my statistics past exam papers and it got all the questions right except for one which involved interpreting an image of a linear regression graph. The questions typically involve understanding the question, thinking of an appropriate statistical method, and doing calculations to find the right answer. Here's an example question:

Times (in minutes) for a sample of 8 players are presented in Table 1 below. Using an appropriate test at the 5% significance level, investigate whether there is evidence of a decrease in the players’ mean 5k time after the six weeks of training. State clearly your assumptions and conclusions, and report a p-value for your test statistic.

The solution to this question is a paired sample t-test.

Sure, GPT-4 has probably seen similar questions before but so do students since they can practice past papers.

This year, one of my professors designed his optimization assignment to be ChatGPT-proof but I found that it could still solve five out of six questions successfully. The questions involved converting natural language descriptions of optimization problems into mathematical formulations and solving them with a program.

One of the few times I've seen GPT-4 genuinely struggle to do a task is when I asked it to solve a variant of the Zebra Puzzle which is a challenging logical reasoning puzzle that involves updating a table based on limited information and using logical reasoning and a process of elimination to find the correct answer.

Answer by Stephen McAleeseMar 01, 202430

I wrote a blog post on whether AI alignment can be automated last year. The key takeaways:

  • There's a chicken-and-egg problem where you need the automated alignment researcher to create the alignment solution but the alignment solution is needed before you can safely create the automated alignment researcher. The solution to this dilemma is an iterative bootstrapping process where the AI's capabilities and alignment iteratively improve each other (a more aligned AI can be made more capable and a more capable AI can create a more aligned AI and so on).
  • Creating the automated alignment researcher only makes sense if it is less capable and general than a full-blown AGI. Otherwise, aligning it is just as hard as aligning AGI.

There's no clear answer to this question because it depends on your definition of "AI alignment" work. Some AI alignment work is already automated today such as generating datasets for evals, RL from AI feedback, and simple coding work. On the other hand, there are probably some AI alignment tasks that are AGI-complete such as deep, cross-domain, and highly creative alignment work.

The idea of the bootstrapping strategy is that as the automated alignment researcher is made more capable, it improves its own alignment strategies which enables further capability and alignment capabilities and so on. So hopefully there is a virtuous feedback loop over time where more and more alignment tasks are automated.

However, this strategy relies on a robust feedback loop which could break down if the AI is deceptive, incorrigible, or undergoes recursive self-improvement and I think these risks increase with higher levels of capability.

I can't find the source but I remember reading somewhere on the MIRI website that MIRI aims to do work that can't easily be automated so Eliezer's pessimism makes sense in light of that information.

Further reading:

Strong upvote. I think this is an excellent, carefully written, and timely post. Explaining issues that may arise from current alignment methods is urgent and important. It provides a good explanation of the unidentifiability or inner alignment problem that could arise from advanced AIs systems trained with current behavioral safety methods. It also highlights the difficulty of making AIs that can automate alignment research which is part of OpenAI's current plan. I also liked the in-depth description of what advanced science AIs would be capable of as well as the difficulty of keeping humans in the loop.

Load More