Stephen McAleese

Computer science master's student interested in AI and AI safety.

Comments

Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.

It seems like the Centre for AI Security is a new organization.

I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.

Is MIRI still doing technical alignment research as well?

This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.

Thanks for the table, it provides a good summary of the post's findings. It might also be worthwhile to add it to the EA Forum post.

I think the table should include the $10 million in OpenAI Superalignment fast grants as well.

I think there are some great points in this comment, but I think it's overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community's culture than opinions or personal interactions is upvotes and downvotes, which are much more frequent and cheap actions and therefore more representative. For example, your posts such as "Reward is not the optimization target" have received hundreds of upvotes, so apparently they are positively received.

LessWrong these days is huge, with probably over 100,000 monthly readers, so I think it's challenging to summarize its culture in any particular way (e.g. probably most users on LessWrong live outside the Bay Area and maybe even outside the US). I personally find that LessWrong as a whole is fairly meritocratic and not that dogmatic, and that a wide variety of views are supported provided that they are sufficiently well-argued.

In addition to LessWrong, I use some other related sites such as Twitter, Reddit, and Hacker News, and although there may be problems with the discourse on LessWrong, I think it's generally significantly worse on these other sites. Even today, I'm sure you can find people saying things on Twitter about how AIs can't have goals or that wanting paperclips is stupid. These kinds of comments wouldn't be tolerated on LessWrong because they're ignorant and a waste of time. Human nature can be prone to ignorance, rigidity of opinion, and so on, but I think the LessWrong walled garden has been able to counteract these negative tendencies better than most other sites.

State-of-the-art models such as Gemini aren't LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.

  • Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.

A recent paper from UC Berkeley titled Preventing Reward Hacking with Occupancy Measure Regularization proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:

"Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD [action distribution] divergence to prevent reward hacking"

The idea is that regularizing to minimize changes in the action distribution isn't always safe because small changes in the action distribution can cause large changes in the states visited by the agent:

Suppose we have access to a safe policy that drives slowly and avoids falling off the cliff. However, the car is optimizing a proxy reward function that prioritizes quickly reaching the destination, but not necessarily staying on the road. If we try to regularize the car’s action distributions to the safe policy, we will need to apply heavy regularization, since only slightly increasing the probability of some unsafe action (e.g., making a sharp right turn) can lead to disaster.

...

Our proposal follows naturally from this observation: to avoid reward hacking, regularize based on divergence from the safe policy’s occupancy measure, rather than action distribution.  A policy’s occupancy measure (OM) is the distribution of states or state-action pairs seen by a policy when it interacts with its environment.
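To make the contrast concrete, here's a toy numerical sketch (my own illustration, not code from the paper) of how a reward-hacking policy's action distribution can stay close to the safe policy's while its state occupancy measure diverges sharply, so an OM-based penalty flags the behaviour that an AD-based penalty barely notices:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical rollout statistics for the cliff-driving example.
# The reward-hacking policy's action distribution barely differs from the safe policy's...
safe_actions   = np.array([0.70, 0.25, 0.05])   # [drive straight, slow turn, sharp turn]
hacked_actions = np.array([0.68, 0.24, 0.08])

# ...but the states it actually visits are drastically different.
safe_states   = np.array([0.90, 0.09, 0.01])    # [on road, on shoulder, off cliff]
hacked_states = np.array([0.30, 0.20, 0.50])

print("AD divergence:", kl_divergence(hacked_actions, safe_actions))  # small -> weak penalty
print("OM divergence:", kl_divergence(hacked_states, safe_states))    # large -> strong penalty
```

The numbers are made up, but they show why only a very heavy action-distribution penalty would rule out the unsafe behaviour, whereas the occupancy-measure penalty separates the two policies cleanly.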

I just asked GPT-4 a GSM8K problem and I agree with your point. I think what's happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it's no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to "respond with just a single number" to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.

Chain-of-thought prompting makes models much more capable. In the original paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", PaLM 540B solves only 18% of GSM8K problems with standard prompting but 57% with chain-of-thought prompting.
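For a rough sense of the difference (paraphrasing the paper's well-known worked example, not its exact prompts), standard prompting asks for the answer directly, while chain-of-thought prompting shows intermediate reasoning before the final answer:

```python
# Paraphrased illustration of the two prompting styles from the chain-of-thought paper.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)

cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"  # with the worked example, the model tends to reason step by step: 23 - 20 + 6 = 9
)
```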

I expect the use of agent features such as reflection to lead to similarly large capability increases in the near future.
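By reflection I mean something like the following self-critique loop (a generic sketch, not any particular framework's API; `llm` here is a hypothetical function that maps a prompt string to a completion string):

```python
def reflect_and_revise(llm, task, max_rounds=3):
    """Toy reflection loop: draft an answer, critique it, revise, repeat."""
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nAnswer: {answer}\nList any mistakes or omissions, or say 'none':")
        if critique.strip().lower().startswith("none"):
            break
        answer = llm(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nWrite an improved answer:"
        )
    return answer
```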

Currently, groups of LLM agents can collaborate using frameworks such as ChatDev, which simulates a virtual software company using LLM agents with different roles. Though I think human organizations are still more effective for now.  For example, corporations such as Microsoft have over 200,000 employees and can work on multi-year projects. But it's conceivable that in the future there could be virtual companies composed of millions of AIs that can coordinate effectively and can work continuously at superhuman speed for long periods of time.
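A minimal sketch of the role-based pattern (my own toy illustration, not ChatDev's actual architecture; `llm` is again a hypothetical prompt-to-completion function):

```python
def run_virtual_company(llm, product_idea):
    """Toy pipeline of LLM 'employees' with different roles handing work downstream."""
    roles = [
        ("CEO",        "Turn this product idea into a short specification:"),
        ("programmer", "Write code that implements this specification:"),
        ("reviewer",   "Review this code and list any fixes needed:"),
    ]
    artifact = product_idea
    for role, instruction in roles:
        artifact = llm(f"You are the {role} of a software company.\n{instruction}\n\n{artifact}")
    return artifact
```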
