Hey! This is another update from the distillers at AI Safety Info.
Here are some of the answers that we wrote up over the last month (July 2023). As always, let us know in the comments if there are any questions you would like to see answered.
Each item in the list below links to its individual answer, while the collective URL above renders all of the answers on a single page.
These are the new articles. There has also been significant work on overhauling articles already live on the site to improve their quality, based on the feedback we have been receiving from readers.
- Isn't the real concern misuse?
- What is Vingean uncertainty?
- What is a "polytope" in a neural network?
- What are the power-seeking theorems?
- How does "chain-of-thought" prompting work?
- What is "Constitutional AI"?
- How can LLMs be understood as “simulators”?
- What evidence do experts usually base their timeline predictions on?
- Wouldn't AIs need to have a power-seeking drive to pose a serious risk?
- What is a “treacherous turn”?
- What is reinforcement learning from human feedback (RLHF)?
- What is an agent?
- What is the Alignment Research Center's research strategy?
- Wouldn't a superintelligence be smart enough not to make silly mistakes in its comprehension of our instructions?
- What are the differences between subagents and mesa-optimizers?
- What is the difference between verifiability, interpretability, transparency, and explainability?
Cross-posted to the EA Forum: https://forum.effectivealtruism.org/posts/ttBKSopeo59AedWZA/stampy-s-ai-safety-info-new-distillations-4-july-2023
Hi, just a quick comment regarding the power-seeking theorems post: the definition you give of "power" as the expected utility of optimal behavior is not the same as the one used in the power-seeking theorems.
The theorems are not about any particular agent, but are statements about processes which produce agents. The definition of power is more about the number of states an agent can access. Colloquially, they're more of the flavor "for a given optimizing process, training it on most utility functions will cause the agent to take actions which give it access to a wide range of states."
Critically, this is a statement about the behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions. Because you average over all possible utility functions, high-power states are just those which have more utility functions associated with them, which is, speaking loosely, proportional to the number of states the agent can get to from the state in question.
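To make the averaging concrete, here is a rough toy sketch (my own illustration, not the papers' formal setup): a tiny deterministic MDP with a made-up "hub" state that can reach several other states and a "dead_end" state that can only loop on itself. Averaging the optimal discounted value of each state over many randomly sampled reward functions, the hub ends up with a higher score, which is the loose sense in which "power" tracks the number of reachable states.

```python
# Toy sketch only: a tiny deterministic MDP (state names made up for illustration).
# For each of many randomly sampled reward functions, compute the optimal
# discounted value of each state, then average over samples. States that can
# reach more of the state space ("hub") end up with a higher average optimal
# value than states that cannot ("dead_end").
import random

# Each state maps to the states an agent can move to from it.
TRANSITIONS = {
    "hub":      ["a", "b", "c", "dead_end"],
    "a":        ["hub"],
    "b":        ["hub"],
    "c":        ["hub"],
    "dead_end": ["dead_end"],
}
GAMMA = 0.9       # discount factor
SWEEPS = 200      # value-iteration sweeps per sampled reward function
SAMPLES = 2000    # number of random reward functions to average over

def optimal_values(reward):
    """Value iteration for this deterministic MDP with state-based rewards."""
    v = {s: 0.0 for s in TRANSITIONS}
    for _ in range(SWEEPS):
        v = {s: reward[s] + GAMMA * max(v[t] for t in TRANSITIONS[s])
             for s in TRANSITIONS}
    return v

def average_optimal_value():
    """Average V*(s) over reward functions drawn uniformly from [0, 1] per state."""
    totals = {s: 0.0 for s in TRANSITIONS}
    for _ in range(SAMPLES):
        reward = {s: random.random() for s in TRANSITIONS}
        values = optimal_values(reward)
        for s in TRANSITIONS:
            totals[s] += values[s]
    return {s: totals[s] / SAMPLES for s in TRANSITIONS}

if __name__ == "__main__":
    for state, avg in sorted(average_optimal_value().items(), key=lambda kv: -kv[1]):
        print(f"{state:>8}: {avg:.2f}")
    # Expect "hub" (and the states that can reach it) to score higher than "dead_end".
```

This is only meant to convey the flavor of the averaging argument; the actual theorems are stated in terms of distributions over reward functions and retargetability arguments rather than this kind of Monte Carlo estimate.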
Moderate (85%) confidence; based on having read two of the three listed power-seeking papers a few months ago for MATS apps, and writing an unfinished summary about them.
I think we agree modulo terminology with respect to your remarks, up to the part about the Krakovna paper, which I had to sit and think about a bit more.
For the Krakovna paper, you're right that it has a different flavor than I remembered. It still seems, though, that the proof relies on having some ratio of recurrent vs. non-recurrent states. So if you did something like 1000x the number of terminal states, the reward function would be 1000x less retargetable to recurrent states. I think this is still true even if the new terminal states are entirely ...