Hey! This is another update from the distillers at AI Safety Info.
Here are some of the answers that we wrote up over the last month (July 2023). As always, let us know in the comments if there are any questions you would like to see answered.
Each entry in the list below links to its individual article, while the collective URL above renders all of the answers on a single page.
These are the new articles. There has also been significant work overhauling articles that are already live on the site, improving their quality based on the feedback we have been receiving from readers.
- Isn't the real concern misuse?
- What is Vingean uncertainty?
- What is a "polytope" in a neural network?
- What are the power-seeking theorems?
- How does "chain-of-thought" prompting work?
- What is "Constitutional AI"?
- How can LLMs be understood as “simulators”?
- What evidence do experts usually base their timeline predictions on?
- Wouldn't AIs need to have a power-seeking drive to pose a serious risk?
- What is a “treacherous turn”?
- What is reinforcement learning from human feedback (RLHF)?
- What is an agent?
- What is the Alignment Research Center's research strategy?
- Wouldn't a superintelligence be smart enough not to make silly mistakes in its comprehension of our instructions?
- What are the differences between subagents and mesa-optimizers?
- What is the difference between verifiability, interpretability, transparency, and explainability?
Cross-posted to the EA Forum: https://forum.effectivealtruism.org/posts/ttBKSopeo59AedWZA/stampy-s-ai-safety-info-new-distillations-4-july-2023
I think we agree modulo terminology with respect to your remarks, up to the part about the Krakovna paper, which I had to sit and think about a bit more.
For the Krakovna paper, you're right that it has a different flavor than I remembered. It still seems, though, that the proof relies on having some ratio of recurrent vs. non-recurrent states. So if you did something like 1000x the number of terminal states, the reward function becomes 1000x less retargetable toward recurrent states - and I think this is still true even if the new terminal states are entirely unreachable?
With respect to the CNN example I agree, at least at a high level - though technically the theta reward vectors are supposed to have dimension |S| and specify the reward for each state, which is slightly different from being the weights of a CNN. Without redoing the math, it's plausible that an analogous theorem would hold. Regardless, the non-shutdown result gives retargetability because it assumes there's a single terminal state and many recurrent states. The retargetability is really just the ratio (number of terminal states) / (number of recurrent states), which needn't be greater than one.
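(To make the counting concrete, here's a minimal sketch of the ratio as I'm picturing it. The one-hot reward construction and the `preference_ratio` helper are my own toy simplification for illustration, not the actual setup from the Krakovna or TurnTrout papers.)

```python
# Toy construction (mine, not the papers'): treat each reward function as a
# one-hot vector over states, so "optimal behavior" just reaches whichever
# state carries the reward. The terminal-vs-recurrent preference then reduces
# to a pure counting ratio over which state got the reward.

def preference_ratio(n_terminal: int, n_recurrent: int) -> float:
    """Ratio of terminal-preferring to recurrent-preferring one-hot rewards."""
    terminal_preferring = n_terminal      # one-hot rewards landing on a terminal state
    recurrent_preferring = n_recurrent    # one-hot rewards landing on a recurrent state
    return terminal_preferring / recurrent_preferring

print(preference_ratio(n_terminal=1, n_recurrent=1000))     # 0.001: single terminal state, many recurrent states
print(preference_ratio(n_terminal=1000, n_recurrent=1000))  # 1.0: 1000x the terminal states, 1000x less retargetable toward recurrent states
```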
Anyways, as the comments from TurnTrout discuss, as soon as there's a nontrivial inductive bias over these different reward functions (or any other path-dependence-y stuff that deviates from optimality), the theorem doesn't go through, since retargetability is all based on counting how many of the functions in that set are A-preferring vs. B-preferring. There may be an adaptation of the argument that uses some prior over generalizations and the like, though - but then that prior is the inductive bias, which, as you noted with those TurnTrout remarks, is its own whole big problem :')
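(Continuing the same toy construction to show why a nontrivial prior breaks the pure counting argument: the `recurrent_preferring_mass` helper and the weights below are hypothetical, purely for illustration.)

```python
# With an inductive bias, what matters is prior *mass* on each set of reward
# functions rather than raw counts. A strong enough bias toward terminal-state
# rewards flips the conclusion even when recurrent states vastly outnumber
# terminal ones. (Weights are made up for illustration.)

def recurrent_preferring_mass(n_terminal: int, n_recurrent: int,
                              w_terminal: float, w_recurrent: float) -> float:
    """Prior mass on recurrent-preferring one-hot rewards under per-state weights."""
    total = n_terminal * w_terminal + n_recurrent * w_recurrent
    return (n_recurrent * w_recurrent) / total

print(recurrent_preferring_mass(1, 1000, w_terminal=1.0, w_recurrent=1.0))  # ~0.999: uniform prior recovers the counting argument
print(recurrent_preferring_mass(1, 1000, w_terminal=1e6, w_recurrent=1.0))  # ~0.001: the bias swamps the count
```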
I'll try and add a concise caveat to your doc, thanks for the discussion :)