All of Sophie Bridgers's Comments + Replies

Thanks for another thoughtful response and explaining further. I think we can now both agree that we disagree (at least in certain respects) ;-)

We take seriously your argument that AI could get really smart and good at predicting human preferences and values, which could change the level of human involvement in training, evaluation, and monitoring. However, if we go with the approach you propose:

> Instead, I think our strategy should be "If humans are inconsistent and disagree, let's strive to learn a notion of human values that's robust to our inconsistency and disagreement." …

Hi Charlie,

Thanks for your thoughtful feedback and comments! If we may, we think we actually agree more than we disagree. By “definitionally accurate”, we don’t necessarily mean that a group of randomly selected humans is better than AI at explicitly defining or articulating human values, or better at translating those values into actions in any given situation. We might call this “empirical accuracy” – that is, under certain empirical conditions such as time pressure, expertise and background of the empirical sample, incentive structure of the empirical …

Charlie Steiner
Thanks for the great reply :) I think we do disagree after all. Except about that - here we agree.

This might be summarized as "If humans are inaccurate, let's strive to make them more accurate." I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren't actually that consistent, even in what we'd consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And how do you square a univariate notion of 'accuracy' with the sensitivity of human conclusions to semi-arbitrary changes to, e.g., their reading lists, or the framings of arguments presented to them?).

Instead, I think our strategy should be "If humans are inconsistent and disagree, let's strive to learn a notion of human values that's robust to our inconsistency and disagreement."

A committee of humans reviewing an AI's proposal is, ultimately, a physical system that can be predicted. If you have an AI that's good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actually convening the humans in committee is unnecessary. (And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI's decision-making.)
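To make the "robust to inconsistency and disagreement" strategy a bit more concrete, here is a toy Python sketch of one way it could look: instead of trying to make any single human consistent, fit one latent score per option from many noisy, mutually conflicting pairwise judgments (a Bradley-Terry-style fit). This is my own illustration, not something the commenters specify; every name in it is hypothetical.

```python
# Toy sketch: learn a pooled "notion of value" from inconsistent raters.
# We fit one latent score per option from many conflicting pairwise
# judgments, rather than trying to fix any individual human's consistency.

import math
import random

def fit_latent_scores(comparisons, n_options, lr=0.05, epochs=200):
    """Bradley-Terry-style fit. `comparisons` is a list of (winner, loser)
    index pairs collected from many (mutually inconsistent) humans."""
    scores = [0.0] * n_options
    for _ in range(epochs):
        random.shuffle(comparisons)
        for w, l in comparisons:
            # Probability the pooled model prefers w over l.
            p = 1.0 / (1.0 + math.exp(scores[l] - scores[w]))
            # Gradient ascent on the log-likelihood of the observed judgment.
            scores[w] += lr * (1.0 - p)
            scores[l] -= lr * (1.0 - p)
    return scores

# Example: raters disagree about options 0 vs 1, but agree 2 is worst.
data = [(0, 1), (1, 0), (0, 1), (0, 2), (1, 2), (0, 2)]
print(fit_latent_scores(data, n_options=3))
```

The point of the sketch is just that disagreement becomes noise around a learned aggregate rather than an error to be trained out of the humans themselves.

The Committee(time, proposal) idea can also be sketched as a decision gate. Again, this is a hypothetical scaffold (Proposal, CommitteeModel, vet_proposal, and the confidence threshold are all my inventions), assuming the AI has some learned predictor of what the committee would say:

```python
# Minimal sketch of the "predicted committee" loop: before acting, the AI
# queries its own model of Committee(time, proposal) and treats the
# predicted verdict as feedback, falling back to real humans only when
# the prediction is uncertain.

from dataclasses import dataclass

@dataclass
class Proposal:
    description: str

@dataclass
class Feedback:
    approve: bool
    confidence: float  # model's confidence that real humans would agree

class CommitteeModel:
    """Stand-in for a learned predictor of the committee as a physical system."""

    def predict(self, time: float, proposal: Proposal) -> Feedback:
        raise NotImplementedError  # would be a learned model in practice

def ask_real_committee(proposal: Proposal) -> bool:
    """Placeholder for actually convening the humans."""
    raise NotImplementedError

def vet_proposal(model: CommitteeModel, time: float, proposal: Proposal,
                 confidence_threshold: float = 0.95) -> bool:
    """Gate an action on the predicted committee's verdict."""
    feedback = model.predict(time, proposal)
    if feedback.confidence >= confidence_threshold:
        # Accurate prediction: treat it exactly as if the committee had met.
        return feedback.approve
    # Uncertain prediction: defer to actual human review.
    return ask_real_committee(proposal)
```

On this picture, the committee's control lives in the logical structure of the AI's decision procedure (the `predict` call) rather than in a physical meeting that could be interfered with, which is the manipulation point the comment closes on.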