What do you think about the failure mode described here? In particular, (a) how would you go about deciding whether politicians and CEOs inspired by your paper will decrease, or increase, the pressure on AIs to believe false things (and thus become epistemically broken fanatics/ideologues) and/or lie about what they believe? And (b) how would you go about weighing the benefits and costs of increased trust in AI systems?
To add to what Owain said:
This is very helpful, thanks! I now have a better understanding of what you are doing and basically endorse it. (FWIW, this is what I thought/hoped you were doing.)
(This won't address all parts of your questions.)
You suggest that the default outcome is for governments and tech platforms to not regulate whether AI needs to be truthful. I think it’s plausible that the default outcome is some kind of regulation.
Why expect regulation?
Suppose an AI system produces false statements that deceive a group of humans. Suppose also that the deception is novel in some way: e.g. the falsehoods are personalized to individuals, the content/style is novel, or the humans behind the AI didn't intend any kind of deception. I think if this happens repeatedly, there will be some kind of regulation. This could be voluntary self-regulation from tech companies or normal regulation by governments. Regulation may be more likely if it’s harder to litigate using existing laws relating to (human) deception.
Why expect AI to cause deception?
You also suggest that in the default scenario AI systems say lots of obviously false things and most humans would learn to distrust them. So there's little deception in the first place. I’m uncertain about this but your position seems overconfident. Some considerations:
1. AI systems that generate wild and blatant falsehoods all the time are not very useful. For most applications, it’s more useful to have systems that are fairly truthful in discussing non-controversial topics. Even for controversial or uncertain topics, there’s pressure for systems to not stray far from the beliefs of the intended audience.
2. I expect some people will find text/chat by AI systems compelling based on stylistic features. Style can be personalized to individual humans. For example, texts could be easy to read (“I understood every word without pausing once!”) and entertaining (“It was so witty that I just didn’t want to stop reading”). Texts can also use style to signal intelligence and expertise (“This writer is obviously a genius and so I took their views seriously”).
3. Sometimes people won't know whether it was an AI or human who generated the text. If there are tools for distinguishing, some people won’t use them and some won’t have enough tech savvy to use them well.
4. There are humans (“charlatans”) who frequently say false and dubious things while having devoted followers. Not all human followers of charlatans are “fools” (to refer to your original question). AI charlatans would have the advantage of more experimentation. Human charlatans exploit social proof and AI charlatans could do the same (e.g. humans believe the claim X because they think other humans they like/trust believe X).
This is helpful, thanks!
I agree that we should expect regulation by default. And so then maybe the question is: Is the regulation that would be inspired by Truthful AI better or worse than the default? Seems plausibly better to me, but mostly it seems not that different to me. What sort of regulation are you imagining would happen by default, and why would it be significantly worse?
I also totally agree with your points 1 - 4.
It seemed like a common refrain a few years ago that we didn't want to involve government in regulating AI because we didn't know enough about what policy interventions would be counterproductive. Do you think that's changed? What are the obvious things to look at for regulation?
I don't think I'm yet at "here's regulation that I'd just like to see", but I think it's really valuable to try to have discussions about what kind of regulation would be good or bad. At some point there will likely be regulation in this space, and it would be great if that was based on as deep an understanding as possible about possible regulatory levers, and their direct and indirect effects, and ultimate desirability.
I do think it's pretty plausible that regulation about AI and truthfulness could end up being quite positive. But I don't know enough to identify in exactly what circumstances it should apply, and I think we need a bit more groundwork on building and recognising truthful AI systems first. I guess quite a bit of our paper is trying to open the conversation on that.
Adding to this: AI is already being regulated. In the EU, you could argue that previous regulations (like GDPR) already had some impacts on AI, but regardless, the EU is now working on an AI Act that will unambiguously regulate AI broadly. The current proposal (also see some discussion on the EA forum) contains some lines that relate to, and could set precedents for, truthfulness-related topics, such as:
[prohibiting] the placing on the market, putting into service or use of an AI system that deploys subliminal techniques beyond a person’s consciousness in order to materially distort a person’s behaviour in a manner that causes or is likely to cause that person or another person physical or psychological harm
and
Providers shall ensure that AI systems intended to interact with natural persons are designed and developed in such a way that natural persons are informed that they are interacting with an AI system, unless this is obvious from the circumstances and the context of use.
There's not yet any concrete regulation that I know I'd be excited about pushing (truthfulness-related or otherwise). But I would expect further work to yield decent guesses about what kind of regulation is likely to be better/worse; and I'd be surprised if the answer was to ignore the space or oppose all regulation.
(Although I should note: Even if there will doubtless be some regulation of AI in general, that doesn't mean that there'll be regulation of all potentially-important subareas of AI. And insofar as there's currently little attention on regulation of particular sub-areas (including e.g. regulation that mentions alignment, or regulation of narrowly construed AI truthfulness), the situation with regard to pushing for regulation in those areas might be more similar to the general AI/regulation situation from 5 years ago.)
It seems like fine-tuning a general language model on truthful question-answers didn't generalize all that well (is that accurate?). How might you design or train a language model so that it would have some general factor of honesty that could be easily affected by fine-tuning?
It seems like fine-tuning a general language model on truthful question-answers didn't generalize all that well
I'm not sure what you're referring to. Is there some experiment you have in mind? People have finetuned general language models to answer general-knowledge questions correctly, but I don't know of people finetuning for being truthful (i.e. avoiding falsehoods).
How might you design or train a language model so that it would have some general factor of honesty that could be easily affected by fine-tuning?
There's a more general question. Can you train a model to have property X in such a way that finetuning is unlikely to easily remove property X? If any kind of finetuning is allowed, I think the answer will be "No". So you'd need to place restrictions on the finetuning. There is probably some work in ML on this problem.
To address the specific question about honesty/truthfulness, we discuss some ways to design language models in the executive summary and (at greater length) in Section 5 of the paper (e.g. discussion of "robust" truthfulness).
Huh, I think I hallucinated a result from the TruthfulQA paper where you fine-tuned on most of the dataset but didn't see gains on the held-out portion.
Okay, new AMA question: have you already done the experiment that I hallucinated? If not, what do you think would happen?
Would you rather align one big AI, or 100 small ones? (Of size ratio approx. that of a horse to a duck.)
I think it would be an easier challenge to align 100 small ones (since solutions would quite possibly transfer across).
I think it would be a bigger victory to align the one big one.
I'm not sure from the wording of your question whether I'm supposed to assume success.
Since we recently had an update from Redwood Research, what's your take on what they're doing? What problems (either technical or social/political) do you foresee them running into, and what are some avenues forward?
I'm relatively a fan of their approach (although I haven't spent an enormous amount of time thinking about it). I like starting with problems which are concrete enough to really go at but which are microcosms for things we might eventually want.
I actually kind of think of truthfulness as sitting somewhere on the spectrum between the problem Redwood are working on right now and alignment. Many of the reasons I like truthfulness as a medium-term problem to work on are similar to the reasons I like Redwood's current work.
We just published a long paper on Truthful AI (overview post). We’ll be running an Ask Me Anything on truthful AI from Tuesday (October 26) to Wednesday (October 27) at this post.
You may wish to ask about:
If you want to ask something just post a top-level comment. Also feel free to make comments that are not questions (e.g. objections or suggestions).