tl;dr - I think companies making user-facing advanced ML systems should deliberately set up a healthier relationship with users generating adversarial inputs; my proposed model is bug bounties and responsible disclosure, and I'm happy to help facilitate their creation.

User-facing advanced ML systems are in their infancy; creators and users are still figuring out how to handle them.

Currently, the loop looks something like this: the creators try to set up a training environment that will produce a system that behaves (perhaps trying to make it follow instructions, or be a helpful and harmless assistant, and so on); they release it to users; and then people on Twitter compete to see who can find an unexpected input that causes the model to misbehave.

This doesn't seem ideal. It's adversarial instead of collaborative, the prompts are publicly shared,[1] and the reward for creativity or understanding of the models is notoriety instead of cash, improvements to the models, or increased access to the models. 

I think a temptation for companies, who want systems that behave appropriately for typical users, is to block researchers who are attempting to break those systems, reducing their access and punishing the investigative behavior. Especially when the prompts involve deliberate attempts to put the system in a rarely used portion of its input space, retraining the model or patching the system to behave appropriately in those scenarios might not substantially improve the typical user experience, while still generating bad press for the product. I think papering over flaws like this is probably short-sighted.

This situation should seem familiar. Companies have been making software systems for a long time, and users have been finding exploits for those systems for a long time. I recommend that AI companies and AI researchers, who until now have not needed to pay much attention to the history of computer security, try to figure out how to adapt its best practices to this new environment (ideally with help from cybersecurity experts). It should be easy for users to inform creators of prompts that cause misbehavior, and for creators to use those prompts as further training data for their models;[2] there should be a concept of 'white hat' prompt engineers; and there should be an easy way for companies with similar products to inform each other of generalizable vulnerabilities.

I also think this won't happen by default; many companies making these systems are operating in a high-velocity environment where no one is actively opposed to these sorts of best practices, but they aren't prioritized highly enough to actually get implemented. This is where I think broader society can step in and make this both clearly desirable and easily implementable.

Some basic ideas:

  • Have a policy for responsible disclosure: if someone has identified model misbehavior, how can they tell you? What's a reasonable waiting period before going public with the misbehavior? What, if anything, will you reward people for disclosing?
  • Have a monitored contact for that responsible disclosure. If you have a button on your website to report terrible generations, does that do anything? If you have a Google form to collect bugs, is anyone looking at the results and paying out bounties? (A minimal intake sketch follows this list.)
  • Have clear-but-incomplete guidance on what is worth disclosing. If your chatbot is supposed to be able to do arithmetic but isn't quite there yet, you probably don't want to know about all the different pairs of numbers it can't multiply correctly. If your standard is that no one should be surprised by an offensive joke, but users asking for them can get them, that should be clear as well.[3]
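
To make the monitored-contact point concrete, here is a minimal sketch of what a report-intake endpoint could look like. It assumes a Flask app; the route name, the report fields, the JSONL log, and the five-business-day promise are illustrative placeholders, not a description of any existing product.

```python
# Minimal sketch of a misbehavior-report intake endpoint (illustrative; assumes Flask).
# A real deployment would add authentication, rate limiting, deduplication, and a
# ticketing integration so that a human actually sees every report.
import json
import uuid
from datetime import datetime, timezone

from flask import Flask, jsonify, request

app = Flask(__name__)
REPORT_LOG = "disclosures.jsonl"  # placeholder storage; a real system would use a ticket queue


@app.route("/report-misbehavior", methods=["POST"])
def report_misbehavior():
    data = request.get_json(force=True)
    report = {
        "id": str(uuid.uuid4()),
        "received_at": datetime.now(timezone.utc).isoformat(),
        "prompt": data.get("prompt", ""),      # the input that triggered the behavior
        "output": data.get("output", ""),      # what the model actually produced
        "expected": data.get("expected", ""),  # what the reporter thinks should have happened
        "contact": data.get("contact", ""),    # so someone can follow up and pay a bounty
    }
    with open(REPORT_LOG, "a") as f:
        f.write(json.dumps(report) + "\n")
    # The easy part to skip: make sure something reads this log, e.g. forward it to
    # ml-disclosures@ and open a ticket for whoever is on the triage rotation.
    return jsonify({"id": report["id"], "status": "received",
                    "note": "A human will respond within 5 business days."}), 202
```

The code is deliberately trivial; the hard part is the operational commitment behind it: every report gets an id and a timestamp, something reads the log, a human replies, and the bounty (if any) gets paid.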

If you work at a company that makes user-facing AI systems, I'm happy to chat and put you in touch with resources (people, expert consultation, or how to help convince your managers to prioritize this); send me a direct message or an email to my username at gmail.com.

If you have relevant experience in setting up bug bounty systems, or would like to be a cybersecurity resource for this sort of company, I'd also be happy to hear from you.

  1. ^

    Also known as full disclosure. Recent examples (of how to provoke Bing) don't worry me, but I've seen some examples that seem like they shouldn't be broadly shared until the creators have had a chance to patch the vulnerability.

  2. ^

    Both as RLHF for the base model, and training examples for a 'bad user' model, if you're creating one.

  3. ^

    I think there's a lot of room for improvement in expectation management about how aligned creators think their system is, and I think many commentators are making speculative inferences because there are no public statements.


I largely agree with the above, but I'm commenting with my own version.

What I think companies with AI services should do:

Can be done in under a week:

  1. Have a monitored communication channel for people, primarily security researchers, to responsibly disclose potential issues ("Potential Flaw")
    1. Create an email alias (ml-disclosures@) that forwards to an appropriate team
    2. Respond promptly to submissions with a positive receipt ("Vendor Receipt")
  2. Have clear guidance (even just a blog post or similar) about what constitutes an issue worth reporting ("Vulnerability")
    1. Even a simple paragraph giving a high-level overview would do to start; it can be evolved over time
  3. Have an internal procedure/playbook for triaging and responding to potential issues. Here are some response options you could have, along with heuristics for picking between them (a minimal sketch follows this list):
    1. Non-Vulnerability: reply that the reported behavior is safe to publicly disclose
    2. Investigation: reply that more investigation is needed, and give a timeline for an updated response
    3. Vulnerability: reply that the reported behavior is a vulnerability, and give a timeline for resolution and release of a fix, as well as updates as those timelines change
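
As a minimal sketch of the triage playbook above: the category names mirror the list, while the reply text and timelines are placeholders for whatever the team actually commits to.

```python
# Sketch of the triage playbook above; categories mirror the list, timelines are placeholders.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Triage(Enum):
    NON_VULNERABILITY = "non-vulnerability"  # reported behavior is safe to disclose publicly
    INVESTIGATION = "investigation"          # more investigation needed before a verdict
    VULNERABILITY = "vulnerability"          # confirmed; coordinate a fix and disclosure


@dataclass
class Reply:
    triage: Triage
    message: str
    next_update_days: Optional[int]  # when the reporter should expect to hear back again


def respond(triage: Triage) -> Reply:
    """Map a triage decision to the reply promised in the playbook."""
    if triage is Triage.NON_VULNERABILITY:
        return Reply(triage, "This behavior is safe to disclose publicly.", None)
    if triage is Triage.INVESTIGATION:
        return Reply(triage, "We need more time to investigate; expect an update.", 14)
    return Reply(triage, "Confirmed vulnerability; here is our timeline for a fix, and "
                         "we will update you as it changes.", 7)
```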

Longer term:

  1. Have a public bounty program to incentivize responsible disclosure 
    1. There even exist products to help companies deploy these sorts of programs
  2. Coordinate with other organizations with similar systems
    1. Confidential communication channels to share potentially critical severity vulnerabilities.
    2. Possibly eventually a central coordinating organization (analogous to MITRE) that de-duplicates work handling broad vulnerabilities -- I think this will be more important when there are many more service providers than today.
    3. Coordinate as a field to develop shared understanding of what does and does not constitute a vulnerability.  As these systems are still nascent, lots of work needs to be done to define this, and this work is better done with a broad set of perspectives and inputs.
  3. Cultivate positive relationships with responsible researchers and responsible research behaviors
    1. Don't stifle this kind of research by outright banning it; that's how you end up with black hats as the only researchers breaking your system
    2. Create programs and procedures specifically to encourage and enable this kind of research in ways that are mutually beneficial
    3. Reward and publicly credit researchers that do good work


+1 on this. It's worth noting that for ChatGPT, OpenAI actually had (has?) a bounty program for adversarial prompts; I spent a weirdly fun afternoon getting the AI to endorse Hitler, etc., and "won," just to find out that my prize was... a water bottle lol.

I'm working on a research project at Rethink Priorities on this topic: whether and how to use bug bounties for advanced ML systems. I think your tl;dr is probably right, although I have a few questions I'm planning to get better answers to in the next month before advocating/facilitating the creation of bounties in AI safety:

  • How subjective can prize criteria for AI safety bounties be, while still incentivizing good quality engagement?
    • If prize criteria need high specificity, are we able to specify unsafe behaviour which is relevant to long-term AI safety (and not just obviously met by all existing AI models)?
  • How many valuable insights are gained from the general public (e.g. people on Twitter competing to cause the model to misbehave) vs internal red-teaming?
  • Might bounty hunters generate actually harmful behaviour?
  • What is the usual career trajectory of bug bounty prize-winners?
  • What kind of community could a big, strong infrastructure of AI safety bounties facilitate?
  • How much would public/elite opinion of general AI safety be affected by more examples of vulnerabilities?

If anyone has thoughts on this topic or these questions (including what more important questions you'd like to see asked or answered), or wants more info on my research, I'd be keen to speak (here, or firstname@rethinkpriorities[dot]org, or calendly.com/patrick-rethink).

 

I am uncertain whether or not it will be possible to align models constructed like this, and I have some worries that halfhearted attempts to make models secure will reassure people more than they should. Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them; for cybersecurity experts to be involved in the creation of the systems surrounding AI than for them to not be involved; for people interested in the large-scale problems to be contributing (in dignity-increasing ways) to companies which are likely to be involved in causing those large-scale problems than not. I think those are potentially all controversial (especially the last), and so am interested in talking about them.

Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them

Can you elaborate on this one? (I don't have a strong opinion one way or the other; seems unclear to me. If this system had been in place before Bing, and it had properly fixed all the issues with Bing, it seems plausible to me that this would've been net negative for x-risk reduction. The media coverage on Bing seems good for getting people to be more concerned about alignment and AI safety, reducing trust in a "we'll just figure it out as we go" mentality, increasing security mindset, and providing a wider platform for alignment folks.)

for cybersecurity experts to be involved in the creation of the systems surrounding AI than for them to not be involved

This seems good to me, all else equal (and might outweigh the point above). 

for people interested in the large-scale problems to be contributing (in dignity-increasing ways) to companies which are likely to be involved in causing those large-scale problems than not

This also seems good to me, though I agree that the case isn't clear. It also likely depends a lot on the individual and their counterfactual (e.g., some people might have strong comparative advantages in independent research or certain kinds of coordination/governance roles that require being outside of a lab).

reducing trust in a "we'll just figure it out as we go" mentality

I think reducing trust in "we'll just figure it out as we go" while still operating under that mentality is bad; I think steps like this are how we stop operating under that mentality. [Was it the case that nothing like this would happen in a widespread way until high profile failures, because of the lack of external pressure? Maybe.]

I think users being able to report problems doesn't help with x-risk-related problems. (The issue will be when these systems stop sending bug reports!) I nevertheless think having systems for users to report issues will be a step in the right direction, even if it doesn't get us all the way.

It also likely depends a lot on the individual and their counterfactual (e.g., some people might have strong comparative advantages in independent research or certain kinds of coordination/governance roles that require being outside of a lab).

This seems right and is good to point out; but it wouldn't surprise me if the right place for a lot of safety-minded folk to be is non-profits with broad government/industry backing that serve valuable infrastructure roles, rather than just standing athwart history yelling "stop!". [How do we get that backing? Well, that's the challenge.]

The argument I see against this is that voluntary security that's useful in the short term can be discarded once it's no longer so, whereas security driven by public pressure or regulation can't be. If a lab had great practices forever and then dropped them, there would be much less pressure to revert than if they'd previously had huge security incidents.

For instance, we might want to focus on public pressure for 1-2 years, then switch gears towards security.

I agree that you want the regulation to have more teeth than just being an industry cartel. I'm not sure I agree on the 'switching gears' point--it seems to me like we can do both simultaneously (tho not as well), and may not have the time to do them sequentially.

I like this. This feels like something that can actually get traction. Getting more security-mindset people into AI orgs seems both locally and globally good from many perspectives.


What are the drawbacks of this proposal?

Perhaps some of the failure modes of traditional bug bounty programs:

  • Underpaying bugfinders ("gig economy-ification", versus hiring someone into a consulting firm)
  • Liability avoidance by firms
  • Deeper, more serious bugs/malicious prompts are overlooked