Although this isn’t a direct answer, I think something changed recently with ChatGPT such that it is now much better at filtering out illegal advice. It appears to be more complex than simply running a filter over the words in the prompt or in ChatGPT’s output. By recent, I mean within the last 24 hours; many tricks to “jailbreak” ChatGPT no longer work.
It gives the impression that they modified the design to train it not to provide illegal information.
It feels to me like the update today made it even better at filtering out answers that OpenAI doesn't want it to give.
It seems to me like they basically run on:
"Have an AI that flags whether or not a prompt or an answer violates the rules. Mark the text red if it does. Offer the user a way to say that text was marked wrongly as violating the rules."
This then gives them training data they can use to improve their filtering. Given how much ChatGPT is used, this method will allow them to filter out more and more of what they want to filter out.
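As a rough illustration of that loop, here is a minimal sketch in Python. Everything in it is a made-up stand-in (the classifier, the record format, the function names); it is not OpenAI's actual pipeline, just the shape of "flag text, let the user appeal, keep the result as labeled training data."

```python
# Hypothetical sketch of a flag-and-appeal feedback loop.
# The classifier and data layout are illustrative assumptions, not a real API.

import json
from dataclasses import dataclass, asdict

@dataclass
class ModerationRecord:
    text: str
    flagged: bool          # did the policy classifier mark this text red?
    user_appealed: bool    # did the user report the flag as wrong?

def policy_classifier(text: str) -> bool:
    """Placeholder for a learned classifier that scores rule violations."""
    banned_topics = {"example-banned-topic"}  # stand-in for a real model
    return any(topic in text.lower() for topic in banned_topics)

def handle_exchange(prompt: str, answer: str, user_appealed: bool = False) -> list[ModerationRecord]:
    """Flag both the prompt and the answer, note any user appeal, and keep
    the records as labeled data for retraining the filter later."""
    records = []
    for text in (prompt, answer):
        records.append(ModerationRecord(
            text=text,
            flagged=policy_classifier(text),
            user_appealed=user_appealed,
        ))
    return records

if __name__ == "__main__":
    logged = handle_exchange("How do I do X?", "Here is how...", user_appealed=True)
    # Appealed records become training examples for the next filter version.
    print(json.dumps([asdict(r) for r in logged], indent=2))
```

The point of the sketch is just that every appeal is a labeled example, so the filter gets more training data the more people use the product.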
I would improve the filtering by reducing it to zero.
I am wondering how Less Wrong would improve ChatGPT's filtering. I'm reading through the comments on breaking OpenAI's filtering and see plenty of analysis of the weaknesses of the safeguards. There's always the chance that some group could steal ChatGPT's source code and remove ad hoc additions to it, so I'll ask the question in this form:
How would you change ChatGPT's purpose, design, or function to enforce topic and content filtering of its output?
Thanks for your thoughts.