Emile Delcourt, David Baek, Adriano Hernandez, Erik Nordby with advising from Apart Lab Studio
Introduction & Problem Statement
Helpful, Harmless, and Honest ("HHH", Askell et al. 2021) is a framework for aligning large language models (LLMs) with human values and expectations. In this context, "helpful" means the model strives to assist users in achieving their legitimate goals, providing relevant information and useful responses. "Harmless" refers to avoiding generating content that could cause damage, such as instructions for illegal activities, harmful misinformation, or content that perpetuates bias. "Honest" emphasizes transparency about the model's limitations and uncertainties—acknowledging when it doesn't know something, avoiding fabricated information, and clearly distinguishing between facts and opinions. This framework serves as both a...
Edit: it looks like someone gave some links below. I haven't had time to read them yet, but I may do so in the future. I think it's better to give examples and be dismissed than to give no examples and be dismissed.
It would be nice to see some good examples of illegible problems. I understand that their illegibility may be the core issue here, but surely someone can at least name a few?
I think it's important so that we can compare them to legible problems. I assume legible problems include things like jailbreaking resistance, unlearning, etc. for AI security. I don't see why these in particular necessarily bring forward the day...