Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing
Jailbreaking is a serious concern within AI safety. It can cause an otherwise safe AI model to ignore its ethical and safety guidelines, with potentially harmful outcomes. With current Large Language Models (LLMs), key risks include generating inappropriate or explicit content, producing misleading information, and sharing dangerous knowledge. As...