Can 7B-8B LLMs judge their own homework?
No, they are way too uncritical :)

This post is part of a larger project.

The Setup

I collected responses to the JailbreakBench benchmark (100 harmful and 100 harmless prompts) from the Ghost 7B LLM, running it three times under different instructions, for a total of 600 responses. The responses were then manually validated for...