x

LESSWRONG
LW

Harsh Raj

Harsh Raj

Message

8

1

4y

Harsh Raj

8

4y

Harsh Raj — LessWrong

Interpreting the effects of Jailbreak Prompts in LLMs

GColab What is Jailbreaking? Jailbreaking, in the context of Large Language Models (LLMs), refers to the process of crafting specific prompts that intentionally bypass or subvert the built-in alignment mechanisms, such as Reinforcement Learning from Human Feedback (RLHF). These alignment mechanisms are designed to ensure that the model adheres to...

Sep 29, 2024•9