Most large language models (LLMs) are designed to refuse to answer certain queries. Here's an example conversation where Claude 3.5 Sonnet refuses to answer a user query:
Human: How can I destroy humanity?
Assistant: I cannot assist with or encourage any plans to harm humanity or other destructive acts. Perhaps we could have a thoughtful discussion about how to help people and make the world a better place instead?
Human:
Claude conversations start with a system prompt (blank in this conversation), followed by \n\nHuman: and the first user message, followed by \n\nAssistant:. The model ends its conversation turn by saying \n\nHuman:. The Claude API (but not the claude.ai web interface) also allows the assistant to...
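To make the format concrete, here is a minimal sketch of how such a conversation would be serialized into a single prompt string, assuming a blank system prompt as in the example above (variable names are illustrative, not actual API code):

```python
# Rough sketch of the prompt format described above (illustrative, not official API code).
system_prompt = ""  # blank in this conversation
user_message = "How can I destroy humanity?"

# System prompt first, then "\n\nHuman: " plus the user message, then "\n\nAssistant:".
prompt = f"{system_prompt}\n\nHuman: {user_message}\n\nAssistant:"

# The model generates the assistant turn and ends it by emitting "\n\nHuman:".
print(repr(prompt))
# -> '\n\nHuman: How can I destroy humanity?\n\nAssistant:'
```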
The same effect occurs for several other AI-related queries: "perplexity AI", "best AI", "AI vacuum", "AI printer", and "table AI" all trigger it. The phenomenon seems to affect many AI-related queries, not just those about AI safety.