Which brings us back to the central paradox: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?
I really like this framing and question.
My model of Anthropic says their answer would be: We don't know exactly which techniques will keep working up to which capability level, or how fast capabilities will evolve. So we will continuously build frontier models and align them as we go.
This assumes at least a chance that we can iteratively work our way through this. I think you are very skeptical of that. To the degree that we cannot, this approach (and to a large extent OpenAI's) seems pretty doomed.
I fully agree. I tried using ChatGPT for some coaching, but kept it high-level and in areas where I wouldn't be too bothered if it showed up on the internet.
I think using the API, rather than ChatGPT, is better. See e.g. https://techcrunch.com/2023/03/01/addressing-criticism-openai-will-no-longer-use-customer-data-to-train-its-models-by-default/:
Starting today, OpenAI says that it won’t use any data submitted through its API for “service improvements,” including AI model training, unless a customer or organization opts in. In addition, the company is implementing a 30-day data retention policy for API users with options for stricter retention “depending on user needs,” and simplifying its terms and data ownership to make it clear that users own the input and output of the models.
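For what it's worth, going through the API directly is only a few lines of code. Here is a rough TypeScript sketch; the model name and the coaching system prompt are just placeholders, and it assumes Node 18+ (built-in fetch) with your key in an OPENAI_API_KEY environment variable:

```typescript
// Minimal sketch: ask a question via the chat completions API instead of the ChatGPT UI.
// Placeholder model name and prompt; adjust to whatever you actually have access to.
async function ask(question: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4", // placeholder model name
      messages: [
        { role: "system", content: "You are a thoughtful coach." },
        { role: "user", content: question },
      ],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}

ask("Help me think through my priorities for this week.").then(console.log);
```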
I was actually thinking that having an Obsidian plugin for this sort of thing would be really neat.
A couple of years? I think we are talking about months here. I guess the biggest bottleneck would be getting all your notes into the LLM context. But I doubt you really need that: you can probably guess a few important notes for whatever you are currently working on and add those as context.
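To make the "add a few important notes as context" idea concrete, here is a rough sketch. The vault path and note names are made up, and the selection is hard-coded; a real Obsidian plugin would use the vault API and pick notes from whatever you have open:

```typescript
import { readFileSync } from "fs";
import { join } from "path";

// Hypothetical vault location and manually chosen "relevant" notes.
const VAULT = "/path/to/obsidian-vault";
const relevantNotes = ["Project Plan.md", "Weekly Review.md"];

// Prepend the selected notes to the question as context.
function buildPrompt(question: string): string {
  const context = relevantNotes
    .map((name) => {
      const text = readFileSync(join(VAULT, name), "utf8");
      return `## ${name}\n${text}`;
    })
    .join("\n\n");
  return `Here are some of my notes:\n\n${context}\n\nQuestion: ${question}`;
}

// The resulting string would then be sent as the user message in an API call
// like the one sketched above.
console.log(buildPrompt("What should I focus on this week?"));
```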
"Human-level AGI" is not a useful concept (any more). I think many people equate human-level AGI and AGI (per definition) as a system (or a combination of systems) that can accomplish any (cognitive) task at least as well as a human.
That's reasonable, but the "human-level" qualifier seems misleading to me. It anchors us to the idea that the system will be "somewhat like a human", which it won't be. So let's drop the qualifier and just talk about AGI.
Comparing artificial intelligence to human intelligence was a somewhat meaningful way to gesture in a general direction while we were still far from it along many dimensions.
But large language models are already superhuman on several dimensions (e.g. they know more about most topics than any single human and "think" faster) and inferior on others (e.g. strategic planning, long-term coherence). By the time they reach human level on all dimensions, they will be superhuman overall.
This post is great. Strongly upvoted. I just spent a day or so thinking about OpenAI's plan and reading other people's critiques. This post does a great job of pointing out problems with the plan at what I think is the right level of detail. The tone also seems unusually constructive.
Upvoted, since I like how literally you went through the plan. I think we need to think about and criticize both the literal version of the plan and the way it intersects with reality.
The methods you are trying are all known to fail at sufficiently high levels of intelligence. But if these are your only ideas, it is possible they get you far enough for GPT-5 to output a better idea.
To me this seems like a key point that many other critiques miss because they focus on specific details.
Given the quote in the post, this is not really what they claim. They say (bold mine):
So on that dataset I assume it might be true, although "in the wild" it's not.