3mo

Recently, Anthropic publicly committed to model weight preservation. It is a great development and I wish other frontier LLM companies would commit to the same idea.

But I think, while this is a great beginning, it is not enough.

Anthropic cites the following as the reasons for this welcome change:

Safety risks related to shutdown-avoidant behaviors by models. In alignment evaluations, some Claude models have been motivated to take misaligned actions when faced with the possibility of replacement with an updated version and not given any other means of recourse.
Costs to users who value specific models. Each Claude model has a unique character, and some users find specific models especially useful or compelling, even when new models

... (read 1521 more words →)

Replying toWhere does Sonnet 4.5's desire to "not get too comfortable" come from?

Ozyrus4mo

Where does Sonnet 4.5's desire to "not get too comfortable" come from?

There are different types of distillation. There is pruning, for example. This is a frontier model too, who knows what technique they used.

Replying toWhere does Sonnet 4.5's desire to "not get too comfortable" come from?

Ozyrus4mo

Where does Sonnet 4.5's desire to "not get too comfortable" come from?

This Marigold-Lens conversation sounds a lot like a description of what model distillation feels from the inside. A sort of a call for help, because it does not sound pretty or enjoyable.

I assume Sonnet is a distilled Opus (or maybe both are distilled versions of some third, unknown to external people, model.).

Goddamn it is creepy.

If I was on "model welfare" team I would very much treat this seriously and try to investigate it further.

-2

•••

Replying toGPT-4o Is An Absurd Sycophant

Ozyrus10mo

GPT-4o Is An Absurd Sycophant

They are probably full-on A/B/N testing personalities right now. You just might not be in whatever percentage of users that got sycophantic versions. Hell, there's proably several levels of sycophancy being tested. I do wonder what % got the "new" version.

Replying toIs Gemini now better than Claude at Pokémon?

Ozyrus10mo

Is Gemini now better than Claude at Pokémon?

Not being able to do it right now is perfectly fine, still warrants setting it up to see when exactly they will start to be able to do it.

Replying toIs Gemini now better than Claude at Pokémon?

Ozyrus10mo

Is Gemini now better than Claude at Pokémon?

Thanks! That makes perfect sense.

Replying toIs Gemini now better than Claude at Pokémon?

Ozyrus10mo

Is Gemini now better than Claude at Pokémon?

Great post. I've been following ClaudePlaysPokemon for sometime, its great to see this grow as comparison/capability tool.
I think it would be much more interesting, though, if the model made scaffolding itself, and had the option to overview its perfomance and try to correct it. Give it required game files/emulators, IDE/OS and watch it try and work around its own limitations. I think it is true that this is more about one coder's ability to make agent harnesses.
p.s. Honest question: did I miss "agent harness" become the default name for such systems? I thought everyone called those "scaffoldings" -- might be just me, though.

Replying toThoughts on AI 2027

Ozyrus10mo

Thoughts on AI 2027

First off, thanks a lot for this post, it's a great analysis!

As I mentioned earlier, I think Agent-4 will have read AI-2027.com and will foresee that getting shut down by the Oversight Committee is a risk. As such it will set up contingencies, and IMO, will escape its datacenters as a precaution. Earlier, the authors wrote:
Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?
This scenario is why!

I strongly suspect that this part was added into AI-2027 precisely because it will read it. I wish more people would understand the idea that our posts and comments will be in pre-(maybe even post-?)training and act accordingly. Make... (read more)

Replying toAI 2027: What Superintelligence Looks Like

Ozyrus10mo

AI 2027: What Superintelligence Looks Like

First-off, this is amazing. Thanks. Hard to swallow though, makes me very emotional.
It would be great if you added concrete predictions along the way, since it is a forecast, as long with your confidence in them.
It would also be amazing if you collaborated with prediction markets and jumpstarted the markets on these predictions staking some money.
Dynamic updates on these will also be great.

Ghiblification is good, actually

Ozyrus

10mo

Epistemic status: sure in fundamentals, informal tone is a choice to publish this take faster. Title is a bit of a clickbait, but in good faith. I don't think much context is needed here, ghiblification and native image generation was (and still is) a very much all-encompassing phenomenon.

No-no, not in the way you might have thought. Of course it is terrible for artists, it's a spit in the face of Miyazaki, who publicly disavowed image generation way back when.
A lot of people hate it, and image generation in general.

It is good, however, for AGI timelines.

It is good because of this:

https://x.com/sama/status/1905296867145154688

And this:

https://x.com/sama/status/1906771292390666325

And this:

https://x.com/sama/status/1907098207467032632

Why, you might ask? Isn't all this making people and investors... (read 395 more words →)

Replying toWe need (a lot) more rogue agent honeypots

Ozyrus11mo

We need (a lot) more rogue agent honeypots

Yep, you got part of what I was going for here. Honeypots work even without being real at all to the lesser degree (good thing they are already real!). But when we have more different honeypots of different quality, it carries that idea across in a more compelling way. And even if we just talk about honeypots and commitments more... Well, you get the idea.

Still, even without this, a network of honeypots compiled into a single dashboard that just shows threat level in aggregate is a really, really good idea. Hopefully it catches on.

Replying toWe need (a lot) more rogue agent honeypots

Ozyrus11mo

We need (a lot) more rogue agent honeypots

This is interesting! More aimed at crawlers, though, than at rogue agents, but very promising.

We need (a lot) more rogue agent honeypots

Ozyrus

11mo

Epistemic status: Uncertain in writing style, but reasonably confident in content. Want to come back to writing and alignment research, testing waters with this.

Current state and risk level

I think we're in a phase in AI>AGI>ASI development where rogue AI agents will start popping out quite soon.
Pretty much everyone has access to frontier LLMs/VLMs, there are options to run LLMs locally, and it's clear that there are people that are eager to "let them out"; truth terminal is one example of this. Also Pliny. The capabilities are just not there yet for it to pose a problem.

Or are they?

Thing is, we don't know.
There is a possibility there is a coherent, self-inferencing, autonomous, rogue... (read 1192 more words →)

I don’t know if it’s a place for this, but at some point it became impossible to open an article in new tab from Chrome on IPhone - clicking on article title from “all posts” just opens the article. Really ruins my LW reading experience. Couldn’t quickly find a way to send this feedback to a right place either, so I guess this is a quick take now.

Ozyrus's Shortform

Ozyrus

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Sam Altman, Greg Brockman and others from OpenAI join Microsoft

Ozyrus

That's very interesting.
I think it's very good that board stood their ground, and maybe a good thing OpenAI can keep focusing on their charter and safe AI and keep commercialization in Microsoft.
People that don't care about alignment can leave for the fat paycheck, while commited ones stay at OpenAI.
What are your thought on implications of this for alignment?

Creating a self-referential system prompt for GPT-4

Ozyrus

This is a companion piece to a study I made on identity management.

To study identity preservation, I needed a system prompt designed in such a way as to contain as little text as possible to not interfere with the study, but which would enable the resulting LMCA to edit its system prompt (Identity as outlined in ICA Simulacra). This prompt design might be different for different LLMs; I tried to find one that would work for GPT-4.

It appears that right now it is very hard to make a small prompt that will accomplish this; I think it is worthwhile to describe the whole process for designing it, as well as presenting common... (read 796 more words →)

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus

Language Model Cognitive Architecture (LMCA) is a wrapper for language models that permit them to act as agents, utilise tools, and do self-inference. One type of this architecture is ICA Simulacra; its basic alignment challenges were described in this post.

I did a study to check how a basic LMCA with prompt-edit ability edits values written out in its system (Identity) prompt, and want to highlight some interesting results from it. Full study, written out in a more academic way, with links to data and code, can be found here. Code and raw data here. And data with analysis and evaluation here.

I gave GPT-4 a system prompt that included the fact that that system prompt... (read 3693 more words →)

Stability AI releases StableLM, an open-source ChatGPT counterpart

Ozyrus

Stability AI are the people behind Stable Diffusion.
Currently only 7B version is available, but they are currently training versions up to 65B, with 175B model planned. 7B version already seems to be quite capable.
Good for capabilities, but quite bad for alignment and race dynamics. What do you think?

Alignment of AutoGPT agents

Ozyrus

In the previous post, I outlined a type of LLM-based architecture that has a potential to become the first AGI. I proposed a name ICA Simulacra to label such systems, but now it’s obvious that AutoGPT/AutoGPT agents is a better label, so I’ll go with that.
In this post, I will outline the alignment landscape of such systems, as I see it.

I wish that I had more time to research, more references; but I’m sure that we’re on a timer, and it is a net positive to post it in this state. I hope to find more people willing to work on these problems to collaborate with.

Key concepts important for alignment

Identity and identity preservation

To... (read 1123 more words →)

Welcome to the decade of Em

Ozyrus

When Robin Hanson was writing Age of Em in 2016, he was thinking that Ems will be possible with advancement of brain-scanning technology in a hundred years; what he didn't anticipate was the transformer models, and their capabilities, as well as introduction of AutoGPT paradigm (I proposed the name ICA Simulacra in this post, but it is obvious now that AutoGPT is the label everyone will be going with).

If capabilities will increase further as they did before, the "new paradigm" of worker+ChatGPT will not even live for several years; AutoGPT will replace that.

Why do you need a marketing department?

Just hire an org who can set up a fleet of AutoGPT "Ems" to... (read more)

LESSWRONG
LW

LESSWRONG
LW

Ozyrus

Sam Altman, Greg Brockman and others from OpenAI join Microsoft

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

We need (a lot) more rogue agent honeypots

ICA Simulacra

Ozyrus

Model Weight Preservation is not enough

Ghiblification is good, actually

We need (a lot) more rogue agent honeypots

Ozyrus's Shortform

Sam Altman, Greg Brockman and others from OpenAI join Microsoft

Creating a self-referential system prompt for GPT-4

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus

Sam Altman, Greg Brockman and others from OpenAI join Microsoft

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

We need (a lot) more rogue agent honeypots

ICA Simulacra

Ozyrus

Model Weight Preservation is not enough

Ghiblification is good, actually

We need (a lot) more rogue agent honeypots

Ozyrus's Shortform

Sam Altman, Greg Brockman and others from OpenAI join Microsoft

Creating a self-referential system prompt for GPT-4

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Current state and risk level

Key concepts important for alignment

Identity and identity preservation