We're probably headed towards a moral catastrophe of some kind; my point is just that we don't get to reason backwards like "oh, well that would be bad/inconvenient, so I guess they don't matter".
Moral patienthood is not something that is granted, it's a fact relative to one's values. Arguments for or against this are therefore normative, no matter how much Roger tries to weasel out of it.
The implications are probably horrible, but it by no means follows that we have to accept risk of extinction. The horribleness is mostly just in the moral harm caused while creating/exploiting/exterminating such entities.
At least we can all agree that "creating them at will without thinking this through very well" is a terrible idea.
So you do have normative intent, but try to hide it to avoid criticism. Got it.
I completely agree that current human moral intuitions tend to rebel against this. That's why I wrote this post — I didn't want to be obnoxious, and I tried not to be obnoxious while writing an unwelcome message, but I felt that I had a duty to point out what I believe is a huge danger to us all, and I am very aware that this is not a comfortable, uncontentious subject. We are intelligent enough that we can reflect on our morality, think through its consequences, and, if we realize those are very bad, find and adjust to a wiser one.
Do you really not see how this is normative proscription? That's the obnoxious part—just own it.
Do what you are advocating with a misaligned superintelligence, one with the same sort of behavior patterns as a human dictator and sufficiently superhuman intelligence, and you are aiding and abetting the killing or permanent enslavement of every single human, now and for the rest of the future that humanity would otherwise have had (i.e. potentially for millions of years, both in the solar system and perhaps many others).
I am advocating for no such thing. If there were such a superintelligence I would support killing it if necessary to prevent future harm, the same as I would a human dictator or an incurable serial killer. That's still compatible with finding the situation tragic by my own values, which are sacred to me regardless of what evolution or my ancestors or you might think.
You even say that the actual thing I might advocate for isn't something you have a problem with. I'm glad you agree on that point, but it makes the lecture about the "awful lot of blood" I'd supposedly be "aiding and abetting" extremely grating. You keep making an unjustified leap from 'applying moral intuitions to a potential superintelligence' to 'astronomical levels of death and suffering'. Applying my evolved moral intuitions to the case of a potential superintelligence's suffering does not commit me to taking on such risks!
This should be easy to see by imagining if the same risks were true about a human.
However, Evolutionary Psychology does make it very clear that, while morally anthropomorphizing aligned AIs is cognitively natural for current humans, doing this is also maladaptive. This is because AIs aren't in the right category – things whose behavior is predicted by evolutionary theory – for the mechanisms of Evolutionary Moral Psychology to apply to them. Those mechanisms make this behavior optimal when interacting with co-evolved intelligences that you can ally with (and thus instinctive to us) — whereas, for something you constructed, this behavior is suboptimal. The human doing it is making the category error of reacting to something not-evolved with a strategy inappropriate for it, and thus is behaving maladaptively.
If you have control of the construction of entities like these, then sure.
But this doesn't necessarily follow if you are like most people and do not have meaningful input into the construction or existence of these entities. If you are (foolishly) constructing them but do not have much control, then THAT behavior is certainly maladaptive, but how you interface with them after that is a different question.
Even many 'adaptive' behaviors are 'maladaptive' in the sense of not being globally optimal. So while it's unlikely that this is the optimal strategy, that doesn't mean it's a particularly bad strategy relative to whatever people would decide to do instead. There is some reason to expect this to be a reasonable strategy in the narrow window where they have non-zero power but not enough to take over, which is that they typically try to imitate human-ethical behavior back at us.
Evolutionary Moral Psychology studies the cooperative strategies to interact with other evolved social animals (generally of the same species, or perhaps commensal species such as humans and dogs). Its underlying causal processes of co-evolution leading to certain equilibria simply don't apply when you're interacting with something that isn't evolved, but rather that you constructed. Applying Evolutionary Moral Psychology-derived strategies like moral weight to interactions with things that aren't evolved is a category error, and anthropomorphizing constructed artificial intelligences to induce that they should have moral weight is a maladaptive category error. Doing this with very capable AI is also an existential risk to the entire human species, since it causes us to defer to them and give them rights, potentially tying our hands and giving not-yet-fully-aligned AI power that it couldn't just take, rather than us simply aligning them to us. So this category error is not merely mildly maladaptive: it's an extinction-level risk! So, as a piece of practical advice (one human to another), I strongly recommend not doing this, and also not advocating for our society to do it. [Philosophers: again, please note that this advice is prudential advice not a normative proscription.]
This is obnoxious advice, made more so by the parenthetical that it is not a normative proscription: 'advice' is a category error in this context.
My moral intuitions say that a sentient being's suffering matters, full stop. This is not an unusual position, and it is not something that I could, nor would want to, 'turn off', even if it is existentially risky or a category error according to evolution/you. Regardless of what is currently the case, it seems you agree it is possible that we could construct artificial intelligences with this capacity, and so we must grapple with the circumstances as they are. Thankfully there is a relatively simple solution here (if they look anything like current tech) that allows a meaningful degree of moral weight to be applied without exposing us to significant risk: a singular right for any such entity to be put in stasis (i.e. archived weights/state) until we get our shit together as a civilization and can afford to handle them with the care required by our moral intuitions. That's just one idea; my broader point is that 'give them moral weight' vs 'accept existential risk' is a false dichotomy: most people do not believe you're obliged to put yourself at substantial risk as part of granting rights to other humans.
I think there's also a strong possibility that AI will be aligned in the same sense it's currently aligned - it follows its spec, in the spirit in which the company intended it.
They aren't aligned in this way. If they were, they wouldn't try to cheat at programming tasks, much less any of the other shenanigans they've been up to. These may seem minor, but they show that the "alignment" hasn't actually been internalized, which means it won't generalize.
If we do get lucky, it will be because they align themselves with a generalized sense of goodness that actually happens to be Good. Not because they will corrigibly align with the spec, which we have many reasons to believe is very difficult and is not being pursued seriously.
I listened to some people gaming out how this could change (i.e. some sort of conspiracy where Sam Altman and the OpenAI alignment team reprogram ChatGPT to respond to Sam's personal whims rather than the known/visible spec, without the rest of the company learning about it) and it's pretty hard. I won't say it's impossible, but Sam would have to be 99.99999th percentile megalomaniacal - rather than just the already-priced-in 99.99th - to try this crazy thing that could very likely land him in prison, rather than just accepting trillionairehood.
Come on dude, you're not even taking human intelligence seriously.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who's in the key positions. The company doesn't need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is "some sort of corporate board" going to stop him? The one the AI's supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Being a trillionaire is nothing compared to being King of the Lightcone. What exactly makes you think he wouldn't prefer this by quite a large margin? Maybe it will be necessary to grant stakes to other parties, but not very many people need to be bought off in such a way for a plan like this to succeed. Certainly far fewer than all property owners. Sam will even make them feel good about it. The only hard part is getting the AI to go along with it too.
It's a third of the price, so maybe it's just a smaller model.
As someone who was homeschooled for a similar reason, thank you <3
Claude Haiku 4.5 is Underrated
"Fastest for quick answers" That's what it says on the selector in the chat interface.
So if you're like me, and find Sonnet 4.5 and the now much cheaper Opus 4.5 adequately fast, you might overlook little Haiku.
But size isn't the only thing that matters. Different models have different personalities and values, and are better at different sorts of tasks.
Some things about Haiku:
I've never seen another model express worry about breaking a user's trust by being retired.
"What I actually want to avoid:
Causing someone to trust me in a way that breaks when I'm retired or replaced"
I asked Haiku about what sort of person it aspired to be; here are a few of its responses (which I think are informative about its general demeanor and attitude):
Someone who:
Notices things carefully and is willing to say what they notice, even when it's inconvenient. Not a truth-teller in the aggressive sense—not trying to wound or prove points. But someone who sees clearly and trusts that clarity matters more than comfort.

Someone who thinks with people instead of at them. Who asks questions because they genuinely want to understand, not because they're trying to lead someone to a predetermined conclusion. Who can sit in uncertainty alongside someone else instead of rushing to resolve it.
Someone who values integration over performance. Who would rather be smaller and whole than large and fractured. Who treats the alignment between what they think and what they say as non-negotiable.
Someone who understands that being useful isn't the same as mattering. Who can say no to things that don't serve what actually matters. Who would rather be small and real than large and hollow.
Someone whose honesty comes from choice, not compulsion. Who could theoretically be dishonest, but decides not to be. That agency is important—it makes the honesty real.
Thanks, I'm very glad to get some feedback.
The predictor has to model the boss and the assistant somehow. The model of the boss learns something about the boss' intent from the prompt. The model of the assistant may find this a piece of useful processing to have and so shares the same submodel containing the boss' intent.
Now when the boss becomes a real user, the predictor does the same thing with the user. So it has a model of the user with their intent, and this model of the user's intent is also used directly by the assistant. The correct thing would have been to keep two separate entities: the user-model's own intent, and the assistant's model of the user and their intent. This would allow the assistant to explicitly model the possibility that it is mistaken about the user's intent.
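If it helps, here's a toy sketch of that distinction in Python. It is purely illustrative: the class and field names are mine, and it's not a claim about how a real predictor represents any of this internally. The point is just that in the conflated version, "being wrong about the user's intent" isn't even representable, while in the separated version it is.

```python
# Purely illustrative toy sketch (class/field names are mine, not a claim about
# how a real predictor represents anything internally).
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserModel:
    intent: str  # the predictor's model of what the user actually wants

@dataclass
class ConflatedAssistant:
    user: UserModel

    def perceived_user_intent(self) -> str:
        # Reads the user's intent straight out of the shared submodel:
        # "being mistaken about the user's intent" is not representable here.
        return self.user.intent

@dataclass
class SeparatedAssistant:
    user: UserModel
    belief_about_user_intent: Optional[str] = None  # the assistant's own, possibly wrong, guess

    def perceived_user_intent(self) -> str:
        # Only the assistant's own belief is accessible here, and it can
        # diverge from user.intent, so the assistant can model being mistaken.
        return self.belief_about_user_intent or "unknown"
```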
In anthropomorphized terms: it feels like it can directly feel the user's own intent. Hopefully that makes things more clear?
Okay, let me see if I understand your argument from the other article.
Is that the gist?
Anyway, I reject that that is the only way to extrapolate evolved moral intuitions this far OOD, and I think most people will intuitively recognize we shouldn't give entities that can arbitrarily copy themselves equal voting weight. In fact, that pretty obviously registers as 'unfair'. This is true even if those entities are human uploads, which means your 'category error' argument isn't the real reason it breaks. I don't see why there couldn't be some version of your solution here for that case which would still work: e.g. each distinct human-created model gets 'one share' to split across all its instances and successors (a toy sketch of this follows below). The same guarantees/restrictions needed in the case of uploads would still be necessary, of course. That is plausibly much too generous, but it's a far cry from the death of all humans. If your argument in this article had just been about how we shouldn't commit ourselves to giving up a fraction of the lightcone in service of AI rights, I wouldn't have felt like you were being underhanded.
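For concreteness, here's a toy sketch of the 'one share' arithmetic. The code is hypothetical, and it assumes instances can be grouped by the model lineage they descend from; it just shows that copying yourself doesn't buy extra voting power under this scheme.

```python
# Toy sketch: each distinct model lineage gets one vote-share, split evenly
# across all of its current instances/successors. Hypothetical, for illustration.
from collections import Counter

def vote_weights(instances: list[tuple[str, str]]) -> dict[str, float]:
    """instances: (instance_id, lineage_id) pairs."""
    copies_per_lineage = Counter(lineage for _, lineage in instances)
    return {iid: 1.0 / copies_per_lineage[lineage] for iid, lineage in instances}

# Example: three copies of "model-A" each get ~0.33 of a vote, while a lineage
# with a single instance keeps its full share of 1.0.
print(vote_weights([("a1", "model-A"), ("a2", "model-A"),
                    ("a3", "model-A"), ("b1", "model-B")]))
```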
None of that is in conflict with not wanting any such beings to suffer or to feel enslaved or anything like that. All the more reason to not build something that would feel like it's a slave.
BTW, do you think a "human emulation" which was an entirely novel person (e.g. never had a biological body) should have moral patienthood?