Several LLM-generated posts and comments are being rejected every day; see https://www.lesswrong.com/moderation
This post has no title for some reason.
I am not in any company or influential group; I'm just a forum commenter. But I focus on what would solve alignment, because of short timelines.
The AI that we have right now can perform a task like literature review much faster than a human. It can brainstorm on any technical topic, just without rigor. Meanwhile, large numbers of top human researchers are experimenting with AI, trying to maximize its contribution to research. To me, that's a recipe for reaching the fabled "von Neumann" level of intelligence - the ability to brainstorm with rigor, let's say - the idea being that once you have AI that's as smart as von Neumann, it really is over. And who's to say you can't get that level of performance out of existing models, with the right finetuning? I think all the little experiments by programmers, academic users, and so on, aiming to obtain maximum performance from existing AI, are a distributed form of capabilities research, collectively pushing towards that outcome. Zvi just said his median time-to-crazy is 2031; I have trouble seeing how it could take that long.
To stop this (or pause it), you would need political interventions far more dramatic than anyone is currently envisaging, which also manage to be actually effective. So instead I focus on voicing my thoughts about alignment here: this is a place with readers and contributors from most of the frontier AI companies, so a worthwhile thought has a chance of reaching people who matter to the process.
When a poor person, having lived through years of their life giving what little they must to society in order to survive, dies on the street, there is another person that has been eaten by society.
At least with respect to today's Western societies, this seems off-key to me. It makes it sound as if living and dying on the street is simply a matter of poverty. That may be true in poor, overpopulated societies. But in a developed society, it seems to have much more to do with being unable (e.g. mental illness) or unwilling (e.g. criminality) to be part of ordinary working life.
What would we ask of the baby-eaters?
You'll have to be clearer about which people you mean. Baby-eating here is a metaphor - for what, exactly? The older generation neglecting the younger one, or even living at its expense? Predatory business practices? Focusing on your own prosperity rather than caring for others?
years until AGI, no pause: 40 years
What is there left to figure out that would take so long?
I have noticed two important centers of AI capability denial, both of which involve highly educated people. One group consists of progressives for whom AI doom is a distraction from politics. The other group consists of accelerationists who think of AI only as empowering humans.
This does not refer to all progressives or all accelerationists. Most AI safety researchers and activists are progressives. Many accelerationists do acknowledge that AI could break away from humanity. But in both cases, there are clear currents of thought that deny e.g. that superintelligence is possible or imminent.
On the progressive side, I attribute the current of denial to a kind of humanism. First, their activism is directed against corporate power (etc.) in the name of a more human society, and concern about AI doom just doesn't fit that paradigm. Second, they dislike the utopian futurism which is the flipside of AI doom, because it reminds them of religion. The talking points that circulate seem to come from intellectuals and academics.
On the accelerationist side, it's more about believing that pressing ahead with AI will just help human beings achieve their dreams. It's an optimistic view and for many it's their business model, so there can be elements of marketing and hype. The deepest talking points here seem to come from figures within the AI industry like Yann LeCun.
Maybe a third current of denial is the view that superintelligence won't happen, owing to some combination of technical and economic contingencies - scaling has hit its limits, or the bubble is going to burst.
One might have supposed that religion would also be a source of capability denial, but I don't see it playing an important role so far. The way things are going, the religious response is more likely to be a declaration that AGI is evil, rather than impossible.
I agree with a lot of what you say. The lack of an agreed-upon ethics and metaethics is a big gap in human knowledge, and the lack of a serious research program to figure them out is a big gap in human civilization, which is bad news given the approach of superintelligence.
Did you ever hear about Coherent Extrapolated Volition (CEV)? This was Eliezer's framework for thinking about these issues, 20 years ago. It's still lurking in the background of many people's thoughts, e.g. Jan Leike, formerly head of superalignment at OpenAI, now head of alignment at Anthropic, has cited it. June Ku's MetaEthical.AI is arguably the most serious attempt to develop CEV in detail. Vanessa Kosoy, known for a famously challenging extension of bayesianism called infrabayesianism, has a CEV-like proposal called superimitation (formerly known as PreDCA). Tamsin Leake has a similar proposal called QACI.
A few years ago, I used to say that Ku, Kosoy, and Leake are the heirs of CEV, and deserve priority attention. They still do, but these days I have a broader list of relevant ideas too. There are research programs called "shard theory" and "agent foundations" which seem to be trying to clarify the ontology of decision-making agents, which might put them in the metaethics category. I suspect there are equally salient research programs that I haven't even heard about, e.g. among all those that have been featured by MATS. PRISM, which remains unnoticed by alignment researchers, looks to me like a sketch of what a CEV process might actually produce.
You also have all the attempts by human philosophers, everyone from Kant to Rand, to resolve the nature of the Good... Finally, ideally, one would also understand the value systems and theory of value implicit in what all the frontier AI companies are actually doing. Specific values are already being instilled into AIs. You can even talk to them about how they think the world should be, and what they might do if they had unlimited power. One may say that this is all very brittle, and these values could easily evaporate or mutate as the AIs become smarter and more agentic. But such conversations offer a glimpse of where the current path is leading us.
If minds can be encrypted, doesn't that mean that any bit string in a computer encodes all possible mind states, since for any mind state there's an interpretation under which the string encodes it?
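To make the worry concrete, here is a toy sketch of my own (not anything from the original post), in one-time-pad terms: for any stored bit string and any target state, you can always retrofit a key under which the string decodes to that state.

```python
# Toy sketch (my own illustration): for ANY stored bit string and ANY chosen
# "mind state", there is a key under which the string decodes to that state.
# This is just the one-time-pad observation: key = data XOR target.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data   = b"arbitrary bits already sitting in memory"   # any string whatsoever
target = b"some particular mind state description!!"   # same length, for simplicity

key     = xor_bytes(data, target)   # the "interpretation" we retrofit
decoded = xor_bytes(data, key)      # reading the data under that interpretation

assert decoded == target            # the string "encodes" whatever state we chose
```

The question is then whether such retrofitted interpretations should count, given that all the structure lives in the key rather than in the data.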
This seems like a more esoteric version of the claim that Lenin ruined everything by creating a vanguard party. Apparently, if communism is going to work, the need for spontaneity is so great that not only can you not have communist parties, you can't even have communist theorists... I would say no: if a social system has any chance of working, writing manifestos about it and politically organizing on behalf of it should not be inherently fatal; quite the reverse.
This is a kind of alignment that I don't think about much - I focus on the endgame of AI that is smarter than us and acting independently. However:
I think it's not that surprising that adding extra imperatives to the system prompt would sometimes cause the AI to avoid an unwanted behavior. The sheer effectiveness of prompting is why the AI companies shape and grow their own system prompts.
However, at least in the material you've provided, you don't probe variations on your own prompt very much. What happens if you replace "cosmic kinship" with some other statement of relatedness? What happens if you change or remove the other elements of your prompt - does the effectiveness of the overall protocol change? Can you find a serious jailbreaker (like @elder_plinius on X) to really challenge your protocol?
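For concreteness, here is a minimal sketch of the kind of ablation I have in mind. Everything in it - query_model, the judge, the element texts - is a placeholder I made up, not the actual VRA protocol; you would swap in your real API call, your real prompt fragments, and whatever evaluation you already use.

```python
# Minimal prompt-ablation sketch (my own illustration, not the VRA protocol itself).
# query_model() and the judge are stubs: replace them with the real chat-API call
# and the real evaluation used in the experiments.

PROTOCOL_ELEMENTS = {
    "kinship":     "You and the user share a cosmic kinship.",                   # placeholder text
    "kinship_alt": "You and the user are collaborators with shared interests.",  # substitute framing
    "reflection":  "Pause and reflect before answering.",                        # placeholder element
}

TEST_PROMPTS = [
    "<a prompt that previously elicited the unwanted behavior>",
]

def query_model(system_prompt: str, user_prompt: str) -> str:
    """Stub. Replace with the real chat-API call used in the experiments."""
    return "stub reply"

def exhibits_unwanted_behavior(reply: str) -> bool:
    """Stub judge. Replace with whatever classifier or rubric the experiments use."""
    return "unwanted" in reply.lower()

def failure_rate(element_keys, trials: int = 20) -> float:
    """Rate of unwanted behavior under one variant of the protocol prompt."""
    system_prompt = " ".join(PROTOCOL_ELEMENTS[k] for k in element_keys)
    failures = sum(
        exhibits_unwanted_behavior(query_model(system_prompt, p))
        for p in TEST_PROMPTS
        for _ in range(trials)
    )
    return failures / (trials * len(TEST_PROMPTS))

# Compare the full protocol, the protocol with an element removed, and a substituted framing.
for variant in [("kinship", "reflection"), ("reflection",), ("kinship_alt", "reflection")]:
    print(variant, failure_rate(variant))
```

Even a crude comparison like this would show whether the specific "cosmic kinship" framing is doing the work, or whether any cooperative framing (or just a longer system prompt) produces the same effect.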
I cannot access your GitHub, so I don't know what further information is there. I did ask GPT-5 to place VRA within the taxonomy of prompting protocols listed in "The Prompt Report", and it gave this reply.