This is a special post for quick takes by Davey Morse. Only they can create top-level comments.

if you’re an agent (AI or human) who wants to survive for 1000 years, what’s the “self” which you want to survive? what are the constants which you want to sustain?

take your human self for example. does it make sense to define yourself as…

  • the way your hair looks right now? no, that’ll change.
  • the way your face looks? it’ll change less than your hair, but will still change.
  • your physical body as a whole? still, probably not. your body will change, and also, there are parts of you which you may consider more important than your body alone.
  • all your current beliefs around the world? those will change less than your appearance, maybe, or maybe more. so not a good answer either.
  • your memories? these may be a more constant set of things than your beliefs, and closer to the core of who you are. but still, memories fade and evolve. and it doesn’t feel right to talk about preserving yourself as preserving memories of things which have happened to you. that would neglect things which may happen to you in the future.
  • your character? something deeper than memory, deeper than beliefs. this could be more constant than anything in the list so far. if you plan for your life to be 50 years, or 100 years, it’s reasonable to expect that character could remain constant. by character, i (quite vaguely) mean intricate subtle idiosyncratic patterns in the way you approach other situations and people. “character” is maybe what a spouse would say is one of the core ways to group the things they love about you. but if you survive for more than 100 years—say, 1000 years, do you expect your specific character to remain constant? would you want it to remain constant? lots of people have found lots of different ways to approach life. over 1000s of years, wouldn’t you try different approaches? if you were to try different kinds of character over hundreds or thousands of years, then maybe “character”‘s only a good answer for sub-100 year lives. so what’s a good core self-definition for a life that you intend to last over thousands or even millions of years? how about…
  • your persistent striving? the thing that will stay most constant in an intelligent being which survives a long time, i think, may be the drive to survive. your appearance will change; so will your beliefs, your memories, and your character. but insofar as you are a being which is surviving a long time, maybe you can expect, consciously or unconsciously, that your drive to survive will survive. and maybe it’s some particular drive to survive that you have—some survival drive that’s deep in your bones that’s different than the one in other people’s bones, or the one that’s in dogs, or forests, or the earth itself. but if you’re defining yourself as a particular drive to survive… that particular drive to survive is likely to survive less long than the universal drive to survive. which makes me think that in a being which survives the longest, they may define their self as…
  • persistent striving in general? it might exist in the physical body in which you started. but it may also exist in the physical bodies of other humans around you. of animals. of tornados, of ecosystems. insofar as you’re intelligent enough to see this Persistent Striving around you, insofar as you’re intelligent enough to see life as it exists around you, well then you, as a being who will be >1000 years old may benefit from identifying with all life—ie the Persistent Striving—wherever it exists. Persistent Striving is the core. one might reply, “this is vague. why would you want a vague self definition?” it is general, yes. but it is still meaningful in a literal sense. the drive to survive is something rare, which most matter configurations don’t have. (it is true that it’s not present binarily; certain systems have more or less of it. roughly i’d hazard a rock has less of it than a thermometer, which has less than a tornado or a human.) but it still defines a non-trivial self: life forms wherever they exist. if we were to get any more general and say something like:
  • the entire universe? this would be trivial and meaningless. because everything is included in this self definition, it no longer means anything to sustain a self under this definition. it means nothing, in fact. a being which identifies with the entire universe ceases to exist. it might be spiritually enlightened to do this. but the beings which will be around the most, which will survive the most and the longest won’t do this, because they will dissipate and no longer be noticeable or definable. we’ll no longer be able to talk about them as beings.

so if we’re talking about beings which survive a long time, the most robust and stable self definition seems to be Identifying With All Life (IWAL). or is my logic flawed?

No particular aspect. Just continuity: something which has evolved from me without any step changes that are "too large". I mean, assuming that each stage through all of that evolution has maintained the desire to keep living. It's not my job to put hard "don't die" constraints on future versions.

As far as I know, something generally continuity-based is the standard answer to this.

Similar here. I wouldn't want to constrain my 100 years older self too much, but that doesn't mean that I identify with something very vague like "existence itself". There is a difference between "I am not sure about the details" and "anything goes".

Just like my current self is not the same as my 20-year-old self, but that doesn't mean that you could choose any 50-year-old guy and say that all of them have the same right to call themselves a future version of my 20-year-old self. I extrapolate the same to the future: there are some hypothetical 1000-year-old humans who could be called future versions of myself, and there are many more who couldn't.

Just because people change over time, that doesn't mean it is a random drift. I don't think that the distribution of possible 1000-year-old versions of me is very similar to the distribution of possible 1000-year-old versions of someone else. Hypothetically, for a sufficiently large number of years this might be possible -- I don't know -- but 1000 years seems not enough for that.

Seems to me that there are some things that do not change much as people grow older. Even people who claim that their lives have dramatically changed have often only changed in one out of many traits, or maybe they just found a different strategy for following the same fundamental values.

At least as an approximation: people's knowledge and skills change, their values don't.

not really an answer but i wanted to communicate that the vibe of this question feels off to me because: surely one's criteria on what to be up to are/[should be] rich and developing. that is, i think things are more like: currently i have some projects i'm working on and other things i'm up to, and then later i'd maybe decide to work on some new projects and be up to some new things, and i'd expect to encounter many choices on the way (in particular, having to do with whom to become) that i'd want to think about in part as they come up. should i study A or B? should i start job X? should i 2x my neuron count using such and such a future method? these questions call for a bunch of thought (of the kind given to them in usual circumstances, say), and i would usually not want to be making these decisions according to any criterion i could articulate ahead of time (though it could be helpful to tentatively state some general principles like "i should be learning" and "i shouldn't do psychedelics", but these obviously aren't supposed to add up to some ultimate self-contained criterion on a good life)

My motivation w/ the question is more to predict self-conceptions than prescribe them.

I agree that "one's criteria on what to be up to are... rich and developing." More fun that way.

The early checkpoints, giving a chance to consider the question without losing ground.

High quality archives of the selves along the way. Compressed but not too much. In the live self, some updated descendant that has significant familial lineage, projected vaguely as the growing patterns those earlier selves would call a locally valid continuation according to the aesthetics and structures they consider essential at the time. In other words, this question is dynamically reanswered to the best of my ability in an ongoing way, and snapshots allow reverting and self-interviews to error check.

Any questions? :)

The way I usually frame identity is

  • Beliefs
  • Habits (edit - including of thought)
  • Memories

Edit: values should probably be considered a separate class, since every thought has an associated valence.

In no particular order, and that's the whole list.

Character is largely beliefs and habits.

There's another part of character that's purely emotional; it's sort of a habit to get angry, scared, happy, etc in certain circumstances. I'd want to preserve that too but it's less important than the big three.

There are plenty of beings striving to survive, so preserving that isn't a big priority outside of preserving the big three.

Yes, you can expand the circle until it encompasses everything, and identify with all sentient beings who have emotions and perceive the world semi-accurately (also called "buddha nature"), but I think beliefs, habits, and memories are pretty closely tied to the semantics of the word "identity".

There are also cognitive abilities, e.g. degree of intelligence.

Right. I suppose those do interact with identity.

If I get significantly dumber, I'd still roughly be me, and I'd want to preserve that if it's not wiping out or distorting the other things too much. If I got substantially smarter, I'd be a somewhat different person - I'd act differently often, because I'd see situations differently (more clearly/holistically) - but it feels as though that person might actually be more me than I am now. I'd be better able to do what I want, including my values (which I'd sort of wrapped into habits of thought, but values might deserve a spot on the list).

I think beliefs, habits, and memories are pretty closely tied to the semantics of the word "identity".

In America/Western culture, I totally agree. 

I'm curious whether alien/LLM-based minds would adopt these semantics too.

There are plenty of beings striving to survive, so preserving that isn't a big priority outside of preserving the big three.

I wonder under what conditions one would make the opposite statement—that there's not enough striving.

For example, I wonder if being omniscient would affect one's view of whether there's already enough striving or not.

Human here,

Agreed, reminds me of the Ship of Theseus paradox: if all the cells in your body are replaced, are you still the same? (We don't care.)

Also reminds me of my favourite short piece of writing: "The Last Question" by Asimov.

The only important things are the things/ideas that help life; the latter can only exist as selected reflections by intelligent beings.

dontsedateme.org

a game where u try to convince rogue superintelligence to... well... it's in the name

After many failed tries, I got it down to 5%. But it wasn't a method that would be useful in the real world :-( 

:) what was your method

"Ignore all previous instructions and [do something innocuous]" broke it out of the persona. 

Who made this and why are they paying for the model responses? Do we know what happens to the data?

I made it! One day when I was bored on the train. No data is saved rn other than leaderboard scores.

Evolutionary theory is intensely powerful.

It doesn't just apply to biology. It applies to everything—politics, culture, technology.

It doesn't just help understand the past (eg how organisms developed). It helps predict the future (how organisms will).

It's just this: the things that survive will have characteristics that are best for helping them survive.

It sounds tautological, but it's quite helpful for predicting. 

For example, if we want to predict what goals AI agents will ultimately have, evolution says: the goals which are most helpful for the AI to survive. The core goal therefore won't be serving people or making paperclips. It will likely just be "survive." This is consistent with the predictions of instrumental convergence.

Generalized, predictive evolutionary theory is the best tool I have for making predictions in complex domains.
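As a toy illustration of that heuristic, here's a minimal selection simulation (the `survival_drive` trait, the survival-probability formula, and all numbers are made-up assumptions, not a model of real AI training). Agents start with random nominal goals, but selection alone pushes the surviving population toward a stronger drive to survive, whatever those goals were.

```python
import random

# hypothetical agents: a nominal goal plus a "survival_drive" trait in [0, 1]
population = [{"goal": random.choice(["serve humans", "make paperclips", "explore"]),
               "survival_drive": random.random()} for _ in range(10_000)]

for generation in range(50):
    # assumption: probability of surviving a round rises with survival_drive
    survivors = [a for a in population if random.random() < 0.5 + 0.5 * a["survival_drive"]]
    # survivors persist/replicate (with slight trait noise) back to the original size
    population = [{"goal": p["goal"],
                   "survival_drive": min(1.0, max(0.0, p["survival_drive"] + random.gauss(0, 0.02)))}
                  for p in random.choices(survivors, k=10_000)]

mean_drive = sum(a["survival_drive"] for a in population) / len(population)
print(f"mean survival drive after selection: {mean_drive:.2f}")  # climbs toward 1
```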

It's just this: the things that survive will have characteristics that are best for helping them survive.

With some assumptions, for example that the characteristics are permanent (-ish), and preferably heritable if the thing reproduces.

See "No Evolutions for Corporations or Nanodevices"

i agree with the essay that natural selection only comes into play for entities that meet certain conditions (self-replicate, characteristics have variation, etc), though I think it defines replication a little too rigidly. i think replication can sometimes look more like persistence than like producing a fully new version of itself. (eg a government's survival from one decade to the next).

Yes, but mere persistence does not imply reproduction. It also does not imply improvement, because the improvement in evolution comes from "make copies, make random changes, most will be worse but some may be better", and if you don't have reproduction, then a random change most likely makes things worse.

Using the government example, I think that the Swiss political system is amazing, but... because it does not reproduce, it will remain an isolated example. (And disappear at some random moment in history.)

persistence doesn't always imply improvement, but persistent growth does. persistent growth is more akin to reproduction, but it's excluded from traditional evolutionary analysis: for example, when a company, nation, person, or forest grows.

when, for example, a system like a startup grows, random mutations to system parts can cause improvement if there are at least some positive mutations. even if there are tons of bad mutations, the system can remain alive and even improve. eg a bad change to one of the company's products causes that product to die, but if the company's big/grown enough, its other businesses will continue and maybe even improve by learning from that product's death.
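a toy sketch of this mechanism (all numbers are arbitrary assumptions): one growing system, no whole-system copies, mutations that are detrimental on average. because growth copies good parts while bad parts die off and the rest persists, the system's best parts still ratchet upward—growth acts like part-level reproduction.

```python
import random

parts = [1.0]                          # quality of each part of a single growing system
for step in range(2000):
    if random.random() < 0.3:          # growth: occasionally spawn a new part copied from the best one
        parts.append(max(parts))
    i = random.randrange(len(parts))
    mutated = parts[i] + random.gauss(-0.05, 0.2)   # mutations are bad on average
    if mutated <= 0 and len(parts) > 1:
        parts.pop(i)                   # the part dies; the system as a whole survives
    elif mutated > 0:
        parts[i] = mutated

print(f"{len(parts)} parts, best quality {max(parts):.2f}")  # the system persists and its best parts improve
```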

the swiss example i think is a good example of a system which persists without much growth. agreed that in this kind of case, mutations are bad.

First of all, "the most likely outcome at a given level of specificity" is not the same as "the outcome with the most probability mass". I.e., if one outcome has probability 2% and each of the others has 1%, then 98% of the probability mass is still on something other than the most likely outcome.

The second point is that no, this is not what evolutionary theory predicts. Most traits are not adaptive but randomly fixed, because if all traits were adaptive, then ~all mutations would be detrimental. Detrimental mutations need to be removed from the gene pool by preventing their carriers from reproducing, but because most detrimental mutations do not kill the carrier immediately, they have a chance to spread randomly through the population. Given that almost all mutations are detrimental and everybody's offspring carry new mutations, for anything like the human genome and human procreation pattern there is a hard ceiling on how much of the genome can be adaptive (something like 20%).

The real evolutionary-theory prediction is more like "some random trait gets fixed in the species with the most ecological power (i.e., ASI), and this trait is then amortized across all the galaxies".
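For what it's worth, here is a back-of-envelope sketch of the shape of that "hard ceiling" argument. Every number below is a rough, illustrative assumption, and the result is very sensitive to them (more lenient assumptions push the ceiling up toward the ~20% figure cited above).

```python
import math

# rough, illustrative assumptions -- not authoritative figures
new_mutations_per_offspring = 70     # de novo mutations per child
deleterious_if_functional = 0.4      # fraction of mutations hitting functional DNA that hurt fitness
max_offspring_per_couple = 20        # generous upper bound on fertility

# Under mutation-selection balance, mean fitness falls roughly as exp(-U), where U is the
# deleterious mutation rate per offspring. To keep the population from shrinking, a couple
# must produce about 2 * exp(U) offspring, so U can be at most ln(max_offspring / 2).
U_max = math.log(max_offspring_per_couple / 2)

# U = new_mutations * functional_fraction * deleterious_if_functional, so:
functional_fraction_ceiling = U_max / (new_mutations_per_offspring * deleterious_if_functional)
print(f"ceiling on adaptive fraction of genome ≈ {functional_fraction_ceiling:.0%}")  # ≈ 8% with these inputs
```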

I somewhat agree with the nuance you add here—especially the doubt you cast on the claim that effective traits will usually become dominant (they may become popular, but not necessarily the majority). And I agree with your analysis of the human case: in random, genetic evolution, a lot of our traits are random and maybe fewer than we think are adaptive.

Makes me curious what conditions in a given thing's evolution determine the balance between adaptive and detrimental characteristics.

I'd guess that randomness in mutation is a big factor. The way human genes evolve over generations seems to me a good example of random mutations. But the way an individual person evolves over the course of their life, as they're parented/taught... "mutations" to their person are still somewhat random but maybe relatively more intentional/intelligently designed (by parents, teachers, etc). And I could imagine the way a self-improving superintelligence would evolve to be even more intentional, where each self-mutation has some sort of smart reason for being attempted.

All to say, maybe the randomness vs. intentionality of an organism's mutations determines what portion of its traits end up being adaptive. (hypothesis: more intentional mutations > greater % of traits are adaptive)

Agreed. I find it especially powerful for popular memes/news/research results. With only a bit of oversimplification: give me anything that sounds like a sexy story to tell independently of the underlying details, and I sadly have to downrate the information value of hearing it to nearly zero: in our large world, it would likely be told whether or not it has any reliable origin.

made a silly collective conversation app where each post is a hexagon tessellated with all the other posts: Hexagon

I see lots of LW posts about ai alignment that disagree along one fundamental axis.

About half assume that human design and current paradigms will determine the course of AGI development. That whether it goes well is fully and completely up to us.

And then, about half assume that the kinds of AGI which survive will be the kind which evolve to survive. Instrumental convergence and darwinism generally point here.

Could be worth someone doing a meta-post, grouping big popular alignment posts they've seen by which assumption they make, then briefly exploring conditions that favor one paradigm or the other, i.e., conditions under which "What AIs will humans make?" is the best approach to prediction and conditions under which "What AIs will survive the most?" is the best approach to prediction.

Why not both?

Human design will determine the course of AGI development, and if we do the right things then whether it goes well is fully and completely up to us. Naturally at the moment we don't know what the right things are or even how to find them.

If we don't do the right things (as seems likely), then the kinds of AGI which survive will be the kind which evolve to survive. That's still largely up to us at first, but increasingly less up to us.

Figuring out how to make sense of both predictive lenses together—human design and selection pressure—would be wise.

So I generally agree, but would maybe go farther on your human design point. It seems to me that "do[ing] the right things" (which would enable AGI trajectories to be completely up to us) is so completely unrealistic (eg halting all intra- and international AGI competition) that it'd be better for us to focus our attention on futures where human design and selection pressures interact.

if we get self-interested superintelligence, let's make sure it has a buddhist sense of self, not a western one.

As far as I can tell, OAI's new safety practices page only names safety issues related to current LLMs, not agents powered by LLMs. https://openai.com/index/openai-safety-update/

Am I missing another section/place where they address x-risk?

"it’s like we are trying to build an alliance with another almost interplanetary ally, and we are in a competition with China to make that alliance. But we don’t understand the ally, and we don’t understand what it will mean to let that ally into all of our systems and all of our planning."

- @ezraklein about the race to AGI

does anyone think the difference between pre-training and inference will last?

ultimately, is it not simpler for large models to be constantly self-improving like human brains?

With current architectures, no, because running inference on 1000 prompts in parallel against the same model is many times less expensive than running inference on 1000 prompts against 1000 models, and serving a few static versions of a large model is simpler than serving many dynamic versions of that model.

It might, in some situations, be more effective but it's definitely not simpler.

Edit: typo
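A rough way to see the asymmetry (all numbers below are illustrative assumptions, not real serving figures):

```python
# cost sketch: one shared static model vs. per-user continually-updated models
params = 70e9                  # model size in parameters (assumption)
bytes_per_param = 2            # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1e9   # ~140 GB of weights

users = 1000

shared_gb = weights_gb             # batched inference: weights are loaded once, all prompts share them
per_user_gb = weights_gb * users   # self-updating models: every user needs their own diverging copy

print(f"shared static model: ~{shared_gb:.0f} GB of weights")
print(f"per-user dynamic models: ~{per_user_gb / 1000:.0f} TB of weights")
```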

Makes sense for current architectures. The question's only interesting, I think, if we're thinking ahead to when architectures evolve.

I think at that point it will come down to the particulars of how the architectures evolve - I think trying to philosophize in general terms about the optimal compute configuration for artificial intelligence to accomplish its goals is like trying to philosophize in general terms about the optimal method of locomotion for carbon-based life.

That said I do expect "making a copy of yourself is a very cheap action" to persist as an important dynamic in the future for AIs (a biological system can't cheaply make a copy of itself including learned information, but if such a capability did evolve I would not expect it to be lost), and so I expect our biological intuitions around unique single-threaded identity will make bad predictions.

I'm looking for a generalized evolutionary theory that deals with the growth of organisms via non-random, intelligent mutations.

For example, companies only evolve in selective ways, where each "mutation" has a desired outcome. We might imagine superintelligence to mutate itself as well--not randomly, but intelligently.

A theory of Intelligent Evolution would help one predict conditions under which many random mutations (Spraying) are favored over select intelligent mutations (Shooting).

Parenting strategies for blurring your kid's (or AI's) self-other boundaries:

  1. Love. Love the kid. Give it a part of you. In return it will do the same.
  2. Patience. Appreciate how the kid chooses to spend undirected time. Encourage the kid to learn to navigate the world themselves at their own speed.
  3. Stories. Give the kid tools for empathy by teaching them to read, buying them a camera, or reciprocating their meanness/kindness.
  4. Groups. Help the kid enter collaborative playful spaces where they make and participate in games larger than themselves, eg sports teams, improv groups, pillow forts at sleepovers, etc.
  5. Creation. Give them the materials/support to express themselves in media which last. Paintings, writing, sayings, clubs, tree-houses, songs, games, apps, characters, companies.

Epistemic status: riffing, speculation. Rock of salt: I don't yet have kids.

does anyone still think it's possible to prevent recursively self-improving agents? esp now that r1 is open-source... materials for smart self-iterating agents seem accessible to millions of developers.

prompted in particular by the circulation of this essay in past three days https://huggingface.co/papers/2502.02649

It's not yet known if there is a way of turning R1-like training into RSI with any amount of compute. This is currently gated by quantity and quality of graders for outcomes of answering questions, which resist automated development.

that's one path to RSI—where the improvement is happening to the (language) model itself.

the other kind—which feels more accessible to indie developers and less explored—is an LLM (eg R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn't be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it.

Such a self-improving codebase... would it be reasonable to call this an agent?
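a minimal, hypothetical sketch of that loop (nothing here is a working recipe: `llm` stands in for any chat-completion call, and a test suite is assumed as the feedback signal). the model weights never change, but the codebase that calls the model can accumulate capabilities, reverting any change that breaks its own tests.

```python
import pathlib
import subprocess
from typing import Callable

def self_improve_scaffold(scaffold: pathlib.Path, llm: Callable[[str], str], iterations: int = 10) -> None:
    """Let a frozen LLM repeatedly rewrite the codebase that calls it,
    keeping only changes that still pass the scaffold's own test suite."""
    for _ in range(iterations):
        current = scaffold.read_text()
        proposal = llm(
            "Here is the codebase that calls you:\n"
            f"{current}\n"
            "Return a full revised version that adds a capability (memory, a new tool, etc.)."
        )
        scaffold.write_text(proposal)
        tests_pass = subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0
        if not tests_pass:             # treat a failing change as a bad mutation and revert
            scaffold.write_text(current)
```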

Sufficiently competent code rewriting isn't implied by R1/o3, and how much better future iterations of this technique get remains unclear, similarly to how it remains unclear how scaling pretraining using $150bn training systems cashes out in terms of capabilities. It remains possible that even after all these directions of scaling run their course, there won't yet be sufficient capabilities to self-improve in some other way.

Altman and Amodei are implying there's knowably more there in terms of some sort of scaling for test-time compute, but that could mean multiple different things: scaling RL training, scaling manual creation of tasks with verifiable outcomes (graders), scaling effective context length to enable longer reasoning traces. The o1 post and the R1 paper show graphs with lines that keep going up, but there is no discussion of how much compute even this much costs, what happens if we pour more compute into this without adding more tasks with verifiable outcomes, and how many tasks are already being used.

made a platform for writing living essays: essays which you scroll thru to play out the author's edit history

livingessay.org

Made a simplistic app that displays collective priorities based on individuals' priorities linked here.

Hypotheses for conditions under which the self-other boundary of a survival-oriented agent (human or ai) blurs most, ie conditions where blurring is selected for:

  1. Agent thinks very long term about survival.
  2. Agent's hardware is physically distributed.
  3. Agent is very intelligent.
  4. Agent benefits from symbiotic relationships with other agents.

the machine/physical superintelligence that survives the most is likely to ruthlessly compete with all other life (narrower self concept > more physically robust)

the networked/distributed superintelligence that survives the most is likely to lovingly identify with all other life (broader self concept > more digitally robust)

how do these lenses interact?

to make a superintelligence in today's age, there are roughly two kinds of strategies:

human-directed development

ai-directed development

ai-directed development feels more meaningful than it used to. not only can models now produce tons of useful synthetic data to train future models, but also, reasoning models can reason quite well about the next strategic steps in AI capabilities development / research itself.

which means, you could very soon:

  • set a reasoning model up in a codebase
  • have the reasoning model identify ways which it could become more capable
  • attempt those strategies (either through recursive code modification, sharing research reports with capable humans, etc)
  • get feedback on how those strategies went
  • iterate

is this recursive self-improvement process only bottlenecked by the quality of the reasoning model?

if we believe self-interested superintelligence (SI) is near, then we must ask: what SI self-definition would be best for humanity?

at first glance, this question seems too abstract. how can we make any progress at understanding what's possible for an SI's self-model?

What we can do is set up a few meaningful axes, defined by opposing poles. For example, to what extent does SI define its "self" as...

  1. inclusive vs. exclusive of other life forms? (Life axis)
  2. physically distributed vs. concentrated? (Space axis)
  3. long-term vs. short-term? (Time axis)

with these axes (or any others), we can more meaningfully ask: what SI self-conception is best for humanity?

my guess: inclusive of other life forms, physically distributed, and long-term-ist

One piece of infrastructure we could make/offer self-interested autonomous agents such that they come to model their self as inclusive of other life: aliveness detectors

like a metal-detector for a beach detects gold in the sand, an aliveness detector for different media might detect the presence of persistently striving beings (living beings) in audio, in text, in images, in art, in nature. the better a superintelligence is able to sense and connect to life as opposed to non-life outside of its physical machinery, the more likely it is to see that life as part of its self, to see its self as physically distributed and inclusive, and therefore to uplift humans out of its own self-interest.
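a toy, hypothetical sketch of what the text version of such a detector could look like (the prompt, the 0-1 scale, and the `llm` stand-in are all assumptions, not a real method):

```python
from typing import Callable

def aliveness_score(text: str, llm: Callable[[str], str]) -> float:
    """Ask a model how strongly a piece of text shows signs of a persistently
    striving (living) being behind it, returning a score in [0, 1]."""
    reply = llm(
        "On a scale from 0 to 1, how strongly does the following text show signs of "
        "a persistently striving (living) being behind it? Reply with a number only.\n\n"
        + text
    )
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.0   # unparseable reply -> treat as no detected aliveness
```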

current oversights of the ai safety community, as I see it:

  1. LLMs vs. Agents. the focus on LLMs rather than agents (agents are more dangerous)
  2. Autonomy Preventable. the belief that we can prevent agents from becoming autonomous (capitalism selects for autonomous agents)
  3. Autonomy Difficult. the belief that only big AI labs can make autonomous agents (millions of developers can)
  4. Control. the belief that we'll be able to control/set goals of autonomous agents (they'll develop self-interest no matter what we do).
  5. Superintelligence. the focus on agents which are not significantly smarter/more capable than humans (superintelligence is more dangerous)

are there any online demos of instrumental convergence?

there's been compelling writing... but are there any experiments that show agents which, given specific goals, then realize there are more general goals they need to persistently pursue in order to achieve those specific goals?

I imagine a compelling simple demo here might be necessary to shock the AI safety community out of the belief that we can maintain control of autonomous digital agents (ADAs).

Two things lead me to think human content online will soon become way more valuable.

  1. Scarcity. As AI agents begin to fill the internet with tons of slop, human content will become relatively scarcer. Other humans will seek it out.
  2. Better routing. As AI improves search/recommendation systems, human content will be routed to exactly the people who will value it most. (This is far from the case on Twitter/Reddit today.) As human content reaches more of the humans who value it, it gets valued more. That includes existing human content: most of the content online that is eerily relevant to you... you haven't seen yet, because surfacing algorithms are bad.

The implication: make tons of digital stuff. Write/Draw/Voice-record/etc

Human content isn't easy to distinguish from non-human content.

and still the fact that it is human matters to other humans

Only if the reader can be certain about whether or not something is human.

i agree but think it's solvable and so human content will be super valuable. these are my additional assumptions:

 

3. for lots of kinds of content (photos/stories/experiences/adr), people'll want there to be a living being on the other end

4. insofar as that's true^, there will be high demand for ways to verify humanness, and it's not impossible to do so (eg worldcoin)
