LESSWRONG
LW

All of Kerrigan's Comments + Replies

SIA > SSA, part 1: Learning from the fact that you exist

Under an Occam prior the laws already lean simple. SSA leaves that tilt unchanged, whereas SIA multiplies each world’s weight by the total number of observers in the reference class. That means SSA, relative to SIA, favors worlds that stay simple, while SIA boosts those that are populous once the simplicity penalty is paid. Given that, can we update our credence in SSA vs. SIA by looking at how simple our universe’s laws appear and how many observers it seems to contain?

why assume AGIs will optimize for fixed goals?

Kerrigan14d10

Is this trivializing the concept of a Utility Function?

No Universally Compelling Arguments

Kerrigan14d10

This post was from a long time ago. I think it is important to reconsider everything written, after developments in machine learning.

We Don't Have a Utility Function

Kerrigan14d10

How are humans exploitable, given that they don't have utility functions?

Coherent decisions imply consistent utilities

Kerrigan18d10

Since humans are not EU maximizers and are exploitable, can someone give an example of how they are exploitable?

What do coherence arguments actually prove about agentic behavior?

Kerrigan18d30

Is exploitability necessarily unstable? Could there be a tolerable level of exploitability, especially if it allows for tradeoffs with desirable characteristics that are only available to non-EU maximizers?"

2Nathan Helm-Burger18d

Also, both human and deep neural net agents are somewhat stochastic, so they may be randomly intermittently exploitable.

Clarifying Power-Seeking and Instrumental Convergence

Kerrigan24d10

Why is this not true for most humans? Many religious people would not want to modify the lightcone as they think that it's God's territory to modify.

why assume AGIs will optimize for fixed goals?

Kerrigan24d10

The initial distribution of values need not be highly related to the resultant values after moral philosophy and philosophical self-reflection. Optimizing hedonistic utilitariansm, for example, looks very little like any values from the outer optimization loop of natural selection.

Coherent decisions imply consistent utilities

Kerrigan24d10

Although there would be pressure for an AI to not be exploitable, wouldn't there also be pressure for adaptability and dynamism? The ability to alter preferences and goals given new environments?

Humans aren't agents - what then for value learning?

Kerrigan1mo10

Why can't the true values live at the level of anatomy and chemistry?

2Charlie Steiner1mo

If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants. On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.

The Anthropic Trilemma

Kerrigan2mo10

Would this be solved if cresting a copy is creating someone functionally the same as you but who is someone else's identity, and not you?

Stupid Questions - April 2023

Kerrigan3mo10

Is there a page similar to this, but for alignment solutions?

2ChristianKl3mo

Not as far as I know, feel free to create one.

The Assassination of Trump's Ear is Evidence for Time-Travel

Kerrigan6mo10

What about from a quantum immortality perspective?

Understanding and avoiding value drift

Kerrigan7mo1-2

Could there not be AI value drift in our favor, from a paperclipper AI to a moral realist AI?

2Martin Randall3mo

Possible but unlikely to occur by accident. Value-space is large. For any arbitrary biological species, most value systems don't optimize in favor of that species.

The alignment stability problem

Kerrigan7mo10

Both quotes are from your above post. Apologies for confusion.

The alignment stability problem

Kerrigan7mo30

“A sufficiently intelligent agent will try to prevent its goals^[1] from changing, at least if it is consequentialist.”

It seems that in humans, smarter people are more able and likely to change their goals. A smart person may change his/her views about how the universe can best be arranged upon reading Nick Bostrom’s book Deep Utopia, for example.

‘I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing... (read more)

2Seth Herd7mo

I think my terminology isn't totally clear. By "goals" in that statement, I mean what we mean by "'values" in humans. The two are used in overlapping and mostly interchangable ways in my writing 1. Humans aren't sufficiently intelligent to be all that internally consistent 2. In many cases of humans changing goals, I'd say they're actually changing subgoals, while their central goal (be happy/satisfied/joyous) remains the same. This may be described as changing goals while keeping the same values. 3. Note 'in the short term' (I think you're quoting Bostrom? The context isn't quite clear). In the long term, with increasing intelligence and self-awareness, I'd expect some of people's goals to change as they become more self-aware and work toward more internal coherence (e.g., many people change their goal of eating delicious food when they realize it conflicts with their more important goal of being happy and living a a long life). Yes, humans may change exactly that way. A friend said he'd gotten divorced after getting a CPAP to solve his sleep apnea: "When we got married, we were both sad and angry people. Now I'm not." But that's because we're pretty random and biology determined.

Understanding and avoiding value drift

Kerrigan7mo120

How do humans, for example, read a philosophy book and update their views about what they value about the world?

1Kerrigan7mo

Could there not be AI value drift in our favor, from a paperclipper AI to a moral realist AI?

Decision theory does not imply that we get to have nice things

Kerrigan7moΩ110

“Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.”

Can someone please why trading partners would lobotomize themselves?

2quetzal_rainbow7mo

Let's suppose that you give in to threats if your opponent is not capable to predict that you do not give in to threats, so they carry the threat anyway. Therefore, other opponents are incentivised to pretend very hard to be such opponent, up to "literally turn themselves into sort of opponent that carries on useless threats".

Stupid Questions - April 2023

Kerrigan1y10

How does inner misalignment lead to paperclips? I understand the comparison of paperclips to ice cream, and that after some threshold of intelligence is reached, then new possibilities can be created that satisfy desires better than anything in the training distribution, but humans want to eat ice cream, not spread the galaxies with it. So why would the AI spread the galaxies with paperclips, instead of create them and
”consume“ them? Please correct any misunderstandings of mine,

2ChristianKl1y

Paperclips are a metaphor for some things but don't really help here. The AIs that are productive need a lot of compute to do so. Spreading to other solar systems means accessing more compute.

Stupid Questions - April 2023

Kerrigan1y10

And a subset might value drift towards optimizing the internal experiences of all conscious minds?

2ChristianKl1y

That's a much more complex goal than wireheading for a digital mind that can self-modify. In any case, those agents that care a lot about getting more power over the world are more likely to get power than agents that don't.

Stupid Questions - April 2023

Kerrigan1y*10

If an AGI achieves consciousness, why would its values not drift towards optimizing its own internal experience, and away from tiling the lightcone with something?

2ChristianKl1y

If some AGI's only care about their internal experience and not affecting the outside world, they are basically wireheading. If a subset of AGI wireheads and some AGIs don't wirehead the AGIs that don't wirehead will have all the power over the world. Wireheaded AGIs are also economically useless so people try to develop AGIs that don't do that.

AGI Safety FAQ / all-dumb-questions-allowed thread

Kerrigan1y40

How can utility be a function of worlds, if an agent doesn‘t have access to the state of the world, but only the sense data?

AGI Safety FAQ / all-dumb-questions-allowed thread

Kerrigan1y10

How can utility be a function of worlds, if an agent doesn‘t have access to the state of the world, but only the sense data?

An Orthodox Case Against Utility Functions

Kerrigan1y70

How can utility be a function of worlds, if the agent doesn’t have access to the state of the world, but only the sense data?

2abramdemski1y

The post is making the distinction between seeing preferences as a utility function of worlds (this is the regular old idea of utility functions as random variables) vs seeing preferences as an expectation function on events (the jeffrey-bolker view). Both perspectives hold that an agent can optimize things it does not have direct access to. Agency is optimization at a distance. Optimization that isn't at a distance is selection as opposed to control.