Jim Buhler

www.jimbuhler.site

Comments

> Without an objective standard of “winning” to turn to, this leaves us searching for new principles that could guide us in the face of indeterminacy. But that’s all for another post.

First time ever I am left hanging by a LW post. Genuinely.

Thanks! I guess myopia is a specific example of one form of scope-insensitivity (which has to do with long-term thinking, according to this at least), yes.

> This is plausibly a beneficial alignment property, but like every plausibly beneficial alignment property, we don't yet know how to instill them in a system via ML training.

I didn't follow discussions around myopia and didn't have this context (e.g., I thought maybe people didn't find myopia promising at all to begin with or something) so thanks a lot. That's very helpful.

Answer by Jim Buhler

GPT-4's answer fwiw (with the extra prompt "please, respond as if you were the most brilliant longtermist AI safety researcher"):


When considering the development of artificial general intelligence (AGI) and the implications for humanity, a key concern is ensuring that AGIs act in ways that are beneficial rather than harmful. One proposal to address this concern is to make AGIs scope-insensitive. To analyze whether this approach could make AGIs less likely to harm humanity, we'll first define and then discuss the implications of scope-insensitivity within AGI goals.

Scope-Insensitivity and AGI

Scope-insensitivity in this context refers to goals that do not excessively amplify their aims with scale. For instance, a scope-insensitive AGI designed to reduce suffering might not distinguish much between preventing a minor inconvenience for millions of people and alleviating a major disaster for a few people. These AGIs would have a form of goal-setting where small-scale improvements and large-scale interventions do not differ significantly in perceived value, avoiding the classic trap of prioritizing extreme versions of their objectives at the cost of human well-being.

How Scope-Insensitivity Could Disincentivize Harm

  1. Stability in Goal Pursuit: Scope-insensitive AGIs are less likely to pursue extreme actions to achieve their goals. For example:
    • An AGI instructed to "reduce suffering" might improve local conditions incrementally rather than attempting a global overhaul that could incapacitate human structures.
    • This averts scenarios where AGIs conclude that radical interventions (like controlling or incapacitating humans) are necessary for achieving massive-scale benefits.
  2. Reduced Singular Focus on Extreme Outcomes:
    • Scope-insensitive AGIs would distribute their efforts more evenly rather than focusing singularly on high-leverage interventions which might involve control or coercion of humans.
    • They would value smaller, manageable improvements, which inherently reduces the likelihood of executing grand, disruptive plans that could incapacitate human society.
  3. Compatibility with Human Incremental Evolution:
    • Humans generally make progress through incremental changes. Scope-insensitive AGIs, focusing on steady, small-scale improvements, would align more closely with the human method of development.
    • This harmony reduces friction and conflict, promoting cooperation over coercion or incapacitation.
  4. Embedded Humble Goals:
    • Training AGIs with humble, scope-insensitive goals inherently de-emphasizes grandiosity.
    • It promotes modest, beneficial interventions over transformative ambitions that can disrupt human autonomy and infrastructure.

Potential Counterpoints

  1. Risk of Under-Attenuation:
    • If too scope-insensitive, AGIs might lack the drive to address large-scale issues that genuinely need extensive efforts, potentially making them less effective in critical situations.
  2. Complexity of True Scope-Insensitivity:
    • Defining and implementing scope-insensitive goals might be complex, especially ensuring they do not covertly evolve into over-reaching plans.
  3. Alignment Challenges:
    • While scope-insensitivity might curb extreme actions, it doesn't inherently solve alignment. An AGI's methods to achieve modest goals still need alignment to avoid unintended harm.

Conclusion

Training AGIs to have scope-insensitive or modest goals could significantly reduce the likelihood of them incapacitating humanity, as their approach would inherently avoid extreme, radical solutions. Instead, these AGIs would favor incremental and potentially more cooperative strategies. Nonetheless, comprehensive safeguards and iterative monitoring are essential to ensure these systems remain aligned and truly beneficial in all scales of their actions. Integrating scope-insensitivity with broader alignment frameworks might offer a robust path forward in AGI safety research, ensuring that AGIs act in ways that uphold and enhance human values and well-being.

Interesting points, thanks!

> The problem is that the public correctly perceives that economic growth and technological progress make the average life better, so it is hard to get political support for any measures to slow them down.

I mean, if we think these things are actually bad overall (which I'm not convinced of but maybe), we could at least avoid doing things that directly or indirectly promote or generate more economic growth for example. There are some very low-hanging fruits.

Thanks, that's fair! Such a threshold exists if and only if you assume:
- non-zero AI research (which is the scenario we're interested in here I guess), 
- technological progress correlates with AI progress (which as you say is not guaranteed but that still seems very likely to me),
- maybe a few other crucial things I implicitly assume without realizing.

I don't know and this is outside the scope of this post I guess. There are a few organizations like the Center on Long-Term Risk studying cooperation and conflict between ASIs, however.

Interesting, thanks! This is relevant to question #2 in the post! Not sure everyone should act as if they were the first considering the downsides of interciv conflicts, but yeah, that's a good point.

Oh nice, thanks for this! I think I now see much more clearly why we're both confused about what the other thinks. 

> Say Alice has epistemic algorithm A with inputs x that gives rise to beliefs b and Bob has a completely different algorithm A' with completely different inputs x' that happens to give rise to beliefs b as well. Alice and Bob both use decision algorithm D to make decisions. Part of b is the belief that Alice and Bob have the same beliefs and the same decision algorithm. It seems that Alice and Bob should cooperate.

(I'll respond using my definitions/framing which you don't share, so you might find this confusing, but hopefully, you'll understand what I mean and agree although you would frame/explain things very differently.)

Say Bob is CooperateBot. Alice may believe she's decision-entangled with him, in which case she (subjectively) should cooperate, but that doesn't mean that their decisions are logically dependent (i.e., that her belief is warranted). If Alice changes her decision and defects, Bob's decision remains the same. So unless Alice is also a CooperateBot, her belief b ("my decision and Bob's are logically dependent / entangled such that I must cooperate") is wrong. There is no decision-entanglement, just "coincidental" mutual cooperation. You can still argue that Alice should cooperate given that she believes b, of course, but b is false. If only she could realize that, she would stop naively cooperating and get a higher payoff.
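To make the CooperateBot point concrete, here is a minimal sketch. It rests on a simplifying assumption of mine: Bob is modelled as a function of Alice's action, which is only a stand-in for logical dependence. The payoff numbers and the names `mirror_bot`, `is_entangled`, and `alice_best_action` are all hypothetical and chosen for illustration.

```python
from typing import Callable

# Illustrative one-shot Prisoner's Dilemma payoffs for Alice,
# indexed by (Alice's action, Bob's action). The values are assumptions.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def cooperate_bot(alice_action: str) -> str:
    """Bob as CooperateBot: his output ignores Alice's decision."""
    return "C"


def mirror_bot(alice_action: str) -> str:
    """Hypothetical Bob whose output does depend on Alice's decision."""
    return alice_action


def is_entangled(bob: Callable[[str], str]) -> bool:
    """Counterfactual test: does switching Alice's action switch Bob's?"""
    return bob("C") != bob("D")


def alice_best_action(bob: Callable[[str], str]) -> str:
    """Alice's payoff-maximizing action given how Bob actually responds."""
    return max(("C", "D"), key=lambda a: PAYOFF[(a, bob(a))])


for bob in (cooperate_bot, mirror_bot):
    print(bob.__name__, is_entangled(bob), alice_best_action(bob))
```

Against `cooperate_bot`, mutual cooperation can be observed while the counterfactual test fails, and defecting pays more for Alice. Against `mirror_bot`, the dependence is real and cooperating is her best action.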

> So it seems that the whole A,x,A',x' stuff just doesn't matter for what they should do. It only matters what their beliefs are.

It matters what their beliefs are to know what they will do, but two agents believing their decisions are logically dependent doesn't magically create logical dependency.  

If I play a one-shot PD against you and we both believe we should cooperate, that doesn't mean that we necessarily both defect in a counterfactual scenario where one of us believes they should defect (i.e., that doesn't mean there is decision-entanglement / logical dependency, i.e., that doesn't mean that our belief that we should cooperate is warranted, i.e., that doesn't mean that we're not two suckers cooperating for wrong reasons while we could be exploiting the other and avoid being exploited). And whether we necessarily both defect in a counterfactual scenario where one of us believes they should defect (i.e., whether we are decision-entangled) depends on how we came to our beliefs that our decisions are logically dependent and that we must cooperate (as illustrated -- in a certain way -- in my above figures).

> (Of course, you need to have some requirement to the extent that Alice can't modify her beliefs in such a way that she defects but that she doesn't (non-causally) make it much more likely that Bob also defects. But I view this as an assumption about decision-theoretic not epistemic entanglement: I don't see why an epistemic algorithm (in the usual sense of the word) would make such self-modifications.)

After reading that, I'm really starting to think that we (at least mostly) agree but that we just use incompatible framings/definitions to explain things. 

Fwiw, while I see how my framing can seem unnecessarily confusing, I think yours is usually used/interpreted oversimplistically (by you but also and especially by others) and is therefore extremely conducive to motte-and-bailey fallacies[1], leading us to widely underestimate the fragility of decision-entanglement. I might be confused though, of course.

Thanks a lot for your comment! I think I understand you much better now and it helped me reclarify things in my mind. :)

[1] E.g., it's easy to argue that widely different agents may converge on the exact same DT, but not if you include intricacies like the one in your last paragraph.
