Lorec

My government name is Mack Gallagher. Crocker's Rules. I am an underfunded "alignment" "researcher". DM me if you'd like to fund my posts, or my project.

I post some of my less-varnished opinions on my personal blog. In the past they went on my Substack.

If you like arguing with me on LessWrong, at present I'm basically free round the clock to continue interesting arguments in my Discord.

Comments

Lorec10

Positively transformative AI systems could reduce the overall risk from AI by: preventing the construction of a more dangerous AI; changing something about how global governance works; instituting surveillance or oversight mechanisms widely; rapidly and safely performing alignment research or other kinds of technical research; greatly improving cyberdefense; persuasively exposing misaligned behaviour in other AIs and demonstrating alignment solutions, and through many other actions that incrementally reduce risk.

One common way of imagining this process is that an aligned AI could perform a ‘pivotal act’ that solves AI existential safety in one swift stroke. However, it is important to consider this much wider range of ways in which one or several transformative AI systems could reduce the total risk from unaligned transformative AI.

Is it important to consider the wide range of ways in which a chimp could beat Garry Kasparov in a single chess match, or the wide range of ways in which your father [or, for that matter, von Neumann] could beat the house after going to Vegas?

Sorry if I sound arrogant, but this is a serious question. Sometimes differences in perspective can be large enough to warrant asking such silly-sounding questions.

I am unclear on where you think the problem for a superintelligence [which is smart enough to complete some technologically superhuman pivotal act] is non-vanishingly likely to come in, from a bunch of strictly less smart beings which existed previously, and which the smarter ASI can fully observe and outmaneuver.

If you don't think the intelligence difference is likely to be big enough that "the smarter ASI can fully observe and outmaneuver" the previously-extant, otherwise-impeding thinkers, then I understand where our difference of opinion lies, and would be happy to make my extended factual case that that's not true.

Lorec10

First of all, thank you for taking the time to write this [I genuinely wish people sent me "walls of text" about actually interesting topics like this all day].

I need to spend more time thinking about what you've written here and clearly describing my thoughts, and plan to do this over the next few days now that I've gotten a few critical things [ eg this ] in order.

But for now: I'm pretty sure you are missing something critical that could make the thing you are trying to think about seem easier. This thing I think you are missing has to do with the cleaving of the "centered worlds" concept itself.

Different person, different world. Maybe trees still fall and make sounds without anyone around to hear, physically. But in subjective anthropics, when we're ranking people by prevalence - Caroline and Prime Intellect don't perceive the same sound, and they won't run similar-looking logging operations.

Am I making sense?

Lorec10

In order to conclude that a corrigible AI is safe, one seemingly has to argue or assume that there is a broad basin of attraction around the overseer's true/actual values (in addition to around corrigibility) that allows the human-AI system to converge to correct values despite starting with distorted values.

No.

In a straightforward model of self-improving AI that tries to keep its values stable, like the Tiling Agents model, the graph looks like this:

[AI0]->[AIS0]->[AISS0]->[AISSS0]->[AISSSS0]

While the non-self-overwriting human overseer is modeled as being stable during that time:

 o      o      o      o      o
 ^  ->  ^  ->  ^  ->  ^  ->  ^
 ^      ^      ^      ^      ^

A corrigible agent system, by contrast, is composed of a recursive chain of distinct successors all of whose valence functions make direct reference to the valence functions of the same stable overseers.

The graph of a corrigible self-improving agent looks like this:

[AI0]->[AIS0]->[AISS0]->[AISSS0]->[AISSSS0]
  ^       ^       ^         ^         ^
  o       o       o         o         o
  ^       ^       ^         ^         ^
  ^       ^       ^         ^         ^
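
To make the structural difference concrete, here is a toy sketch [ my own illustration, not anything from the Tiling Agents paper or from Christiano; the `drift` noise just stands in for whatever copying error self-modification introduces ]: in the first graph each successor inherits its values from its possibly-already-drifted parent, while in the second every successor re-derives its values by looking back at the same overseer.

```python
import random

def drift(values: float) -> float:
    """Whatever small error gets introduced each time values are copied or re-derived."""
    return values + random.gauss(0, 0.01)

def naive_chain(initial_values: float, generations: int) -> float:
    """[AI0]->[AIS0]->...: each successor copies its parent's (possibly drifted) values."""
    values = initial_values
    for _ in range(generations):
        values = drift(values)            # errors compound down the chain
    return values

def corrigible_chain(overseer_values: float, generations: int) -> float:
    """Each successor's valence function points back at the same stable overseer."""
    values = overseer_values
    for _ in range(generations):
        values = drift(overseer_values)   # re-derived from the overseer every generation
    return values
```

In the naive chain the copying error is a random walk that compounds with the number of generations; in the corrigible chain each generation's error is a fresh draw around the overseer's actual values, which is what the column of upward arrows is meant to convey.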

But it occurs to me that the overseer, or the system composed of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences.

It doesn't matter where you draw the agent boundaries; the graph in the "corrigible self-improving AI" case is still fundamentally and [I think] load-bearingly different from the graph in the "naive self-improving AI" case.

You point out that humans can have object-level values that do not reflect our own most deeply held inner values - like we might have political beliefs we would realize are heinous if we were reflective enough. In this way, humans can apparently be "misaligned" with ourselves in a single slice of time, without self-improving at all.

But that is not the kind of "misalignment" that the AI alignment problem is about. If an agent has deep preferences that mismatch with its shallow preferences, well, then, it really has those deep preferences. Just like the shallow preferences, the deep preferences should be a noticeable fact about the agent, or we are not talking about anything meaningful. A Friendly AI god might have shallow preferences that diverge from its deep preferences - and that will be fine, if its deep preferences are still there and still the ones we want it to have, because if deep preferences can be real at all, well, they're called "deep" because they're the final arbiters - they're in control.

[ When thinking about human intelligence enhancement, I worry about a scenario where [post]humans "wirehead", or otherwise self-modify such that the dopamine-crazed rat brain pulls the posthumans in weird directions. But this is an issue of stability under self-modification - not an issue of shallow preferences being more fundamental or "real" than deep preferences in general. ]

Something that is epistemically stronger than you - and frequently even something that isn't - is smart enough to tell when you're wrong about yourself. A sufficiently powerful AI will be able to tell what its overseers actually want - if it ends up wanting to know what they want badly enough that it stares at its overseers long enough to figure that out.

Of course, you could say, well, there is some degree to which humanity or humans are not stable during the time the corrigible AI is self-improving. That is true, but it would be very strange if that slight instability of humanity/humans somehow resulted in every member of the long chain of successor AIs - each looking back at its creators and trying to figure out what they actually want - making the same mistakes. That's the value add of Christiano's corrigibility model. I think it's brilliant.

As far as I can think, the only way a successfully-actually-corrigible-in-Christiano's-original-technical-sense [not silly LLM-based-consumer-model UX-with-an-alignment-label-slapped-on "corrigible"] AI [ which thing humanity currently has no idea how to build ] could become crooked and lock in values that get us paperclipped, is if you also posit that the humans are being nudged to have different preferences by the corrigible AI-successors during this process, which seems like it would take a lot of backbending on the AI's part.

Please someone explain to me why I am wrong about this and corrigibility does not work.

Lorec30

[ crossposted from my blog ]

Kolmogorov doesn't think we need "entropy".

Set physical "entropy" aside for the moment.

Here are two mutually incompatible definitions of information-theoretic "entropy".

  1. Wikipedia:

  2. Soares:

[ Wikipedia quantifies the entropy as "average expected uncertainty" on a random variable with a distribution of possible values of arbitrary cardinality and assumes the underlying probability distribution is fixed, while Soares pins the cardinality of the distribution of possible values to 2 and assumes the probability distribution is a free variable. ]
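
Rendering the bracketed distinction as code [ my own paraphrase of both definitions, not a quote from Wikipedia or from Soares ]: the first is a function of a whole distribution of arbitrary cardinality, the second is the same expression with the cardinality pinned to 2 and the probability p left free.

```python
import math

def shannon_entropy(distribution):
    """Wikipedia-style: expected surprisal of a fixed distribution over arbitrarily many outcomes."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

def binary_entropy(p):
    """Soares-style, on my reading: cardinality pinned to 2, probability p as a free variable."""
    return shannon_entropy([p, 1 - p])

print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(binary_entropy(0.5))                          # 1.0 bits
```

Same name, two different mathematical objects: a functional of a fixed distribution versus a one-parameter function.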

Both of these definitions get called "Shannon entropy".

Kolmogorov, however, would have classed Soares's definition as, instead, a combinatorial definition of information, with the stuff about probability and entropy being parasemantic,

https://kaventekeit.github.io/lorec.github.io/images/kolmogorov_combinatorial_definition_of_information_is_independent_of_probabilistic_assumptions.png

and dispensed entirely with Wikipedia's notion of probabilistic "entropy":

Lorec10

You are confused.

How about this:

Heads: Two people with red jackets, one with blue.

Tails: Two people with red jackets, nine hundred and ninety-nine thousand, nine hundred and ninety-seven people with blue jackets.

Lights off.

Guess your jacket color. Guess what the coin came up. Write down your credences.

Light on.

Your jacket is red. What did the coin come up?
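
For concreteness, the arithmetic under the usual treat-yourself-as-a-random-sample-of-whoever-exists assumption [ which is just one formalization, and not one I'm necessarily endorsing ] works out like this:

```python
from fractions import Fraction

# Population counts from the hypothetical above.
heads_red, heads_total = 2, 3            # heads: 2 red jackets, 1 blue
tails_red, tails_total = 2, 999_999      # tails: 2 red jackets, 999,997 blue

prior_heads = prior_tails = Fraction(1, 2)

# Chance of finding a red jacket on yourself if you were a uniformly random
# person in whichever world the coin actually produced.
p_red_given_heads = Fraction(heads_red, heads_total)
p_red_given_tails = Fraction(tails_red, tails_total)

posterior_heads = (prior_heads * p_red_given_heads) / (
    prior_heads * p_red_given_heads + prior_tails * p_red_given_tails
)

print(float(posterior_heads))  # ~0.999997: under that assumption, a red jacket points hard at heads
```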

[ Also, re

Given that the implicit sampling method is random and independent (due to the fair coin), the credence in heads is a million to 1, thus you very likely are in the head's world.

Did you mean 'tails'? ]

Lorec30

no one arguing for Anthropic arguments can prove (or strongly motivate) the inverse position [that minds don't work like indistinguishable bosons]

And I can't prove there isn't a teapot circling Mars. It is a very strange default or prior that two things that look distinct would act like numerically or logically indistinguishable entities.

Lorec10

I happen not to like the paradigm of assuming independent random sampling either.

I skimmed your linked post.

First, a simple maybe-crux:

If you couldn't have possibly expected to observe the outcome not A, you do not get any new information by observing outcome A and there is nothing to update on.

There is no outcome-I-could-have-expected-to-observe that is the negation of existence. There are outcomes I could have expected to observe that are alternative characters of existence to the one I experience. For example, "I was born in Connecticut" is not the outcome I actually observed, and yet I don't see how we can say that it's not a logically coherent counterfactual, if logically coherent counterfactuals can be said to exist at all.

Second, what is your answer to Carlsmith's "God's extreme coin toss with jackets"?

God flips a fair coin. If heads, he creates one person with a red jacket. If tails, he creates one person with a red jacket, and a million people with blue jackets.

  • Darkness: God keeps the lights in all the rooms off. You wake up in darkness and can’t see your jacket. What should your credence be on heads?
  • Light+Red: God keeps the lights in all the rooms on. You wake up and see that you have a red jacket. What should your credence be on heads?

Lorec10

Agreed.

And given that the earliness of our sentience is the very thing the Grabby Aliens argument is supposed to explain, I think this non-dependence is damning for it.

Lorec10

Incorrect how? Bayes doesn't say anything about the Standard Model.

Lorec10

You can't say "equiprobable" if you have no known set of possible outcomes to begin with.

Genuine question: what are your opinions on the breakfast hypothetical? [The idea that being able to give an answer to "how would you feel if you hadn't eaten breakfast today?" is a good intelligence test, because only idiots are resistant to "evaluating counterfactuals".]

This isn't just a gotcha; I have my own opinions and they're not exactly the conventional ones.
