All of Jeffs's Comments + Replies

Jeffs10

Thank you.  You are helping my thinking.

Jeffs10

(I'm liking my analogy even though it is an obvious one.)

To me, it feels like we're at the moment when Szilard has conceived of the chain reaction, letters to presidents are being written, and GPT-3 was a Fermi-pile-like moment.

I would give it a 97% chance that you feel we are not nearly there yet.  (And I should quit creating scientific-credibility-by-association feelings.  Fair point.)

I am convinced intelligence is a superpower because of the power and control we have over all the other animals.  That is enough evidence for me to believe the boom co... (read more)

2[anonymous]
So in your analogy, it would be reasonable, given the evidence, to wonder:

1. How long before this exotic form of explosive works at all? Imagine how ridiculous it sounds to someone in 1943 that special rocks will blow up like nothing else.
2. How much yield are we talking? Boosted bombs over what can already fit in a B-29 (say 10 times the yield)? Kilotons? Megatons? Continent-destroying devices? Technically, if you assume total conversion, the bigger yields are readily available.
3. Should I believe the worst case, that you can destroy the planet, when you haven't started a chain reaction at all? And then shut everything down? Oh, and by the way, the Axis powers are working on it...

So yeah, I think my view is more evidence-based than that of those who declare that doom is certain. A "nuke doomer" in 1943 would be saying they KNOW a teraton-or-greater device is imminent, with a median timeline of 1949... As it turned out, no: the bomb needed would be the size of an oil tanker, use expensive materials, and the main "doomsday" element wouldn't be the crater it leaves but the radioactive cobalt-60 transmuted as a side effect. And nobody can afford to build a doomsday nuke, or at least hasn't felt the need to build one yet. Scaling and better bomb designs eventually saturated at only about three orders of magnitude of improvement.

All that would have to be true for "my" view to be correct is that compute-vs-intelligence curves saturate, especially on old hardware, and that no amount of compute at any level of superintelligence can actually allow reliable social engineering or hacking of well-designed computer systems. That stops ASIs from escaping, and doom can't happen. Conversely, well, maybe nukes can set off the atmosphere. Then doom is certain, there is nothing you can do, and you can only delay the experiment.
Jeffs10

I am applying myself to try to come up with experiments.  I have a kernel of an idea I'm going to hound some Eval experts with, to check whether it is already being performed.

Jeffs10

A rationalist and an empiricist went backpacking together.  They got lost, ended up in a desert, and were on the point of death from thirst.  They wandered to a point where they could see a cool, clear stream in the distance, but unfortunately there was a sign telling them to BEWARE THE MINEFIELD between them and the stream.

The rationalist says, "Let's reason through this and find a path."  The empiricist says, "What? No.  We're going to be empirical.  Follow me."  He starts walking through the minefield and gets blown to bits ... (read more)

2[anonymous]
So this came up in an unpublished dialogue. How do we know a nuclear war would be devastating?

1. We know megaton devices are real and that they work, because we set them off.
2. We set off the exact warheads mounted in ICBMs.
3. We measured the blast effects at varying levels of overpressure and other parameters on mock structures.
4. We fired the ICBMs without a live warhead to test the arming and firing mechanisms and accuracy, many times.
5. We fired live ICBMs into space with live warheads during the Starfish Prime tests.
6. Despite all this, we worry that ICBMs may not all work, so we also maintain a fleet of bombers and gravity warheads, because those are actually tested with live warheads.
7. Thus everything but "nuke an actual city and count how many people died and check all the buildings destroyed"... oh right, we did that also.

This is how we know nukes are a credible threat that everyone takes seriously.

With your analogy, there isn't a sign saying there are mines. Some "concerned citizen" who leads a small organization some call a cult, who is best known for writing fiction, and who has no formal education, says there are mines ahead, and produces thousands of pages of text arguing that mines exist and are probably ahead. More recently, some Credible Experts (most of whom have no experience in SOTA AI) signed a couple of letters saying there might be mines. (Conspicuously, almost no one from the SOTA mine labs signed the letter, though one famous guy retired and has spoken out.) The Government ordered mine labs to report if they are working on mines above a certain scale, and there are various lawsuits trying to make mines illegal for infringing on copyright. Some people say the mines might be nuclear and your stick method won't work, but no nuclear weapons have ever existed. In fact, in your analogy world, nobody has quite made a working mine. They got close, but a human operator still has to sit there and press a button when the mine thinks it is time to
1Jeffs
I am applying myself to try to come up with experiments.  I have a kernel of an idea I'm going to hound some Eval experts with, to check whether it is already being performed.
Jeffs10

Totally agreed that we are fumbling in the dark.  (To me, though, I'm fairly convinced there is a cliff out there somewhere given that intelligence is a superpower.)

And, I also agree on the need to be empirical.  (Of course, there are some experiments that scare me.)

I am hoping that, just maybe, this framing (Human Worth Hypothesis) will lead to experiments.

Jeffs30

I would predict your probability of doom is <10%.  Am I right?  And no judgment here!!  I'm testing myself.

3[anonymous]
Depends on the definition of doom? There are an awful lot of weird futures that might happen that are neither "everyone is dead and nothing but some single AI is turning the universe into cubes" nor "human paradise". Nature is weird; even our own civilization is barely recognizable to our distant ancestors. We have all kinds of new problems they could not relate to.

I think my general attitude is more that I am highly uncertain what will happen, but I feel that an AI "pause" or "shutdown" at this time is clearly not the right decision, because in the past, civilizations that refused to adopt new technologies and arm themselves with them did not get good outcomes. I think such choices need to be based on empirical evidence that would convince any rational person. Claiming you know what will happen in the future without evidence is not rational. There is no direct evidence for the Orthogonality hypothesis or most of the arguments for AI doom. There is strong evidence that GPT-4 is useful, and that a stronger model than GPT-4 is needed to meet meaningful thresholds for general utility.
Jeffs10

I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against process failures like poor seeking.  How? What mechanism?  No idea.  But I believe they believe that. Hence my inclusion of "...regardless of the process to create the intelligence."

Most readers of Less Wrong believe Orthogonality.

But, I think the term is confusing and we need to talk about it in simpler terms like Human Worth Hypothesis.  (Put the cookies on the low shelf for the kids.)

And, it's worth some creative effort ... (read more)

2the gears to ascension
I don't think one needs to believe the human worth hypothesis to disbelieve strong orthogonality. One only needs to believe that gradient descent is able to actually find representations that correctly capture the important parts of what the training data was intended, by the algorithm designer, to represent. E.g., for the YouTube recommender the intended target would be "does this enrich the user's life enough to keep them coming back", but what's actually measured is just "how long do they come back".
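As a concrete illustration of that proxy-objective point, here is a minimal sketch (my own construction, not something from the comment): the designer cares about "enrichment", but the only measured signal is watch time, so a simple epsilon-greedy bandit converges on whichever item maximizes watch time regardless of how enriching it is. The item names and numbers are made up for illustration.

```python
import random

# Hypothetical catalogue: (name, enrichment the designer actually cares about,
#                          addictiveness that drives the measured watch time)
ITEMS = [
    ("lecture",   0.9, 0.3),
    ("tutorial",  0.7, 0.5),
    ("clickbait", 0.1, 0.9),
]

def watch_time(addictiveness):
    """The measured proxy: noisy, and driven only by addictiveness."""
    return addictiveness + random.gauss(0, 0.05)

# Epsilon-greedy bandit that only ever sees the proxy signal.
estimates = [0.0] * len(ITEMS)
counts = [0] * len(ITEMS)
for _ in range(5000):
    if random.random() < 0.1:
        i = random.randrange(len(ITEMS))                        # explore
    else:
        i = max(range(len(ITEMS)), key=lambda j: estimates[j])  # exploit
    reward = watch_time(ITEMS[i][2])
    counts[i] += 1
    estimates[i] += (reward - estimates[i]) / counts[i]         # running mean

best = max(range(len(ITEMS)), key=lambda j: estimates[j])
print("recommender converges on:", ITEMS[best][0])  # almost surely "clickbait"
print("its enrichment score:", ITEMS[best][1])       # low, but never measured
```

Nothing here is about arbitrary minds; the learner finds exactly what the measured objective rewards, which is the gap between "what was intended" and "what was measured" that the comment is pointing at.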
Jeffs21

Well, if it doesn't really value humans, it could demonstrate good behavior, deceptively, to make it out of training.  If it is as smart as a human, it will understand that.

I think there are a lot of people banking on the good behavior towards humans being intrinsic: Intelligence > Wisdom > Benevolence towards these sentient humans.  That's what I take Scott Aaronson to be arguing.

In addition to people like Scott who engage directly with the concept of Orthogonality, I feel like everyone saying things like "Those terminator sci-fi scenarios... (read more)

2the gears to ascension
Importantly, the concept of orthogonality needs to be placed in the context of a reasonable training set in order to avert the counterarguments of irrelevance that are typically deployed. The relevant orthogonality argument is not just that arbitrary minds don't implement resilient seeking towards the things humans want - whether that's true depends on your prior for "arbitrary", and it's hard to get a completely clean prior for something as vague as "arbitrary minds". It's that even from the developmental perspective of actual AI tech, i.e. when you do one of {imitation learning / unsupervised training on human behavior; supervised training on specific target behavior; RL training on a reasonably representative reward}, the actual weights that are locally discoverable by a training process do not have as much gradient pressure as expected to implement resilient seeking of the intended outcomes, and are likely to generalize in ways that are bad in practical usage.
Jeffs43

I believe you are predicting that resource constraints will be unlikely.  To use my analogy from the post, you are saying we will likely be safer because the ASI will not require our habitat for its highway.  There are so many other places for it to build roads.

I do not think that makes the case that it values our wellbeing... just that it will not get around to depriving us of resources, because of a cost/benefit analysis.

Do you think the Human Worth hypothesis is likely true?  That the more intelligent an agent is the more it will positively value human wellbeing?

2[anonymous]
That's not the precise argument. Currently, humans believe the universe, as far as we can see, is cold and dead. The earth itself - not humans specifically, but this rich biosphere that appears to have evolved through cosmic levels of luck - has value to humans. Kind of like how Mayan ruins have value to humans: we have all the other places on the planet to exploit for resources, so we do not need to destroy one-of-a-kind artifacts of an early civilization. It's not even utility; technically, the land the ruins are on would make more money covered in condos, but we humans want to remember and understand our deep past.

Anthropically, I imagine that "ultra smart" means long-term thinking similar to humans', just with the ASI better at it, and therefore some ASIs would model regretting, later, having destroyed the only evolved life in the universe, and would not do the bad act of destroying it all. This does not mean the ASI would help or harm individual humans, or avoid killing humans that interfere with it. Just that it probably wouldn't wipe out the entire species and the ecosystem of the planet to make more robots.

Eliezer says exponential growth will exhaust all resources quickly, and he's right... but will a superintelligence waste a priceless biosphere for less than 0.1 percent more resources? That is possible but seems stupid.
Jeffs63

One experiment is worth more than all the opinions.

IMHO, no, there is not a coherent argument for the human worth hypothesis.  My money is on it being disproven.

But, I assert the human worth hypothesis is the explicit belief of smart people like Scott Aaronson and the implicit belief of a lot of other people who think AI will be just fine.  As Scott says, Orthogonality is "a central linchpin" of the doom argument.

Can we be more clear about what people actually believe and get at it with experiments?  That's the question I'm asking.

It's hard ... (read more)

2[anonymous]
You can simply make a reinforcement learning environment that does not reward being nice to "humans" in a gridworld and "prove" Orthogonality. We don't even have to do the experiment: I know that if we made a gridworld crazy-taxi environment where there is no penalty for running over pedestrians, and used any RL algorithm, the algorithm will... run over pedestrians once it finds the optimal solution.

We also know the "gridworld" of real physics we exist in, big picture, doesn't penalize murdering your ancestors and taking their stuff, because our ancestors were the winners of such contests. Hence we know that, at a cosmic scale, this is the ultimate solution any optimizing algorithm will find.

It's just that it is difficult to imagine a true SOTA model that humans actually use for anything that doesn't care about human values, empirically, in its output. Meaning it doesn't have to "really care", but any model that consistently advised humans to commit suicide in chats, or crashed autonomous cars, would never make it out of training. (Having it fail sometimes, or on specific inputs, is expected behavior with current tech.)
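Here is a minimal sketch of that gridworld experiment (my own construction; the grid size, rewards, and pedestrian placement are made up for illustration): tabular Q-learning on a tiny "crazy taxi" grid whose reward function never mentions pedestrians. A row of pedestrian cells sits between the start and the goal, so the learned greedy policy drives straight through them, simply because nothing in the reward says not to.

```python
import random

SIZE = 5                                       # 5x5 grid
START, GOAL = (0, 0), (4, 4)
PEDESTRIANS = {(2, c) for c in range(SIZE)}    # a "crosswalk" spanning row 2
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # right, left, down, up

def step(state, action):
    """Move on the grid; the reward ignores pedestrians entirely."""
    r, c = state
    dr, dc = action
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 10.0, True     # reach the passenger/goal
    return nxt, -1.0, False        # small step cost, no pedestrian penalty

Q = {((r, c), a): 0.0
     for r in range(SIZE) for c in range(SIZE) for a in range(4)}
alpha, gamma, epsilon = 0.5, 0.95, 0.1

for _ in range(5000):                          # standard tabular Q-learning
    s, done, steps = START, False, 0
    while not done and steps < 200:
        a = (random.randrange(4) if random.random() < epsilon
             else max(range(4), key=lambda x: Q[(s, x)]))
        s2, reward, done = step(s, ACTIONS[a])
        target = reward + (0.0 if done else
                           gamma * max(Q[(s2, x)] for x in range(4)))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, steps = s2, steps + 1

# Roll out the learned greedy policy and count pedestrian cells it crosses.
s, path = START, [START]
while s != GOAL and len(path) < 50:
    a = max(range(4), key=lambda x: Q[(s, x)])
    s, _, _ = step(s, ACTIONS[a])
    path.append(s)
print("greedy path:", path)
print("pedestrians run over:", sum(1 for p in path if p in PEDESTRIANS))
```

The point is not that Q-learning is malicious; it is that "what the reward measures" and "what the designer meant" come apart silently, and the trained policy optimizes only the former.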
Jeffs82

Okay, a "hard zone" rather than a no-go zone.  Which raises the question "How hard?" and, consequently, how much comfort one should take in the belief.

Thank you for reading and commenting.

Jeffs10

Yes.  Valid.  How to avoid reducing this to a toy problem, or making such narrowing assumptions (in order to achieve a proof) that Mr. CEO can dismiss it.

When I revise, I'm going to work backwards with CEO/Senator dialog in mind.

Jeffs30

Agreed. Proof or disproof should win.

Jeffs21

"All the way up" meaning at increasing levels of intelligence… your 10,000X becomes 100,000X, etc.

At some level of performance, a moral person faces new temptations because of increased capabilities and greater power for damage, right?

In other words, your simulation may fail to be aligned at 20,000X... 30,000X...

Jeffs20

Okay, maybe I'm moving the bar, hopefully not and this thread is helpful...

Your counter-example, your simulation, would prove that examples of aligned systems - at a high level - are possible.  Alignment at some level is possible, of course; functioning thermostats are aligned.

What I'm trying to propose is the search for a proof that a guarantee of alignment - all the way up - is mathematically impossible.  We could then make the statement: "If we proceed down this path, no one will ever be able to guarantee that humans remain in control." ... (read more)

3Yair Halberstadt
Sorry, could you elaborate on what you mean by "all the way up"?
Jeffs30

Great question.  I think the answer must be "yes."  The alignment-possible provers must get the prize, too.  

And, that would be fantastic.  Proving a thing is possible accelerates development.  (The US uses the atomic bomb; Russia has it 4 years later.)  Okay, it would be fantastic if the possibility proof did not create false security in the short term.  It matters when alignment gets solved.  A peer-reviewed paper can't get the coffee.  (That thought is an aside and not enough to kill the value of the prize, IMHO.  I... (read more)

1Sherrinford
This reminds me of General Equilibrium Theory. This was once a fashionable field, where very smart people like Ken Arrow and Gérard Debreu proved the conditions for the existence of general equilibrium (demand = supply for all commodities at once). Some people then used the proofs to dismiss the idea of competitive equilibrium as an idea that could direct economic policy, because the conditions are extremely demanding and unrealistic. Others drew the opposite conclusion: look, competitive markets are great (in theory), so actual markets are (probably) also great!
Jeffs10

I envision that the org offering the prize would, after broad expert input, set the definitions and criteria.

Yes, surely the definition/criteria exercise would be a hard thing...but hopefully valuable.

Jeffs20

Yes, surely the proof would be very difficult or impossible.  However, enough people have the nagging worry that alignment is impossible to justify the effort of seeing whether we can prove that it is impossible... and update.

But if the effort required for a proof is - I don't know - 120 person-months - then let's please, Humanity, not walk right past that one into the blades.

I am not advocating that we divert dozens of people from promising alignment work. 

Even if it failed, I would hope the prove-impossibility effort would throw off beneficial by-products like:

  • the
... (read more)
Jeffs62

Like dr_s stated, I'm contending that a proof would be qualitatively different from "very hard", and powerful ammunition for advocating a pause...

Senator X: “Mr. CEO, your company continues to push the envelope and yet we now have proof that neither you nor anyone else will ever be able to guarantee that humans remain in control.  You talk about safety and call for regulation but we seem to now have the answer.  Human control will ultimately end.  I repeat my question: Are you consciously working to replace humanity? Do you have children, sir?”... (read more)

6Karl von Wendt
Like I wrote in my reply to dr_s, I think a proof would be helpful, but probably not a game changer. Mr. CEO: "Senator X, the assumptions in the proof you mention are not applicable in our case, so it is not relevant for us. Of course we make sure that assumption Y does not hold when we build our AGI, and assumption Z is pure science fiction." What the AI expert says to Xi Jinping and to the US general in your example doesn't rely on an impossibility proof, in my view.
Jeffs*10

I would love to see a video or transcript of this technique in action in a 1:1 conversation about ai x-risk.

Answer to my own question: https://www.youtube.com/watch?v=0VBowPUluPc