And as discussed above (and more in later posts), even if the researchers start trying in good faith to give their AGI an innate drive for being helpful / docile / whatever, they might find that they don’t know how to do so.
Feel free not to respond if this is answered in later posts, but how relevant is it to your model that current LLMs (which are not brain-like and not AGIs) are helpful and docile in the vast majority of contexts?
Is this evidence that would-be AGI developers do, in fact, know how to make their AGIs helpful and docile? Or is it missing the point?
- The "Singularity" claim assumes general intelligence
I'm not sure exactly how you're using the term "general intelligence", but why does the Singularity assume that? Why can't an "instrumental intelligence" recursively self-improve and seize the universe's available resources in service of its goals?
but on our interpretation the orthogonality thesis says that one cannot consider this
The orthogonality thesis doesn't claim that agents can't consider various propositions. Agents can consider whatever propositions they like, but that doesn't mean they'll be moved by them.
To be more specific, I think this is a bootstrapping issue—I think we need a curiosity drive early in training, but can probably turn it off eventually. Specifically, let’s say there’s an AGI that’s generally knowledgeable about the world and itself, and capable of getting things done, and right now it’s trying to invent a better solar cell. I claim it probably doesn’t need to feel an innate curiosity drive. Instead it may seek new information, and seek surprises, as if it were innately curious, because it has learned through experience that seeking those things tends to be an effective strategy for inventing a better solar cell. In other words, something like curiosity can be motivating as a means to an end, even if it’s not motivating as an end in itself—curiosity can be a learned metacognitive heuristic. See instrumental convergence. But that argument does not apply early in training, when the AGI starts from scratch, knowing nothing about the world or itself. Instead, early in training, I think we really need the Steering Subsystem to be holding the Learning Subsystem’s hand, and pointing it in the right directions, if we want AGI.
Presumably another strategy would be to start with an already trained model as the center of our learning subsystem, and a steering subsystem that points to concepts in that trained model?
Something like: you have an LLM-based agent that can take actions in a text-based game. There's some additional reward machinery that magically updates the weights of the LLM (based on simple heuristic evaluations of the text context of the game?). You could presumably(?) instantiate such an agent so that it had some goals out of the gate, instead of needing to reward curiosity?
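To make the shape of that setup concrete, here's a minimal sketch. Everything in it is a hypothetical stand-in: a toy text game, a hand-coded heuristic reward, and a tiny policy network in place of a real LLM. The point is just to show a policy whose weights get nudged directly by heuristic evaluations of the game text, so the agent has goals from the start rather than a curiosity drive.

```python
import torch
import torch.nn as nn

ACTIONS = ["go north", "go south", "take lamp", "open door"]
VOCAB = ["dark", "room", "lamp", "door", "light", "meadow", "holding"]

class TinyPolicy(nn.Module):
    """Stand-in for a pretrained LLM: maps a bag-of-words game state to action logits."""
    def __init__(self, vocab_size: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vocab_size, 32), nn.ReLU(), nn.Linear(32, n_actions))

    def forward(self, state_vec: torch.Tensor) -> torch.Tensor:
        return self.net(state_vec)

def encode(state_text: str) -> torch.Tensor:
    # crude bag-of-words encoding of the game's text output
    return torch.tensor([[float(w in state_text) for w in VOCAB]])

def game_step(state_text: str, action: str) -> str:
    # toy deterministic "text game" dynamics
    if action == "take lamp" and "lamp" in state_text and "holding" not in state_text:
        return "you are holding the lamp; the room is full of light"
    if action == "open door" and "light" in state_text:
        return "the door opens onto a sunny meadow"
    return "you are in a dark room with a lamp and a door"

def heuristic_reward(state_text: str) -> float:
    # loosely analogous to a "steering subsystem": a hard-coded evaluation of
    # the game text, giving the agent goals out of the gate instead of curiosity
    if "meadow" in state_text:
        return 1.0
    if "light" in state_text:
        return 0.2
    return 0.0

policy = TinyPolicy(len(VOCAB), len(ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(200):
    state = "you are in a dark room with a lamp and a door"
    log_probs, rewards = [], []
    for _ in range(3):  # short episodes
        dist = torch.distributions.Categorical(logits=policy(encode(state)))
        action_idx = dist.sample()
        log_probs.append(dist.log_prob(action_idx))
        state = game_step(state, ACTIONS[action_idx.item()])
        rewards.append(heuristic_reward(state))
    # crude REINFORCE update: the heuristic reward directly reshapes the policy's weights
    loss = -torch.stack(log_probs).sum() * sum(rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

(With a real pretrained LLM the update step would presumably be something like PPO-based RL fine-tuning rather than this bare REINFORCE loop, but the shape is the same: a hard-coded evaluation of the text becomes the reward signal that updates the weights.)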
Perhaps this already strays too far from the human-setup to count as "brain-like."
Trying to solve philosophical problems like these on a deadline with intent to deploy them into AI is not a good plan, especially if you're planning to deploy it even if it's still highly controversial (i.e., a majority of professional philosophers think you are wrong).
If the majority of professional philosophers do endorse your metaethics, how seriously should you take that?
And conversely, do you think it's implausible that you could have correctly reasoned your way to the correct metaethics, as validated by a narrower community of philosophers, but not yet have convinced everyone in the field?
The Sequences often emphasize that most people in the world believe in god, so if you're interested in figuring out the truth, you gotta be comfortable confidently disclaiming widely held beliefs. What do you say to the person who assesses that academic philosophy is a broken field, with incentives warped enough to prevent intellectual progress, and who therefore thinks they should discard the opinion of the whole thing?
Do you just claim that they're wrong about that, on the object level, and that this hypothetical person should have more respect for the views of philosophers?
(That said, I'll observe that there's an important in-practice asymmetry between "almost everyone is wrong in their belief of X, and I'm confident about that" and "I've independently reasoned my way to Y, and I'm very confident of it." Other people are wrong != I am right.)
Did you mean to write “build a Task AI to perform a pivotal act in service of reducing x-risks”? Or did MIRI switch from one to the other at some point early on? I don’t know the history. …But it doesn’t matter, my comment applies to both.
I believe that there was an intentional switch, around 2016 (though I'm not confident in the date), from aiming to design a Friendly CEV-optimizing sovereign AI, to aiming to design a corrigible minimal-Science-And-Engineering-AI to stabilize the world (after which a team of probably-uploads could solve the full version of Friendliness and kick off a foom).
How much was this MIRI's primary plan? Maybe it was 12 years ago before I interfaced with MIRI?
Reposting this comment of mine from a few years ago, which seems germane to this discussion, but certainly doesn't contradict the claim that this hasn't been their plan in the past 12 years.
Here is a video of Eliezer, first hosted on Vimeo in 2011. I don't know when it was recorded.
[Anyone know if there's a way to embed the video in the comment, so people don't have to click out to watch it?]
He states explicitly:
As a research fellow of the Singularity Institute, I'm supposed to first figure out how to build a friendly AI, and then, once I've done that, go and actually build one.
And later in the video he says:
The Singularity Institute was founded on the theory that in order to get a friendly artificial intelligence someone's got to build one. So there. We're just going to have an organization whose mission is 'build a friendly AI'. That's us. There's like various other things that we're also concerned with, like trying to get more eyes and more attention focused on the problem, trying to encourage people to do work in this area. But at the core, the reasoning is: "Someone has to do it. 'Someone' is us."
None of these advancements have direct impacts on most people's day-to-day lives.
In contrast, the difference between "I've heard of cars, but they're playthings for the rich" and "my family owns a car" is transformative for individuals and societies.
(I see that you offered the second as an example to Tsvi.)
Well, to be fair, I care a lot about whether a cortex by itself is safe, specifically because, if so, the plan maybe should be to build a cortex (approximately) by itself, directed by control systems very different from those of biological brains, such as text prompts.