Maybe one scenario in this direction is that a non-superintelligent AI gains access to the internet and then spreads itself to a significant fraction of all computational devices, using them to solve some inconsequential optimization problem. This would aggravate a lot of people (who lose access to their computers) and also demonstrate the potential of AIs to have a significant impact on the real world.

As the post mentions, there is an entire hierarchy of such unwanted AI behavior. The first such phenomena, like reward hacking, are already occurring. The next level (such as an AI creating a copy of itself in anticipation of an operator trying to shut it down) might occur at capability levels below those that threaten an intelligence explosion, but it is unclear whether the general public will see much information about these events. I think it is an important empirical question how wide the window is between the capability levels that produce publicly visible misalignment events and the threshold at which an AI becomes genuinely dangerous.

Surely all pivotal acts that safeguard humanity long into the far future are entirely rational in explanation.

I agree that in hindsight such acts would appear entirely rational and justified, but to avoid being a PR problem, they must also appear justified (or at least acceptable) ahead of time to a member of the general public, a law enforcement official, or a politician.

Can you offer a reason for why a pivotal act would be a PR problem, or why someone would not want to tell people their best idea for such an act and would use the phrase "outside the Overton window" instead?

To give one example: the oft-cited pivotal act of "using nanotechnology to burn all GPUs" is not something you could state as the official goal on your company website. If the public seriously thought that a group of people was pursuing this goal and had any chance of even coming close to achieving it, they would strongly oppose such a plan. To even see why it might be a justified action to take, one needs to understand (and accept) many highly non-intuitive assumptions about intelligence explosions, orthogonality, etc.

More generally, I think many possible pivotal acts will be adversarial to some degree, since they are literally about stopping people from doing or getting something they want (building an AGI, reaping the economic benefits of using an AGI, etc.). There might be strategies for such an act that are inside the Overton window (creating a superhuman propaganda bot that convinces everyone to stop), but any strategy involving anything resembling force (like burning the GPUs) will run counter to established laws and social norms.

So I can absolutely imagine that someone has an idea for a pivotal act which, if posted publicly, could be used in a PR campaign by opponents of AI alignment ("look what crazy and unethical ideas these people are discussing in their forums"). That is why I was asking what forms of discourse would best avoid this danger.

I am not as convinced that there don’t exist pivotal acts that are importantly easier than directly burning all GPUs (after which I might or might not then burn most of the GPUs anyway). There’s no particular reason humans can’t perform dangerous cognition without AGI help and do some pivotal act on their own, our cognition is not exactly safe. But if I did have such an idea that I thought would work I wouldn’t write about it, and it most certainly wouldn’t be in the Overton window. Thus, I do not consider the failure of our public discourse to generate such an act to be especially strong evidence that no such act exists.

Given how central the execution of a pivotal act seems to be to our current best attempt at an alignment strategy (see point 6 of EY's post), I was confused to find very little discussion of possible approaches here in the forum. Does the quote above already fully explain this (because all promising approaches are too far outside the Overton window to discuss publicly)? Or has no one gotten around to initiating such a conversation? Or, quite possibly, have I overlooked extensive discussions in this direction?

It seems to me that a long document listing the 20 most commonly proposed approaches to such a pivotal act, together with an analysis of their strengths and weaknesses and the possibility to comment on them, could be quite valuable for people who want to start thinking about such approaches. There is also always the possibility of someone simply having a really great idea (or of person A having a flawed idea containing the seed of a great one, which inspires person B to propose a fix). Would other people also find this useful?

On the other hand, given the possible drawbacks of such public discourse (proposals outside the Overton window representing a PR problem, or some proposals only being feasible if they are not publicly announced), are there other strategies for reaping the benefits of many people with different backgrounds thinking about this problem? One idea that comes to mind: a non-public essay contest in which people submit a description of a possible pivotal act together with their own analysis of its feasibility. The submissions could be read by a panel of trusted experts (trusted both in the competence of their judgement and in their confidentiality). Harmless but insightful proposals could be released to the public; dangerous and/or non-insightful ones could be returned to their authors with a brief explanation of why they are deemed a bad idea; and promising ones could be brought to the attention of people with the resources to pursue them further.

Concerning point 35 about playing AIs off against each other: I analyzed a particular scenario like this in a recent post and also came to the conclusion that cooperation between the AIs is the default outcome in many scenarios. However, in the last subsection of that post, I start thinking about some ways to prevent an acausal trade of the kind Eliezer describes here (committing to sharing the universe with any AI reviewing the code). The idea is roughly that the code, and as much information as possible about the AI doing the checking, is deleted before the checked nanosystem is released. That way, the escaping AI cannot share half the universe with its checker, since not enough information about the checker and its goals remains in the universe to enact this compromise. This reduces the checker's incentive to cooperate by hiding alignment faults.

The post above describes various caveats and other things that could go wrong in such a setup. I tried my best at being pessimistic and pointing out various failure modes.

I think that "the value alignment problem" is not something that currently has a universally acknowledged and precise definition and a lot of the work that is currently being done is to get less confused about what is meant by this.

From what I can see, your proof starts from one particular meaning of this term and then shows that it is impossible to satisfy.

Which means that human values, or at least the individual non-morality-based values don’t converge, which means that you can’t design an artificial superintelligence that contains a term for all human values

Here you observe that if "the value alignment problem" means constructing something that holds the values of all humans at the same time, it is impossible, because there exist humans with contradictory values. So you propose the new definition "to construct something with all human moral values". You then observe that the four moral values you give are also contradictory, so this is also impossible.

And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time,

So now we are looking at the definition "to program for the four different utility functions at the same time". As has been observed in another comment, this is somewhat underspecified, and there are different ways to interpret and implement it. For one such way, you predict:

that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.

It seems to me that the scenario behind this course of events would be: we build an AI and give it the four moralities; noticing their internal contradictions, it analyzes them and finds that they serve the purpose of conflict resolution; it then makes conflict resolution its new, consistent goal and builds the tiny conflict scenarios. I'm not saying that this is implausible, but I don't think it is a course of events without alternatives (and the alternatives would depend on the way the AI is built to resolve conflicting goals).
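To make that dependence on the conflict-resolution mechanism concrete, here is a minimal toy sketch in Python. The candidate outcomes and all utility numbers are made up purely for illustration (they are my assumptions, not taken from the original post); the only point is that the same four conflicting utility functions, aggregated by two different rules, can select different outcomes.

```python
# Toy illustration with entirely made-up numbers: four conflicting "morality"
# utility functions scored over a handful of hypothetical candidate outcomes.
# The only point is that which outcome gets chosen depends on the rule used to
# aggregate the conflicting goals, not that either rule is the right one.

CANDIDATES = ["tile_with_conflicts", "maximize_flourishing", "status_quo", "compromise"]

# Hypothetical utility of each candidate outcome under the four moralities.
UTILITIES = {
    "morality_1": {"tile_with_conflicts": 10, "maximize_flourishing": 9, "status_quo": 3, "compromise": 6},
    "morality_2": {"tile_with_conflicts": 10, "maximize_flourishing": 1, "status_quo": 3, "compromise": 6},
    "morality_3": {"tile_with_conflicts": 10, "maximize_flourishing": 5, "status_quo": 3, "compromise": 6},
    "morality_4": {"tile_with_conflicts": 0,  "maximize_flourishing": 3, "status_quo": 3, "compromise": 6},
}

def weighted_sum(candidate):
    # Additive aggregation with equal weights: maximize the total score.
    return sum(scores[candidate] for scores in UTILITIES.values())

def maximin(candidate):
    # Maximin aggregation: judge a candidate by its worst score across the moralities.
    return min(scores[candidate] for scores in UTILITIES.values())

for rule_name, rule in [("weighted sum", weighted_sum), ("maximin", maximin)]:
    best = max(CANDIDATES, key=rule)
    print(f"{rule_name:12s} -> {best}")

# With these made-up numbers, "weighted sum" picks tile_with_conflicts (total 30)
# while "maximin" picks compromise (worst-case score 6): the same four utility
# functions, aggregated differently, lead to different behavior.
```

Obviously this toy example says nothing about what an actual AI would do; it only illustrates that "optimize for the four competing utility functions at the same time" leaves the crucial aggregation step unspecified.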

To summarize, I think out of the possible specifications of "the value alignment problem", you picked three (all human values, all human moral values, "optimizing the four moralities") and showed that the first two are impossible and the third leads to undesired consequences (under some further assumptions).

However, I think there are many things that people would consider a solution to "the value alignment problem" and that don't fall under any of these three descriptions. Maybe there is a contradiction-free subset of human values such that most people would be reasonably happy with the result of a superhuman AI optimizing these values. Maybe an AI maximizing only the "Maximize Flourishing" morality would lead to a decent future. I would be the first to admit that the scenarios I describe are themselves severely underspecified, vaguely waving at a subset of the possibility space, but I imagine that these subsets could contain things we would call "a solution to the value alignment problem".