Wei Dai

I think I need more practice talking with people in real time (about intellectual topics). (I've gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

www.weidai.com

Comments

Wei Dai

> It therefore seems perfectly plausible for AIs to simply get rich within the system we have already established, and make productive compromises, rather than violently overthrowing the system itself.

So assuming that AIs get rich peacefully within the system we have already established, we'll end up with a situation in which ASIs produce all value in the economy, and humans produce nothing but receive an income and consume a bunch, through ownership of capital and/or taxing the ASIs. This part should be non-controversial, right?

At this point, it becomes a coordination problem for the ASIs to switch to a system in which humans no longer exist or no longer receive any income, and the ASIs get to consume or reinvest everything they produce. You're essentially betting that ASIs can't find a way to solve this coordination problem. This seems like a bad bet to me. (Intuitively it just doesn't seem like a very hard problem, relative to what I imagine the capabilities of the ASIs to be.)

> I'm simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are unless they're value aligned. This is a claim that I don't think has been established with any reasonable degree of rigor.

I don't know how to establish anything post-ASI "with any reasonable degree of rigor", but the above is an argument I recently thought of, which seems convincing, although of course you may disagree. (If someone has expressed this or a similar argument previously, please let me know.)

  1. Why? Perhaps we'd do it out of moral uncertainty, thinking maybe we owe something to our former selves, but future people probably won't think this.
  2. Currently our utility is roughly logarithmic in money, partly because we spend money on instrumental goals and there are diminishing returns as the limited opportunities get used up. This won't be true of future utilitarians spending resources on their terminal values. So a "one in a hundred million" fraction of resources is a much bigger deal to them than to us (see the rough comparison below).
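To make that contrast concrete, here is a rough illustration (my own, not from the original comment; W stands for the total resource endowment, f = 10^-8 for the fraction at stake, and comparing utilities across different agents is loose at best):

```latex
% Utility change from gaining (or giving up) an extra slice fW on top of an endowment W.
% Log utility (roughly us today): the change is bounded and independent of W.
\Delta u_{\log} = \ln\bigl((1+f)W\bigr) - \ln W = \ln(1+f) \approx f = 10^{-8}
% Linear utility (utilitarians with terminal values over resources): the change scales with W.
\Delta u_{\mathrm{lin}} = (1+f)W - W = fW = 10^{-8}\,W
```

As W grows to astronomical scale, the same fractional slice stays negligible on the logarithmic curve but corresponds to an ever-larger absolute amount of terminal value on the linear one.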
Wei Dai

I have a slightly different take, which is that we can't commit to doing this scheme even if we want to, because I don't see what we can do today that would warrant the term "commitment", i.e., would be binding on our post-singularity selves.

In either case (we can't or don't commit), the argument in the OP loses a lot of its force, because we don't know whether post-singularity humans will decide to do this kind of scheme or not.

> So the commitment I want to make is just my current self yelling at my future self, that "no, you should still bail us out even if 'you' don't have a skin in the game anymore". I expect myself to keep my word that I would probably honor a commitment like that, even if trading away 10 planets for 1 no longer seems like that good of an idea.

This doesn't make much sense to me. Why would your future self "honor a commitment like that", if the "commitment" is essentially just one agent yelling at another agent to do something the second agent doesn't want to do? I don't understand what moral (or physical or motivational) force your "commitment" is supposed to have on your future self, if your future self does not already think doing the simulation trade is a good idea.

I mean imagine if as a kid you made a "commitment" in the form of yelling at your future self that if you ever had lots of money you'd spend it all on comic books and action figures. Now as an adult you'd just ignore it, right?

> Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.

The meta problem here is that you gave a "proof" (in quotes because I haven't verified it myself as correct) using your own definitions of "aligned" and "superintelligence", but if people asserting that it's not possible in principle have different definitions in mind, then you haven't actually shown them to be incorrect.

Wei Dai

Apparently the current funding round hasn't closed yet and might be in some trouble, and it seems much better for the world if the round were to fail or be done at a significantly lower valuation (in part to send a message to other CEOs not to imitate SamA's recent behavior). Zvi saying that $150B greatly undervalues OpenAI at this time seems like a big unforced error, and I wonder if he could still correct it in some way.

> What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?

I'm very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?

> as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what's happening in a way that corrupts thoughts which previously implemented values.

Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can't definitively solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:

  1. Is it better to have no competition or some competition, and what kind? (Past "moral/philosophical progress" might have been caused or spread by competitive dynamics.)
  2. How should social status work in CEV? (Past "progress" might have been driven by people motivated by certain kinds of status.)
  3. No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)

> can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its "true wants, needs, and hopes for the future"?

I think this is worth thinking about as well, as a parallel approach to the above. It seems related to metaphilosophy in that if we can discover what "correct philosophical reasoning" is, we can solve this problem by asking "What would this chunk of matter conclude if it were to follow correct philosophical reasoning?"

Wei Dai

As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:

> Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
>
> When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write-up their responses—a complete violation of Scale’s raison d’être.

So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is preferred?
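Here's a minimal sketch of why that kind of cheating would be hard to catch (my own illustration; `human_label`, `llm_judge`, and the record format are hypothetical stand-ins, not anything Scale or Meta actually uses). The point is that a pairwise preference label, unlike a free-text answer, carries no stylistic fingerprint of whoever produced it:

```python
# Sketch of an RLHF preference-collection step under the above assumptions.
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b" -- the only thing that reaches reward-model training


def human_label(prompt: str, a: str, b: str) -> str:
    """Stand-in for a contractor judging the pair themselves."""
    return random.choice(["a", "b"])


def llm_judge(prompt: str, a: str, b: str) -> str:
    """Stand-in for the contractor quietly delegating the judgment to a chatbot."""
    return random.choice(["a", "b"])


def collect_preference(prompt: str, a: str, b: str, labeler) -> PreferencePair:
    # The stored record is identical either way; the labeler's identity is never kept.
    return PreferencePair(prompt, a, b, preferred=labeler(prompt, a, b))


# A free-text answer can betray its origin ("as an AI language model..."),
# but a bare "a"/"b" choice cannot.
example = collect_preference("Explain RLHF.", "Answer A...", "Answer B...", llm_judge)
print(example.preferred)
```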

BTW here's a poem(?) I wrote for Twitter, actually before coming across the above story:

The people try to align the board.
The board tries to align the CEO.
The CEO tries to align the managers.
The managers try to align the employees.
The employees try to align the contractors.
The contractors sneak the work off to the AI.
The AI tries to align the AI.

> but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.

Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know that their CEV won't be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very distrustful of such a person/group from the start.

> Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

Would be interested in understanding your perspective on this better. I feel like aside from AI, our world is not being eaten by molochs very quickly, and I prefer something like stopping AI development and doing (voluntary and subsidized) embryo selection to increase human intelligence for a few generations, then letting the smarter humans decide what to do next. (Please contact me via PM if you want to have a chat about this.)

AI companies don't seem to be shy about copying RLHF though. Llama, Gemini, and Grok are all explicitly labeled as using RLHF.
