Morphism

AI Alignment researcher. Alternatively known as Pi Rogers.

Comments

Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.

The word "preference" is overloaded (and so are related words like "want"). It can refer to one of two things:

  • How you want the world to be, i.e. your terminal values, e.g. "I prefer worlds in which people don't needlessly suffer."
  • What makes you happy, e.g. "I prefer my ice cream in a waffle cone."

I'm not sure how we should distinguish these. So far, my best idea is to call the former "global preferences" and the latter "local preferences", but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime. Does anyone have a better name for this distinction?

I think we definitely need to distinguish them, however, because they often disagree, and most "values disagreements" between people are just disagreements in local preferences, and so could be resolved by considering global preferences.

I may write a longpost at some point on the nuances of local/global preference aggregation.

Example: Two alignment researchers, Alice and Bob, both want access to a limited supply of compute. The rest of this example is left as an exercise.

Emotions can be treated as properties of the world, optimized with respect to constraints like anything else. We can't edit our emotions directly but we can influence them.

Oh no, I mean they keep the private key on the client side and decrypt the post there.

Ideally all of this is behind a nice UI, like Signal.

I mean, Signal messenger has worked pretty well in my experience.

But safety research can actually disproportionately help capabilities, e.g. the development of RLHF allowed OAI to turn their weird text predictors into a very generally useful product.

I could see embedded agency being harmful though, since an actual implementation of it would be really useful for inner alignment.

Some off the top of my head:

  • Outer Alignment Research (e.g. analytic moral philosophy in an attempt to extrapolate CEV) seems to be totally useless to capabilities, so we should almost definitely publish that.
  • Evals for Governance? Not sure about this since a lot of eval research helps capabilities, but if it leads to regulation that lengthens timelines, it could be net positive.

Edit: Oops, I didn't see Tammy's comment.

Idea:

Have everyone who wants to share and receive potentially exfohazardous ideas/research send out a 4096-bit RSA public key.

Then, make a clone of the alignment forum, where every time you make a post, you provide a list of the public keys of the people who you want to see the post. The client then encrypts the post for all of those public keys before uploading it, so the server only ever holds encrypted posts.

Then, users can put in their own private key to see a post. The encrypted post gets downloaded to the user's machine and is decrypted on the client side. Perhaps require users to be on open-source browsers for extra security.
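A rough sketch of what the client-side part could look like, in Python. This is an illustration under assumptions, not a vetted design: it uses the third-party `cryptography` package, and instead of encrypting the post body directly with RSA (which can only handle short messages), it swaps in the usual hybrid approach, where the post is encrypted once with a random AES-256-GCM key and that key is wrapped separately with each recipient's RSA public key using OAEP. The function names, the recipient names, and the envelope layout are all hypothetical.

```python
from __future__ import annotations

import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# OAEP with SHA-256 is an assumed choice; the original idea only says "RSA".
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def generate_keypair(key_size: int = 4096) -> rsa.RSAPrivateKey:
    """Create an RSA private key; the public half is key.public_key()."""
    return rsa.generate_private_key(public_exponent=65537, key_size=key_size)


def encrypt_post(plaintext: bytes, recipients: dict[str, rsa.RSAPublicKey]) -> dict:
    """Encrypt a post on the client so the server only stores ciphertext.

    `recipients` maps a (hypothetical) user name to that user's public key.
    """
    content_key = AESGCM.generate_key(bit_length=256)  # fresh key per post
    nonce = os.urandom(12)  # 96-bit nonce, never reused with the same key
    ciphertext = AESGCM(content_key).encrypt(nonce, plaintext, None)
    # Wrap the same content key once per recipient.
    wrapped_keys = {
        name: public_key.encrypt(content_key, OAEP)
        for name, public_key in recipients.items()
    }
    return {"nonce": nonce, "ciphertext": ciphertext, "wrapped_keys": wrapped_keys}


def decrypt_post(envelope: dict, name: str, private_key: rsa.RSAPrivateKey) -> bytes:
    """Unwrap the content key with the reader's private key, then decrypt."""
    content_key = private_key.decrypt(envelope["wrapped_keys"][name], OAEP)
    return AESGCM(content_key).decrypt(
        envelope["nonce"], envelope["ciphertext"], None
    )
```

Usage would be something like `envelope = encrypt_post(b"draft", {"alice": alice_key.public_key()})` on the author's machine, uploading `envelope`, and `decrypt_post(envelope, "alice", alice_key)` on the reader's machine. The sketch leaves out everything hard: verifying that a public key really belongs to the person you think it does, signing posts, key rotation, and the post-quantum layer.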

Maybe also add some post-quantum thing like what Signal uses so that we don't all die when quantum computers get good enough.

Should I build this?

Is there someone else here more experienced with csec who should build this instead?

Is this a massive exfohazard? Should this have been published?

Yikes, I'm not even comfortable maximizing my own CEV.

What do you think of this post by Tammy?

Where is the longer version of this? I do want to read it. :)

Well perhaps I should write it :)

Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Mostly that thing where we had a lying vs lie-detecting arms race, and the liars mostly won by believing their own lies. That's how we ended up with overconfidence bias, self-serving bias, and a whole bunch of other biases. I think Yudkowsky and/or Hanson have written about this.

Unless we do a very stupid thing like reading the AI's thoughts and RL-punishing wrongthink, this seems very unlikely to happen.

If we give the AI no reason to self-deceive, the natural instrumentally convergent incentive is to not self-deceive, so it won't self-deceive.

Again, though, I'm not super confident in this. Deep deception or similar could really screw us over.

Also, how does RL fit into QACI? Can you point me to where this is discussed?

I have no idea how Tammy plans to "train" the inner-aligned singleton on which QACI is implemented, but I think it will be closer to RL than SL in the ways that matter here.
