Could you explain what exactly your organization is intending to do?
This post envisions a world where various kinds of decision-making work differently from how they work at present, and involve various mechanisms that don't currently exist. But it's not clear to what extent you're proposing (1) to try to bring that world into existence, (2) to build the mechanisms that it requires, or (3) to provide services that would be useful in that world.
It also gestures towards some sort of connection between the political reforms you propose and AI alignment, but I don't really understand what the connection is supposed to be. It seems like you hope (1) to contribute to AI alignment by "solving the human alignment problem" and thus discovering "what human values actually are", and (2) for your organization to offer "AI alignment certification". But I don't understand (1) how even a very smoothly functioning coordination-markets-and-liquid-democracy system would tell us "what human values actually are" in any sense sufficient to be relevant to AI alignment, nor (2) what "AI alignment certification" is supposed to mean or why an organization mostly dedicated to political reform would be either competent to offer it or trusted to do so.
All very good questions. Let me try to answer them in order:
I plan to do 1/2/3 in parallel. I will start small (maybe the city of Berkeley, where I live) and increase scope as the technology is proven to work.
1 - human values are what people think they are. To the extent that different humans have incompatible values, the AI alignment problem is strongly unsolvable.
2.A - AI alignment certification would be sort of like a driver's license for AIs. An AI needs to have a valid license in order to be hooked up to the internet and switched on. Every time the AI does something dishonorable, its license is revoked until mechanistic interpretability researchers can debug the issue.
2.B - I don't think the organization would be trusted to certify AIs right away. I think the organization would have to build up a lot of goodwill from doing good works, turn that into capital somehow, and then hire a world-class mechanistic interpretability team from elsewhere. Or the organization could simply contract out the mechanistic interpretability work to other, more experienced labs.
To the extent that different humans have incompatible values, the AI alignment problem is strongly unsolvable.
There are whole realms of theory about how to reconcile orthogonal values.
On (1):
It seems incredibly unlikely to me that your organization is going to make it no longer true that people have incompatible values.
If "AI alignment" is taken to mean "the AI wants exactly the same things that humans want" and hence to imply "all humans want the same things" then, sure, mutually incompatible human values => no AI alignment. But I don't think that's what any reasonable person takes "AI alignment" to mean. I would consider that we'd done a pretty good job of "AI alignment" if, say, the state of the world 20 years after the first superhuman AI was such that for all times between now and then, (1) >= 75% of living humans (would) consider the post-AI state better than the pre-AI state and (2) <= 10% of living humans (would) consider the post-AI state much worse than the pre-AI state. (Or something along those lines.) And I don't see why anything along these lines requires humans never to have incompatible values.
But never mind that: I still don't see how your coordination-market system could possibly make it no longer true that humans sometimes have incompatible values.
On (2):
I still don't see how your proposed political-reform organization would be in any way suited to issuing "AI alignment certification", if that were a thing. And, since you say "hire a world-class mechanistic interpretability team from elsewhere", it sounds as if you don't either. So I don't understand why any of that stuff is in your post; it seems entirely irrelevant to the organization you're actually hoping to build.
Well, fair enough I suppose. I was personally excited about the AI alignment piece, and thought that coordination markets would help with that.
Humans have always held incompatible values, and always will. That's why we feel the need to murder each other with such frequency. But, as Steven Pinker argues, while we murder each other in greater absolute numbers, we also do it at an ever-lower per-capita rate. Maybe this will converge to a world in which a superhuman AI knows approximately what is expected of it. Maybe it won't, I don't know.
AI alignment certification and peacebuilding seem like two very different and distinct projects. I'd strongly suggest picking one.
Agreed. You'll bifurcate the mission and end up doing both things worse than you would have done if you'd just picked one and focused.
Well OK. I guess I'll pick the peacebuilding one. I am 99% convinced that both can be done and each will strengthen the other, but the choice is quite easy if I have to choose. I can do alignment work on the side.
Our alignment philosophy is simple: we cannot align AIs to human values until we know approximately what human values actually are, and we cannot know that until we solve the human alignment problem.
what do you make of coherent extrapolated volition, which is the usual proposal for solving alignment without having a full understanding of our values?
what do you mean by "human alignment problem"? here it seems that you mean "understanding the values of humans", but many people use that term to mean a variety of things (usually they use it to mean "making humans aligned with one another")
I think CEV is approximately the right framework. The real correct framework would be something like PAC CEV (CEV in a probably-approximately-correct sense).
I'm using "human alignment problem" to mean making people love their neighbor as themselves. Again, PAC is the best you're ever going to get.
"human alignment" as you put it seems undesirable to me — i want people to get their values satisfied and then conflicts resolved in some reasonable manner, i don't want to change people's values so they're easier to satisfy-all-at-once. changing other people's values is very rude and, almost always, a violation of their current values.
any idea how you'd envision "making people love their neighbor as themselves"? sounds like modifying everyone on earth like that would be much more difficult than, say, changing the mind of the people who would make the AIs that are gonna kill everyone.
I agree with this. It's super undesirable. On the other hand, so are wars and famines and what have you. Tradeoffs exist.
Think of it like the financial system. Some people are going for a high score in the money economy, and that powers both good and bad things. If we built coordination markets, some people would hyperfixate on them in a very unhealthy way, become fabulously wealthy in reputation terms, and then be exposed as child molesters or what have you. Again, tradeoffs exist.
oh, so this is a temporary before-AI-inevitably-either-kills-everyone-or-solves-everything thing, not a plan for making the AI-that-solves-everything-including-X-risk?
It's an adjunct to the AI that solves everything, maybe? It can coexist with everything else in human society, and I would argue that it will improve those things along all the axes that any of us have a right to care about.
And like, the only way you can get people to stop building the AI that's gonna kill everyone is some sort of massive labor strike against the companies building that stuff. Another enormous coordination problem - it's not in any one capabilities researcher's self-interest to stop the train, but if the train doesn't slow down, then we all die.
we cannot align AIs to human values until we know approximately what human values actually are, and we cannot know that until we solve the human alignment problem
Fairly sure this isn't true, and isn't how things are going to go. Instead we're going to explain what it means for a thing to have values, then make a system that investigates us and figures out what our values are, as it pursues them to the extent that it understands them, by whatever means seem best to it at the time (discussed a bit by Stuart Russell, and in work on value learning and 'inverse reinforcement learning'). How load-bearing is this assumption to your strategy?
And I'd anticipate that human peace is way harder than AGI-supported peace, for various reasons (cognitive opacity, and intelligence having superlinear returns, i.e. human myopia having superlinear costs); the two problems probably aren't continuous. I work on this sort of thing, systems for human peace, and I don't want to exaggerate its importance: AGI will mostly not need our systems.
I wasn't planning to use any particularly serious AI in building it, beyond the coordination market framework and maybe a language model where people can argue with a fictional but realistic sparring partner.
It's going to turn out that having an LLM ask "but are you sure that's correct?" 7 times increases the reflective consistency of people's votes, and I am going to lmao.
You say this like it's a devastating putdown, or invalidates my idea. I think you're right that this is what would happen, but I think hooking that up to a state-of-the-art language model could cure schizophrenic delusions and mass delusions like 9/11 trutherism, and do all sorts of amazing things like that. Maybe not right away (schizophrenics are not generally known for engaging with reality diligently and aggressively). But over time, this is a capability I predict would emerge from the system I am trying to build. Why should anyone believe that? I don't know.
You say this like it's a devastating putdown, or invalidates my idea.
No, not laughing at you, laughing at the absurdity of the human critter, the fact that rubberducking works and so on.
I would like to announce my new alignment organization. We have no funding as of yet, but we have a lot of exciting plans and schemes.
Our alignment philosophy is simple: we cannot align AIs to human values until we know approximately what human values actually are, and we cannot know that until we solve the human alignment problem. Thus we will operate both as an AI alignment certification organization and as a peacebuilding organization that resolves human-on-human conflicts.
Our main novel idea is the Coordination Market. This is the equivalent of a prediction market for issues that are matters of opinion rather than matters of fact. In the rest of this post, we outline how a coordination market works and provide a worked example: allowing more housing to be built in the Bay Area.
A coordination market has futures that resolve upon a specified event in the real world, just like a prediction market. So, for the Bay Area housing example, the futures will resolve when a proposed development is approved, built, and open for tenants to move into.
The futures are associated with different drafts of proposals. People betting in a coordination market are trying to put their money on the proposal that will end up winning. Since 90% of assembling a winning political coalition is signaling that you already have most of the coalition you need, these bets also function as a sort of "thumb on the scale": people with lots of money can influence the outcome a little, but if their thumb is too heavy, they'll lose whatever money they're using to weight the scale, and they won't have a thumb to put on it the next time the issue comes up.
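To make these mechanics concrete, here is a minimal sketch of a coordination market in Python. It assumes a simple parimutuel-style payout, which this post does not actually specify; all names and numbers are illustrative, not a spec.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """One draft proposal competing in a coordination market."""
    name: str
    drafter: str
    bets: dict = field(default_factory=dict)  # bettor -> amount staked on this draft

@dataclass
class CoordinationMarket:
    """Futures over competing proposal drafts; they resolve when one draft
    is actually enacted in the real world (e.g. the development opens)."""
    issue: str
    proposals: list = field(default_factory=list)

    def bet(self, proposal, bettor, amount):
        """Stake money on the draft you expect to end up winning."""
        proposal.bets[bettor] = proposal.bets.get(bettor, 0) + amount

    def resolve(self, winner):
        """Pay the whole pot to the winning draft's backers, pro rata to their stakes.
        (The drafter's rake, described later in the post, is omitted here for simplicity.)"""
        pot = sum(sum(p.bets.values()) for p in self.proposals)
        staked_on_winner = sum(winner.bets.values())
        return {
            bettor: pot * stake / staked_on_winner
            for bettor, stake in winner.bets.items()
        }

# Illustrative usage:
market = CoordinationMarket(issue="More housing in Berkeley")
a = Proposal(name="120-unit development", drafter="dev_co")
b = Proposal(name="Smaller infill project", drafter="neighbors")
market.proposals += [a, b]
market.bet(a, "alice", 60)
market.bet(b, "bob", 40)
print(market.resolve(a))  # {'alice': 100.0}
```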
In parallel to the coordination market, we add a liquid democracy component to gauge public opinion in the relevant part of the world. For instance, to build anything in Berkeley one presumably needs a coalition of voters sufficient to replace whoever is responsible for appointing the zoning board. So a developer would need to assemble a coalition of likely voters large enough to intimidate any zoning board members who stood in the way of the project; at the same time, the developer could only assemble this coalition by genuinely seeking and obtaining meaningful consent from local stakeholders. So the liquid democracy component does a nice job of balancing local democratic ideals against the overwhelming public need for more housing in Berkeley.
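A minimal sketch of the liquid-democracy piece, assuming the simplest possible delegation scheme (the function name and data shapes are illustrative): each participant either votes directly or delegates to someone they trust, and delegated weight flows along the chain.

```python
def tally_liquid_votes(delegations, votes):
    """Resolve liquid-democracy delegations into vote tallies.

    delegations: voter -> delegate (voters absent from this dict vote directly)
    votes:       voter -> choice   (only direct voters appear here)
    Returns choice -> total weight.
    """
    def terminal(voter, seen=()):
        # Follow the delegation chain; a delegation cycle counts as an abstention.
        if voter in seen:
            return None
        if voter in delegations:
            return terminal(delegations[voter], seen + (voter,))
        return voter

    tally = {}
    for voter in set(delegations) | set(votes):
        choice = votes.get(terminal(voter))
        if choice is not None:
            tally[choice] = tally.get(choice, 0) + 1
    return tally

# Example: Alice delegates to Bob, Bob votes "approve", Carol votes "reject".
# "approve" gets 2 (Bob's own vote plus Alice's delegated one), "reject" gets 1.
print(tally_liquid_votes({"alice": "bob"}, {"bob": "approve", "carol": "reject"}))
```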
We also envision a forum that people could post on, intertwined with the liquid democracy structure: posts would be ranked and decorated according to the amount of delegated voting weight the poster has acquired. If Peacecraft ever receives any funding, people would eventually receive compensation for their posts commensurate with the trust the community has placed in them. This would have an effect similar to congresspeople being able to draw salaries and hire staffers.
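One way the forum ranking could work, purely as a sketch (the helper name and data are hypothetical): weight each post by the liquid-democracy voting weight its author currently holds.

```python
def rank_posts(posts, weight_of_author):
    """Order forum posts by the delegated voting weight their authors hold.

    posts:            list of (author, text) pairs
    weight_of_author: author -> current liquid-democracy weight
    """
    return sorted(posts, key=lambda post: weight_of_author.get(post[0], 0), reverse=True)

posts = [("carol", "Zoning reform draft #3"), ("dave", "Build nothing anywhere")]
print(rank_posts(posts, {"carol": 12, "dave": 2}))  # carol's post ranks first
```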
Finally, whoever drafts the winning proposal in a coordination market receives a fraction of the total payout called the "rake". The rake is specified by the proposal's creator at the time the proposal is created. It is sort of like the spread in a traditional market: a reward for good market-making services. The drafter of the winning proposal also receives the right to administer whatever solution is adopted. For housing developments, this would be the right to actually build the development and receive the profits from doing so. For more nebulous issues like "animal rights", the administrators of the coordination market would have to decide on a case-by-case basis what that right would mean.
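A worked example of the rake, with illustrative numbers (none of these figures are specified anywhere in this post):

```python
pot = 100_000          # total staked across all drafts in the market
rake_fraction = 0.02   # chosen by the drafter when the proposal is created

rake = pot * rake_fraction   # 2,000 goes to the winning drafter
payout_pool = pot - rake     # 98,000 is split pro rata among the winning draft's backers
print(rake, payout_pool)     # 2000.0 98000.0
```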
Thank you for reading my post, and please wish us luck!