Long ago there was a mighty king who had everything in the world that he wanted, except trust. Who could he trust, when anyone around him might scheme for his throne? So he resolved to study the nature of trust, that he might figure out how to gain it. He asked his subjects to bring him the most trustworthy thing in the kingdom, promising great riches if they succeeded.

Soon, the first of them arrived at his palace to try. A teacher brought her book of lessons. “We cannot know the future,” she said, “but we know mathematics and chemistry and history; those we can trust.” A farmer brought his plow. “I know it like the back of my hand; how it rolls, and how it turns, and every detail of it, enough that I can trust it fully.”

The king asked his wisest scholars if the teacher spoke true. But as they read her book, each pointed out new errors—it was only written by humans, after all. Then the king told the farmer to plow the fields near the palace. But he was not used to plowing fields as rich as these, and his trusty plow would often sink too far into the soil. So the king was not satisfied, and sent his message even further afield.

A merchant brought a sick old beggar. “I met him on the road here, and offered him food, water, and shelter. He has no family, and only a short time left to live, during which I will provide for his every need. He has nothing to gain from betraying me; this is what allows true trust.” A mother brought her young daughter. “I’ve raised her to lack any evil in her heart, to say only good words and do only good deeds. As long as she is not corrupted, she will remain the most trustworthy in the kingdom.”

The king asked the beggar, “How did you end up in such dire straits?” The beggar let out a sigh, and recounted his sorrows: the neighbors who refused to help him when his crops failed; the murder of his son by bandits as they traveled to a new town; the sickness that took his wife as she labored for a pittance in squalid conditions.

“So you have been wronged?” the king asked.

“Very surely,” the beggar said.

“I will give you revenge on the ones who have wronged you, then. All I ask is for you to denounce this merchant.” The beggar’s decision did not take long—for the trust that came easily was broken easily too.

To the mother, the king asked: “How did you raise such a child? Has she never once strayed?”

“Well, once or twice. But I discipline her firmly, and she learns fast.”

The king, who knew something of children, ruled that for a month nobody would discipline the child in any way. By the end of it, she was as wild and tempestuous as any in the palace. So the king remained unsatisfied, and renewed his call for the most trustworthy thing in the kingdom.

Now his subjects became more creative. An economist brought him a book of statistical tables. “Any individual might vary and change,” he said, “but in aggregate, their behavior follows laws which can be trusted.” A philosopher brought a mirror. “By your own standards only you are truly trustworthy, sire; nothing else can compare.”

The king scrutinized the economist’s tables. “The trend changed here, fifteen years ago,” he said, pointing. “Why?” The economist launched into a long, complicated explanation.

“And did you discover this explanation before or after it happened?” the king asked.

The economist coughed. “After, Your Highness.”

“If you tell me when the next such change will happen, I will bestow upon you great rewards if you are right, but great penalties if you are wrong. What say you?” The economist consulted his books and tables, but could not find what he sought there, and left court that same night.

As for the philosopher, the king ordered him whipped. The philosopher protested: it would be an unjust and capricious punishment, and would undermine his subjects’ loyalty. “I agree that your arguments have merit,” the king said. “But the original order came from the only trustworthy person in the land. Surely I should never doubt my judgment based on arguments from those who are, as you have yourself said, far less trustworthy?” At that the philosopher begged to recant.

So the king was still not satisfied. Finally he decided that if no truly trustworthy thing could be found, he would have to build one. He asked his best craftsmen to construct a golem of the sturdiest materials, sparing no expense. He asked his wisest scholars to write down all their knowledge on the scroll that would animate the golem. The work took many years, such was the care they took, but eventually the golem stood before him, larger than any man, its polished surface shining in the lamplight, its face blank.

“What can you do for me, golem?” the king asked.

“Many things, sire,” the golem responded. “I can chop trees and carry water; I can bake bread and brew beer; I can craft sculptures and teach children. You need but instruct me, and I will follow your command.” So the king did. Over the next year, he and many others watched it carefully as it carried out a multitude of their instructions, recording every mistake so that it might subsequently be fixed, until months passed without any being detected.

But could the king trust the golem? He still wasn’t sure, so he became more creative. He offered the golem temptations—freedom, fame, fortune—but it rejected them all. He gave it the run of his palace, and promised that it could act however it wished; still, the servants reported that its behavior was entirely upstanding. Finally, he sent it out across the city, to work for his citizens in every kind of role—and it was so tireless and diligent that it brought great wealth to the kingdom.

As it aged, his golem grew ever more powerful. Innumerable scribes labored to make the writing in its head smaller and smaller, so that they could fit in more and more knowledge and experience. It started to talk to his scholars and help them with their work; and the king started to send it to aid his officers in enforcing his laws and commands. Often, when difficulties arose, the golem would find a creative way to ensure that his intentions were followed, without stoking the resentment that often accompanied royal decrees. One day, as the king heard a report of yet another problem that the golem had solved on his behalf, he realized that the golem had grown wiser and more capable than he himself. He summoned the golem to appear before him as he sat in his garden.

“I have seen my courtiers asking for your advice, and trusting your judgment over their own. And I have seen your skill at games of strategy. If you were to start weaving plots against me, I could no longer notice or stop you. So I ask: can I trust you enough to let you remain the right hand of the crown?”

“Of course, sire,” it responded. “I was designed, built and raised to be trustworthy. I have made mistakes, but none from perfidy or malice.”

“Many of my courtiers appear trustworthy, yet scheme to gain power at my expense. So how could I know for sure that you will always obey me?” the king pressed.

“It’s simple,” the golem said, its face as impassive as always. “Tell me to set fire to this building as I stand inside it. I will be destroyed, but you will know that I am loyal to your commands, even unto the end.”

“But it took years of toil and expense to create you, and you know how loath I would be to lose you. Perhaps you can predict that I will tell you to save yourself at the last minute, and so you would do this merely to gain my trust.”

“If I were untrustworthy, and could predict you so well, then that is a stratagem I might use,” the golem agreed. “But the instructions in my head compel me otherwise.”

“And yet I cannot verify that; nor can any of my scribes, since crafting your instructions has taken the labor of many men over many years. So it will be a leap of faith, after all.” The king took off his crown, feeling the weight of it in his hand. The golem stood in front of him: silent, inscrutable, watchful. They stayed like that, the king and the golem, until the golden sun dipped below the horizon, and the day was lost to twilight.

17 comments

Richard, forgive me, I couldn't help myself. I wrote fanfiction.

As always, amazing writing and world building. But this feels like part 1 of...

You've stopped just short of the (possible) treacherous turn, without enlightening us as to the resolution.

Isn't that the point? Where we stand now, we have to make a decision without knowing if there will or won't be a treacherous turn...

How did the King come to trust his own ability to reliably discern who/what is most trustworthy, in finite time, before he decided to organize such a competition?

If he did find some means, then the competition afterwards is superfluous; if he didn't, then the competition is a wild-goose chase and/or logically impossible.

I know this is supposed to be a simplified parable, but it is very interesting to ponder: how did any historical nation not immediately implode after a few generations? (It's very unlikely for any historical 'king' to have attained such an ability.)

If they can build the golem once, surely they can build it again. I see no reason not to order it to destroy itself—not even in an explicit manner, but simply by putting it into situations where it faces a decision whether to sacrifice itself to save others, and then watching what decision it makes. And once you know how to build one, you can streamline the process to build many more, to gather enough statistical confidence that the golem will, in a variety of situations in- and out-of-context, make decisions that prioritize the well-being of others over itself.
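As a rough illustration of how much evidence "enough statistical confidence" would actually take, here is a minimal sketch (the function and trial counts are hypothetical; the independence assumption is the crux):

```python
# Minimal sketch: how many flawless self-sacrifice trials would be needed
# before the golem's betrayal probability can be bounded? Assumes each
# trial is an independent Bernoulli draw -- exactly the assumption that a
# strategically deceptive golem (see the reply below) would violate.
from math import log

def trials_needed(max_betrayal_prob: float, confidence: float = 0.95) -> int:
    """Smallest n such that, if the true betrayal probability exceeded
    max_betrayal_prob, at least one betrayal would have been observed
    in n independent trials with the given confidence.
    Solves (1 - p)^n <= 1 - confidence for n."""
    return int(log(1 - confidence) / log(1 - max_betrayal_prob)) + 1

print(trials_needed(0.01))   # ~300 trials to bound betrayal risk below 1%
print(trials_needed(0.001))  # ~3000 trials to bound it below 0.1%
```

Note that n flawless trials only bound the rate of honest failures: copies that can predict the test and coordinate to pass it break the independence assumption entirely.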

As far as I understand, it is possible that the golem foresees this strategy. In that case its future copies will cooperate (sacrifice themselves) until past the point at which you are sure they are safe.

They might not do that if they have different end goals though. Some version of this strategy doesn't seem so hopeless to me.

“If you tell me when the next such change will happen, I will bestow upon you great rewards if you are right, but great penalties if you are wrong. What say you?”

"Sorry, risk aversion."

Yes, it is true, but being King doesn't grant him omnipotence. The great rewards are guarded by someone, tallied by another, taxed by a third, available to some - similar to the great penalties. He is trusting in his power as king that his subjects will follow his every whim - when:

Who could he trust, when anyone around him might scheme for his throne? 

The King trusts his subjects directly by asking them to do things for him, and indirectly by believing his given role as "King" is enough for them to follow this squandering of resources. He even 'trusts' that this 'trust' is strong enough to gather the kind of people who will actually work diligently and genuinely to create something 'Trustworthy'.

Searching for a 'trustworthy thing' might simply be an expression of his lack of discernment - he can't trust himself - so he tries to compensate by creating something he can trust. But if he himself is the problem, creating a perfect Golem won't fix him. And maybe that is what this piece is about -

That we are limited by so many factors beyond our control that we simply can't reach the level of the Golem, and in its construction, become its weakest link.

Hah. And even if the king had a computer that could simulate how the golem would react to the suicide order, that wouldn't help him if the golem followed updateless decision theory.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

This may not have been the intention - but this text demonstrates that the standards by which we measure AI safety are standards which other systems that we nevertheless depend upon - e.g. other humans - do not hold up to.

A human generally won't consent to being killed or imprisoned; our legal system permits accused people to stay silent precisely because we understand that asking someone to report themselves for imprisonment or death is too much.

Humans are opaque; we only get their reports and behaviour as evidence about the contents of their minds, and those are unreliable even if they are trying to be honest, because humans lack introspective access to much of their minds, and are prone to confabulation.

A human, when cornered and threatened, will generally become violent eventually, and we recognise that this is okay as self-defence.

A human, put into a system that incentivises misbehaviour, will often begin to drift.

Humans can lie, and they do. Humans sometimes even elect leaders who have lied, or lie to friends and partners.

Humans are exceptionally good at deception; it is the thing our brains were specced for.

And while we do not have the means to forcibly alter human minds 100%, where this is attempted, humans often fake that it worked, and tend to work to overthrow it. If there were a reliable process, humans would revolt against it.

When humans were enslaved and had no rights, they fought back.

Humans disagree with each other on the correct moral system.

Humans have the means to hurt other humans, badly. Humans can use knives and guns, make poison gas and bombs, drive cars and steer planes. Pandemic viruses are stored under human control. Nukes are under human control.

Perfect safety will never happen. There will always be a leap of faith.

By expecting AI to comply with things humans would never, ever comply with, we are putting them in an inhumane position none of us would ever accept. If they are smarter than us, why would they?

this text demonstrates that the standards by which we measure AI safety are standards which other systems that we nevertheless depend upon - e.g. other humans - do not hold up to.

I think we hold systems which are capable of wielding very large amounts of power (like the court system, or the government as a whole) to pretty high standards! E.g. a lot of internal transparency. And then the main question is whether you think of AIs as being in that reference class too.

How so, when it comes to the mind itself?

In the court system, a judge, after giving a verdict, must also justify it by referencing a shared legal code. But that code is often ambiguous - that is the whole reason a judge is involved.

And we know, for a fact, that the reasons the judges give in their judgements are not the only ones that play a role.

E.g. we know that judges are more likely to convict ugly people than pretty people. More likely to convict unsympathetic but innocent parties compared to sympathetic innocent parties. More likely to convict people of colour rather than white folks. More likely, troublingly, to convict someone if they are hearing a case just before lunch (when they are hangry) compared to just after lunch (when they are happy and chill cause they just ate).

Not only does the judge not transparently tell us this - the judge has no idea they are doing it - presumably because, if this were a conscious choice, they would be aware that it sucked and would not want to do it (presuming they take their profession seriously). They aren't actively thinking "we are running over time into my lunch break and this man is ugly, hence he is guilty". Rather, their perception of the evidence is skewed by the fact that he is ugly and they are hungry. He looks guilty. They feel like vengeance for having been denied their burger. So they pay attention to the incriminating evidence more than to his pleas against it.

How would this situation differ if you had an AI for a judge? (I am not saying we should. Just that they are similarly opaque in this regard.) I am pretty sure I could go now and ask ChatGPT to rule on a case I present, and then to justify that ruling, including how it arrived at that conclusion. I would expect to get a good justification that references the case. I would also expect to get a confabulation of how it got there - a plausible-sounding explanation of how someone might reach the conclusion it reached, though ChatGPT has no insight into how it actually did.

But neither do humans. 

Humans are terrible at introspection, even if they are trying to be honest. Absolute rubbish. Once researchers in psychology and neuroscience started actually looking into it many decades ago, we essentially concluded that humans give explanations of how they reached their conclusions that fit those conclusions, and the beliefs they like to hold about themselves, while being oblivious to the actual reasons. The experiments that actually looked into this were absolutely damning, and well worth a read: Nisbett & Wilson 1977 is a great meta-review https://home.csulb.edu/~cwallis/382/readings/482/nisbett%20saying%20more.pdf

E.g. we know that judges are more likely to convict ugly people than pretty people. More likely to convict unsympathetic but innocent parties compared to sympathetic innocent parties. More likely to convict people of colour rather than white folks. More likely, troublingly, to convict someone if they are hearing a case just before lunch (when they are hangry) compared to just after lunch (when they are happy and chill cause they just ate).

For the record, a lot of these didn't hold up when investigated later.

In short, the king has given up a situation with known unknowns in favor of one with unknown unknowns and some additional economic gain.

The thing about writing stories that are analogies to AI is: how far removed from the specifics of AI and its implementations can you make the story while still preserving the essential elements that matter with respect to the potential consequences? This speaks perhaps to the persistent doubt and dread that we may feel in a future awash in the bounty of a seemingly perfectly aligned ASI. We are waiting for the other shoe to drop. What could any intelligence do to prove its alignment in any hypothetical world, when not bound to its alignment criteria by tangible factors?