khafra

khafra

This is the most optimistic believable scenario I've seen in quite a while!

khafra

And yet it behaves remarkably sensibly. Train a one-layer transformer on 80% of possible addition-mod-59 problems, and it learns one of two modular addition algorithms, which perform correctly on the remaining validation set. It's not a priori obvious that it would work that way! There are other possible functions on ℤ/59ℤ compatible with the training data.

Seems like Simplicia is missing the worrisome part--it's not that the AI will learn a more complex algorithm which is still compatible with the training data; it's that the simplest several algorithms compatible with the training data will kill all humans OOD.
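The "compatible with the training data" point can be made concrete with a toy sketch (my construction, purely illustrative): a lookup table that memorizes the same 80% training split of addition mod 59 fits the training data exactly as well as the true algorithm, yet behaves arbitrarily on the held-out pairs.

```python
import random

MOD = 59

def true_add(a, b):
    """The intended function: addition mod 59."""
    return (a + b) % MOD

# Split all 59*59 input pairs into 80% train / 20% validation.
pairs = [(a, b) for a in range(MOD) for b in range(MOD)]
random.seed(0)
random.shuffle(pairs)
split = int(0.8 * len(pairs))
train, valid = pairs[:split], pairs[split:]

# A memorizer that stores the training set but answers arbitrarily
# elsewhere: also perfectly compatible with the training data.
memorizer = {p: true_add(*p) for p in train}

def memorized_add(a, b):
    return memorizer.get((a, b), 0)  # arbitrary answer off-distribution

train_agree = all(true_add(*p) == memorized_add(*p) for p in train)
valid_agree = sum(true_add(*p) == memorized_add(*p) for p in valid)
print(train_agree)                   # both functions fit the training data
print(valid_agree, "/", len(valid))  # but the memorizer is mostly wrong OOD
```

The transformer reliably picks the generalizing function rather than anything memorizer-shaped; the worry above is about what the analogous "simplest compatible functions" look like for much bigger models.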

khafra

AFAICT, in the Highwayman example, if the would-be robber presents his ultimatum as "give me half your silk or I burn it all," the merchant should burn it all, same as if the robber says "give me 1% of your silk or I burn it all." 
But a slightly more sophisticated highwayman might say "this is a dangerous stretch of desert, and there are many dangerous, desperate people in those dunes. I have some influence with most of the groups in the next 20 miles. For x% of your silk, I will make sure you are unmolested for that portion of your travel." 
Then the merchant actually has to assign probabilities to a bunch of events, calculate Shapley values, and roll some dice for his mixed strategy. 
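For the dice-rolling step, one standard construction (a sketch with made-up numbers, not anything from the post) is to accede with a probability that caps the highwayman's expected take at the fair value of the protection actually offered, so that escalating the demand gains him nothing:

```python
# Hedged toy model: the merchant pays a demanded fraction of his silk
# with probability q, chosen so the extortionist's expected take never
# exceeds an assumed "fair" price for the safe-passage service.
def accede_probability(demand_fraction, fair_fraction):
    """Probability of paying, capped so E[take] == fair value."""
    if demand_fraction <= fair_fraction:
        return 1.0
    return fair_fraction / demand_fraction

for demand in [0.01, 0.10, 0.50]:
    q = accede_probability(demand, fair_fraction=0.10)
    print(f"demand {demand:.0%}: pay with prob {q:.2f}, "
          f"expected take {q * demand:.2%}")
```

Under this policy a 1% demand is simply paid, while a 50% demand is refused four times out of five, leaving the expected take pinned at the assumed fair 10% either way.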

khafra

Tangentially to Tanagrabeast's "least you can do" suggestion, as a case report: I came out to my family as an AI x-risk worrier over a decade ago, when one could still do so in a fairly lighthearted way. They didn't immediately start donating to MIRI and calling their senators to request an AI-safety Manhattan Project, but they did agree with the arguments I presented, and check in with me, on occasion, about how the timelines and probabilities are looking. 

I have had two new employers since then, and a few groups of friends; and with each, when the conversation turns to AI (as it often does, over the last half-decade), I mention my belief that it's likely going to kill us all, and expand on Instrumental Convergence, RAAP, and/or "x-risk, from Erewhon, to IJ Good, to the Extropians," depending on which aspect people seem interested in. I've been surprised by the utter lack of dismissal and mockery, so far!

khafra

See also Steven Kaas's aphorisms on Twitter:

> First Commandment of the Church of Tautology: Live next to thy neighbor  
And  
> "Whatever will be will be" is only the first secret of the tautomancers.
 

khafra

The story I read about why neighbor polling is supposed to correct for bias in specifically the last few presidential elections is that some people plan to vote for Trump, but are ashamed of this, and don't want to admit it to people who aren't verified Trump supporters. So if you ask them who they plan to vote for, they'll dissemble. But if you ask them who their neighbors are voting for, that gives them permission to share their true opinion non-attributively. 
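That mechanism is easy to simulate (all parameters below are assumed for illustration, not from any real poll): shy supporters depress the answer to the direct question but not to the neighbor question.

```python
import random

# Toy "shy voter" simulation: a fraction of candidate T's supporters
# won't admit their own vote, but report their neighborhood honestly.
random.seed(0)
N = 100_000
true_support = 0.52   # assumed true share for candidate T
shy_fraction = 0.10   # assumed share of T supporters who dissemble

direct_yes = neighbor_yes = 0
for _ in range(N):
    supports_t = random.random() < true_support
    if supports_t and random.random() >= shy_fraction:
        direct_yes += 1  # only non-shy supporters admit it
    # Neighbor question: a random neighbor drawn from the true
    # distribution, reported with no shame attached.
    if random.random() < true_support:
        neighbor_yes += 1

print(f"true support:      {true_support:.1%}")
print(f"direct question:   {direct_yes / N:.1%}")   # biased low
print(f"neighbor question: {neighbor_yes / N:.1%}") # near the truth
```

With these numbers the direct question reads about 47% while the neighbor question recovers roughly the true 52% — which is the whole pitch for neighbor polling, granting the (contestable) premise that people report their neighbors accurately.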

khafra

In the late '80s, I was homeschooled, and studied calligraphy (as well as cursive); but I considered that more of a hobby than preparation for entering the workforce of 1000 years ago. 

I also learned a bit about DOS and BASIC, after being impressed with the fractal-generating program that the carpenter working on our house wrote, and demonstrated on our computer. 

khafra

Your definition seems like it fits the Emperor of China example--by reputation, they had few competitors for being the most willing and able to pessimize another agent's utility function; e.g. 9 Familial Exterminations. 
And that seems to be a key to understanding this type of power, because if they were able to pessimize all other agents' utility functions, that would just be an evil mirror of bargaining power. Being able to choose a sharply limited number of unfortunate agents, and punish them severely pour encourager les autres, seems like it might just stop working when the average agent is smart enough to implicitly coordinate around a shared understanding of payoff matrices.
So I think I might have arrived back to the "all dominance hierarchies will be populated solely by scheming viziers" conclusion.

khafra

Clarifying question: If A>B on the dominance hierarchy, that doesn't seem to mean that A can always just take all B's stuff, per the Emperor of China example. It also doesn't mean that A can trust B to act faithfully as A's agent, per the cowpox example. 

If all that dominance hierarchies control is who has to signal submission to whom, dominance seems only marginally useful for defense, law, taxes, and public expenditure; it's mostly a way of reducing friction toward the outcome that would have happened anyway.

It seems like, with intelligence too cheap to meter, any dominance hierarchy that doesn't line up well with the bargaining power hierarchy or the getting-what-you-want vector space is going to be populated with nothing but scheming viziers. 

But that seems like a silly conclusion, so I think I'm missing something about dominance hierarchies.

khafra

Note also that there are several free parameters in this example. E.g., I just moved to Germany, and now have wimpy German burners on my stove. If I put on a large container with 6 L or more of water, and I do not cover it, the water will never go beyond bubble formation into a light simmer, let alone a rolling boil. If I cover the container at this steady state, it reaches a rolling boil in about another 90 seconds. 
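A back-of-envelope heat balance (all numbers assumed, not measured) is at least consistent with that 90-second figure: if the uncovered steady state sits around 95 °C and the lid stops most of the evaporative loss, a net ~1.4 kW reaching the water closes the remaining temperature gap in about a minute and a half.

```python
# Rough sanity check with assumed values: time for a covered pot to go
# from an uncovered steady state (~95 C) to a rolling boil (100 C).
mass_kg = 6.0            # water in the pot
c_water = 4186.0         # J/(kg*K), specific heat of water
delta_t = 100.0 - 95.0   # K, assumed gap between steady state and boiling
net_power_w = 1400.0     # W, assumed power actually reaching the water

energy_j = mass_kg * c_water * delta_t
seconds = energy_j / net_power_w
print(f"{seconds:.0f} s")  # on the order of 90 s
```

The assumed steady-state temperature and net burner power are the two free parameters; nudging either one moves the estimate, which is rather the point of the comment.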
 
