Comments

N1X · 18d · 10

Humans often really really want something in the world to happen

This sentence is adjacent to my core concern regarding AI alignment, and why I'm not particularly reassured by the difficulty-of-superhuman-performance or return-on-compute arguments regarding AGI: we don't need superhuman AI to deal superhuman-seeming amounts of damage. Indeed, even today's "perfectly-sandboxed" models (sandboxed in the sense that, according to the most reliable publicly-available information, none of the most cutting-edge models are allowed direct read/write access to the systems which would let them plot and attain world domination, or the destruction of humanity or of specific nations' interests) have the next-best thing: whenever a new technological lever emerges in the world, humans with malicious intentions are empowered to a much greater degree than those who want strictly the best[1] for everybody. There are also bit-flip attacks on aligned AI which are much harder to implement on humans.

  1. ^

    Using "best" is fraught, but we'll pretend that "the world best-aligned with a Pareto-optimal combination of each person's expressed reflective preferences and revealed preferences, to the extent that those revealed preferences do not represent akrasia, or views and preferences which the person isn't comfortable expressing directly and publicly but does indeed hold" is an adequate proxy to continue along this line of argument; the other option is developing a provably-correct theory of morality and politics, which would take more time than this comment by 2-4 orders of magnitude.

N1X · 18d · 10

“Imagine a square circle, and now answer the following questions about it…”.

Just use the Chebyshev (aka maximum or $L^\infty$) metric.
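To make the joke concrete: under the Chebyshev metric, the "circle" of radius 1 around the origin is exactly the axis-aligned square with corners at (±1, ±1), so a "square circle" exists. A minimal sketch (the specific points are illustrative):

```python
def chebyshev(p, q):
    """Chebyshev (L-infinity) distance: the largest coordinate-wise gap."""
    return max(abs(a - b) for a, b in zip(p, q))

# Both a corner and an edge midpoint of the unit square sit at
# Chebyshev distance 1 from the origin, i.e. on the unit "circle".
print(chebyshev((0, 0), (1, 1)))    # corner
print(chebyshev((0, 0), (1, 0.5)))  # edge point
```

Both calls print 1, confirming that the whole boundary of the square is equidistant from the center under this metric.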

N1X · 2mo · 10

I think a somewhat-more-elegant toy model might look something like the following: Alice's object-level preferences are $u_A$, and Beth's are $u_B$. Alice's all-things-considered preferences are $v_A = u_A + \lambda_A \tilde{u}_B$, and Beth's are $v_B = u_B + \lambda_B \tilde{u}_A$. Here, $\tilde{u}_A$ & $\tilde{u}_B$ represent Beth's current beliefs about Alice's desires and vice-versa, and the $\lambda$ parameters represent how much Alice cares about Beth's object-level desires and vice-versa. The latter could arise from admiration of the other person, fear of pissing them off, or various other considerations discussed in the next post.

I think that the most general model would be $v_A = u_A + \lambda_A(t)\,\tilde{u}_B$ and $v_B = u_B + \lambda_B(t)\,\tilde{u}_A$, where the $\lambda$ weights are time-dependent (or past-interaction-dependent, or status-dependent; these are all fundamentally the same thing). It's a notable feature that this model does not assume that these functions are monotonic! I suspect that most people are willing to compromise somewhat on their preferences but become less willing to do so when the other party begins to resemble a utility monster, and a time-varying weight notably covers the situation where it is negative under some but not all circumstances.
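A minimal sketch of this non-monotonic-weight idea (the function names, the specific schedule, and the numeric thresholds are all illustrative assumptions, not from either comment):

```python
def all_things_considered(u_own, u_other_believed, weight):
    """v = u + lambda * (believed object-level preferences of the other party)."""
    return u_own + weight * u_other_believed

def weight_schedule(perceived_imposition):
    """Illustrative non-monotonic lambda: care about the other party's
    desires at ordinary levels of imposition, discount them as the other
    party starts to resemble a utility monster, and flip negative beyond
    that (the 'active spite' regime the comment alludes to)."""
    if perceived_imposition < 1.0:
        return 0.5                                        # ordinary compromise
    elif perceived_imposition < 2.0:
        return 0.5 - 0.5 * (perceived_imposition - 1.0)   # fading goodwill
    else:
        return -0.25                                      # negative weight

# A mildly imposing counterpart still gets positive weight...
print(all_things_considered(1.0, 2.0, weight_schedule(0.5)))  # 2.0
# ...while a utility monster's desires count against themselves.
print(all_things_considered(1.0, 2.0, weight_schedule(3.0)))  # 0.5
```

The point of the sketch is only that a single time- or interaction-dependent weight function suffices to express both cooperation and spite, with no monotonicity assumption.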

N1X · 2mo · 10

trains but not dinosaurs

Did you get this combo from this video, or is this convergent evolution?
 

N1X · 2mo · 30

This argument is in no small part covered in

https://worrydream.com/refs/Lockhart_2002_-_A_Mathematician's_Lament.pdf

which is also available as a book with five times the page count and a $10 price tag.

Then you should pay them 10 years of generous salary to produce a curriculum and write model textbooks. You need both of that. (If you let someone else write the textbook, the priors say that the textbook will probably suck, and then everyone will blame the curriculum authors. And you, for organizing this whole mess.) They should probably also write model tests.

The problem undergirding the problem you're talking about is not just that nobody's decided to "put the smart people who know math and can teach effectively in a room and let them write the curriculum." As a matter of fact, both New Math and the Common Core involved people meeting at least all but point 2 of your criteria, and the premise that elementary school teachers are best qualified to undertake this project is a flawed one (if it's a necessity, then Lockhart may be the most famous exemplar adjacent to your goals, and reading his essay or book should take priority over trying to theorize or recruit other similar specialists).

N1X · 2mo · 20

The negative examples are the things that fail to exist because there aren't enough people with that overlap of skills. The Martian for automotive repair might exist, but I haven't heard of it. 

Zen and the Art of Motorcycle Maintenance?

N1X · 2mo · 10

Why "selection" could be a capacity which would generalize: albeit to a (highly-lossy) first approximation, most of the most successful models have been based on increasingly-general types of gamification of tasks. The more general models have more general tasks. Video can capture sufficient information to describe almost any action which humans do or would wish to take along with numerous phenomena which are impossible to directly experience in low-dimensional physical space, so if you can simulate a video, you can operate or orchestrate reality.
Why selection couldn't generalize: I can watch someone skiing, but that doesn't mean that I can ski. I can watch a speedrun of a video game and, even though the key presses are clearly visible, fail to replicate it. I could also hack together a fake speedrun. I suspect that Sora will be more useful for more-convincingly-faking speedrun content than for actually beating human players or becoming the TAS tool to end all TAS tools (aside from novel glitch discovery). This is primarily because there's not a strong reason to believe that the model can be trained to achieve extremely high-fidelity or high-precision tasks.

N1X · 3mo · 30

One way to identify counterfactually-excellent researchers would be to compare the magnitude of their "greatest achievement" against their secondary discoveries: the credit that parade leaders get is often useful for propagating their future success, and the people who do more with that boost are the ones who should be given extra credit for originality (their idea) as opposed to novelty (their idea first). Newton and Leibniz both had remarkably successful and diverse achievements, which suggests that they were relatively high in counterfactual impact in most (if not all) of those fields.

Another approach would consider how many people or approaches had tried and failed to solve a problem: crediting the zeitgeist rather than Newton and/or Leibniz specifically seems to miss a critical question, namely, if neither of them had solved it, would it have taken an additional year, or more like 10 to 50? In their case, we have a proxy for an answer: ideas took months or years to spread at all beyond the "centers of discovery" of the time, and so although the two of them clearly took only a few months or years to compete for the prize of being first (and a few decades to argue over it), we can relatively safely conjecture that whichever anonymous contender was third in the running was behind on at least that timescale. Contrast this with Andrew Wiles, whose proof of Fermat's Last Theorem was efficiently and immediately published (and patched as needed).

This is also important because other, and in particular later, luminaries of the field (e.g. Mengoli, Mercator, various Bernoullis, Euler, etc.) might not have had the vocabulary necessary to make as many discoveries as quickly as they did, or to communicate those discoveries as effectively, if not for Newton & Leibniz's timely contributions.

N1X · 8mo · 20

Right, and the correct value is 37/72, not 19/36, because exactly half of the remaining 70/72 players lose (in the limit).
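The arithmetic can be checked with exact fractions (the assumption that the other 2/72 of players lose outright is my reading of the setup, not stated in the comment):

```python
from fractions import Fraction

# 2/72 of players lose outright; of the remaining 70/72,
# exactly half lose in the limit.
p_lose = Fraction(2, 72) + Fraction(70, 72) / 2

print(p_lose)                      # 37/72
print(p_lose - Fraction(19, 36))   # -1/72, the size of the correction
```

Using `Fraction` rather than floats keeps the comparison against 19/36 exact instead of approximate.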

N1X · 8mo · 84

I think that this post relies too heavily on a false binary. Specifically, the description of all arguments as "good faith" or "bad faith" completely ignores the (to my intuition, far likelier) possibility that most arguments begin primarily in good faith (my guess is 90% or so, but maybe I just tend not to hold arguments with people below 70%), and then people adjust according to their perception of their interlocutor(s), their audience (if applicable), and the importance of the issue being argued. Common signals of arguments in particularly bad faith advanced by otherwise intelligent people include persistent reliance on commonly-listed logical fallacies (keeping a given point after its fallacious nature is clearly explained and that criticism dismissed; sealioning and whataboutism are prototypical exemplars), moving the goalposts, and ignoring all contradictory evidence.

Another false binary: this also ignores the possibility of both sides in an argument being correct (or founded on correct factual data). For example, today I spent perhaps 10 minutes arguing over whether a cheese was labeled as Gouda or not, because I'd read the manufacturer's label, which did not contain that word but did say "Goat cheese of Holland," while my interlocutor read the price label from Costco, which called it "goat Gouda." I'm marginally more correct because I recognized the contradiction in terms (in the EU, Gouda can only be made from milk produced by Dutch cows), but neither of us was lying or arguing in bad faith, and yet I briefly struggled to believe that we were inhabiting the same reality and remembering it correctly. They were a very entertaining 10 minutes, but I wouldn't want to have that kind of groundless argument more than once or twice a week, and that limit assumes a discussion of trivial topics as opposed to an ostensibly sincere debate on something which I hold dear.
