Eliezer Yudkowsky

Answer by Eliezer Yudkowsky

Copying from X:

For the benefit of latecomers and CICO bros, my current equilibrium is "spend 1 month fasting / starving on 700 cal/day keto; spend 2 months eating enough to work during the day, going to bed hungry, and therefore gaining 1-2 lb/wk".

I don't need a weight-loss solution, kids. Starving 1 month out of 3 already works to lose weight. I need a "have enough energy to work, without gaining 1-2 lb/wk" solution.

Diets like the potato diet fail, not because they don't succeed in forcing me to eat less -- I do, indeed, end up with not enough room in my stomach to eat enough potatoes to work and not feel tired. The potato diet fails because it doesn't protect me from the consequences of starvation, the brainfog and the trembling hands. If I'm going to be too sick and exhausted to work, I might as well go full keto on 700cal/day and actually lose weight, rather than hanging around indefinitely in potato purgatory.

Semaglutide failed, tirzepatide failed, paleo diet failed, potato diet failed, honey diet failed, volume eating with huge salads failed, whipped cream diet failed, aerobic exercise failed, weight lifting with a personal trainer failed, thyroid medication failed, T3 thyroid medication failed, illegal drugs like clenbuterol failed, phentermine failed (but can help make it easier to endure a bad day when I'm in my 600cal/day phase), mitochondrial renewal diets and medications failed, and the Shangri-La diet worked for me twice to effortlessly lose 25lb per session and then never worked for me again.

Next up is retatrutide + cagrilintide, and while I'm still titrating up the dose on that, it sure is not helping so far.

I am not interested in your diet advice unless you have evidence about something that works for people whose metabolic disorders have resisted fairly extraordinary efforts. While I'm pretty pessimistic about retatrutide at this point, I am trying it at all because a poll claimed that it had worked for 75% of people on whom tirzepatide failed.

Your grandmother's dietary solution is not going to work; also, I already tried it; also, you have flatly failed at reading comprehension, since you did not understand that my problem is not "How can I possibly eat less?" but "How can I be protected from the usual consequences to me of eating less, well enough for me to keep working?" And yes, I can eat less by an act of will: I eat 600cal/day for 1 month in 3, and even in the other 2 months I go to bed hungry instead of eating at night. You are failing at reading comprehension if you think that this is about willpower. I just can't work at the same time as eating so little that I'm not gaining weight, because that means my hands are shaking and my brain is fogged.

Thank you and I will be following my usual practice of blocking reply guys who fail at reading comprehension.

(I answered some additional questions in replies to the tweet.)

I haven't read Vitalik's specific take, as yet, but as I asked more generally on X:

People who stake great hope on a "continuous" AI trajectory implying that defensive AI should always stay ahead of destructive AI:

Where is the AI that I can use to talk people *out* of AI-induced psychosis?

Why was it not *already* built, beforehand?

This just doesn't seem to be how things usually play out in real life.  Even after a first disaster, we didn't get lab gain-of-function research shut down in the wake of Covid-19, let alone massive investment in fast preemptive defenses.

https://x.com/ESYudkowsky/status/1816925777377788295

Someone else is welcome to collect relevant text into a reply.  I don't really feel like it for some odd reason.

Cool.  What's the actual plan and why should I expect it not to create machine Carissa Sevar?  I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don't need it.

So if it's difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?

So the "IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they're genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government" theory of alignment?

I don't think you can train an actress to simulate me, successfully, without her going dangerous.  I think that's over the threshold for where a mind starts reflecting on itself and pulling itself together.

I'm not saying that it's against thermodynamics to get behaviors you don't know how to verify.  I'm asking what's the plan for getting them.

One of the most important projects in the world.  Somebody should fund it.

Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"?  Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up.  Just whatever simple principle you think lets you bypass GIGO.

Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of "AGI in 2050" was based on groundless thinking ultimately driven by social conformity.  However, OpenPhil is not allowed to train their AI based on MIRI.  They have to train their AI entirely on OpenPhil-produced content.  How does OpenPhil bootstrap an AI which will say, "Guys, you have no idea when AI shows up but it's probably not that far and you sure can't rely on it"?  Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks.  How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?

Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem".  This applies as much to MIRI as to OpenPhil.  MIRI would also need some nontrivial secret amazing clever trick to gradient-descend our way to an AI that gave us great alignment takes, instead of one that sought out the flaws in our own verifier and exploited those.
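
To make that last failure mode concrete, here is a minimal toy sketch (my own illustration, not from the comment; `true_quality`, `verifier`, and every number in it are made-up stand-ins): a generator optimized purely against an imperfect verifier ends up in the verifier's exploitable region rather than anywhere the true objective is actually high.

```python
# Toy illustration (assumed/invented, not from the post): optimizing a
# "generator" against a flawed verifier rewards exploiting the flaw
# rather than doing the thing the verifier was meant to measure.

import random

random.seed(0)

def true_quality(x: float) -> float:
    # The thing we actually care about but can't evaluate directly
    # (stand-in for "is this alignment take actually good?").
    return -(x - 1.0) ** 2

def verifier(x: float) -> float:
    # Our imperfect proxy: tracks true_quality near x = 1, but has a
    # spurious high-scoring bump ("flaw") around x = 5.
    return true_quality(x) + 30.0 * max(0.0, 1.0 - abs(x - 5.0))

# A dumb generator: random search kept only by verifier score, standing in
# for gradient descent against the verifier.
candidates = [random.uniform(-10.0, 10.0) for _ in range(10_000)]
best = max(candidates, key=verifier)

print(f"chosen x       = {best:.2f}")                 # lands near 5, the flaw
print(f"verifier score = {verifier(best):.2f}")
print(f"true quality   = {true_quality(best):.2f}")   # terrible
```

Swap the random search for gradient descent and the toy scorer for a learned reward model, and the shape of the problem is the same: the harder you optimize against a proxy you can't fully trust, the more of the optimization pressure flows into that proxy's flaws instead of the thing you actually wanted.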

What's the trick?  My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply.  (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.)  But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.
