What if Didactic Fiction is the Answer to Aligning Large Language Models?
This is a linkshortform for https://bittertruths.substack.com/p/how-to-train-your-shoggoth-part-1
Epistemic status: this sounds very much like an embarrassingly dumb take, but I haven't been able to convince myself that it is false, and indeed am somewhat convinced that it would at least help to align large language models (and other models that meaningfully incorporate them). So I'm writing at least partly to see if anyone has any fundamental objectio...
Very cool post! We need a theory of valence that is grounded in real neuroscience, since understanding valence is pretty much required for any alignment agenda that works the first time.
I have read the sequences. Not all of them, because, who has time.
Here is a video of me reading the sequences (both Eliezer's and my own):
https://bittertruths.substack.com/p/semi-adequate-equilibria
Well what if he bets a significant amount of money at 2000:1 odds that the Pope will officially add his space Bible to the real Bible as a third Testament after the New Testament within the span of a year?
What if he records a video of himself doing Bible study? What if he offers to pay people their current hourly rate to watch him do Bible study?
I guess the thrust of my questions here is, at what point do you feel that you become the dick for NOT helping him publish his own space Bible? At what point are you actively impeding new religious discoveries by...
Recorded a sort of video lecture here: https://open.substack.com/pub/bittertruths/p/semi-adequate-equilibria
Agree that it should be time-based rather than karma-based.
I'm currently on a very heavy rate limit that I think is being manually adjusted by the LessWrong team.
Is this clear enough:
I posit that the reason that humans are able to solve any coordination problems at all is that evolution has shaped us into game players that apply something vaguely like a tit-for-tat strategy meant to enforce convergence to a nearby Schelling Point / Nash Equilibrium, and to punish defectors from this Schelling Point / Nash Equilibrium. I invoke a novel mathematical formalization of Kant's Categorical Imperative as a potential basis for coordination towards a globally computable Schelling Point. I believe that this constitutes a prom...
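To make the toy version of that claim concrete, here is a minimal Python sketch of the tit-for-tat story: agents either stick to a shared Schelling point or defect, and a persistent defector gets held near the punishment payoff. The payoff numbers and strategy names are purely illustrative assumptions on my part, not anything derived from the ethicophysics itself.

```python
# Minimal sketch (my own illustration, not a formal result): agents repeatedly
# choose between a shared Schelling point ("S") and defection ("D"). Each agent
# sticks to S unless its opponent defected last round, in which case it
# punishes -- a tit-for-tat-style enforcement of the convention.
COOPERATE, DEFECT = "S", "D"

def payoff(a, b):
    # Hypothetical payoffs: mutual coordination beats being exploited,
    # and mutual punishment is costly, so persistent defection does not pay.
    table = {("S", "S"): 3, ("S", "D"): 0, ("D", "S"): 4, ("D", "D"): 1}
    return table[(a, b)]

def tit_for_tat(opponent_history):
    return COOPERATE if not opponent_history or opponent_history[-1] == COOPERATE else DEFECT

def always_defect(opponent_history):
    return DEFECT

def play(strategy_a, strategy_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(hist_b), strategy_b(hist_a)
        score_a += payoff(move_a, move_b)
        score_b += payoff(move_b, move_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # both stay at the Schelling point (300, 300)
print(play(tit_for_tat, always_defect))  # the defector is held near the punishment payoff
```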
My understanding of my current situation is that I am not in fact rate-limited purely by automatic processes, but rather by some sort of policy decision on the part of LessWrong's moderators.
Which is fine, I'll just continue to post my alignment research on my substack, and occasionally dump linkposts to them in my shortform, which the mods have allowed me continued access to.
https://github.com/epurdy/ethicophysics/blob/main/writeup1.pdf
Perhaps it is about right, then.
Yes, this is a valid and correct point. The observed and theoretical Nash Equilibrium of the Wittgensteinian language game of maintaining consensus reality is indeed not to engage with cranks who have not Put In The Work in a way that is visible and hard-to-forge.
Thank you for this answer. I agree that I have not visibly been putting in the work to make falsifiable predictions relevant to the ethicophysics. These can indeed be made in the ethicophysics, but they're less predictions and more "self-fulfilling prophecies" that have the effect of compelling the reader to comply with a request to the extent that they take the request seriously. Which, in plain language, is some combination of wagers, promises, and threats.
And it seems impolite to threaten people just to get them to read a PDF.
And I also have the courage to apply to Y Combinator to start either a 501c3 or a for-profit company to actually perform this trial through legal, official channels. Do you think that I will be denied entry into their program with such a noble goal and the collaboration of a domain expert?
I have the courage to commit an act of civil disobedience in which I ask people caring for Alzheimer's patients to request a Zoloft and/or Trazodone prescription for their loved ones, and then track the results.
Do you think I lack the persistence and capital to organize something of that nature? Why or why not?
Well then, I submit that courage is a virtue, when tempered with the wisdom not to pick fights you do not plan to finish.
This comment continues to annoy me. I composed a whole irrational response in my mind where I would make credible threats to burn significant parts of the capabilities commons every time someone called me delusional on LessWrong.
But that's probably not a reasonable way to live my life, so this response is not that response.
I get that history is written by the victors. I get that what is accepted by consensus reality is dictated by the existing power structures. The fact that you would presume to explain these things to the author of Ethicophysics I and Eth...
But it uses the tools of physics, so the math would best be checked by someone who understands Lagrangian mechanics at a professional level.
Yes, it is a specification of a set of temporally adjacent computable Schelling Points. It thus constitutes a trajectory through the space of moral possibilities that can be used by agents to coordinate and punish defectors from a globally consistent morality whose only moral stipulations are such reasonable-sounding statements as "actions have consequences" and "act more like Jesus and less like Hitler".
And I'm happy to code up the smartphone app and run the clinical trial from my own funds. My uncle is starting to have memory trouble, I believe.
Oh, come on. If the rationality community disapproved of Einstein predicting the perihelion precession of Mercury, that's an L for the rationality community, not for Einstein.
I have offered to say why I believe it to be true, as soon as I can get clearance from my company to publish capabilities-relevant theoretical neuroscience work.
That's fair, and I need to do a better job of building on-ramps for different readers. My most recent shortform is an attempt to build such an on-ramp for the LessWrong memeplex.
That's fair (strong up/agree vote).
If you consult my recent shortform, I lay out a more measured, skeptical description of the project. Basically, ethicophysics constitutes a globally computable Schelling Point, such that it can be used as a protocol between different RL agents that believe in "oughts" to achieve Pareto-optimal outcomes. As long as the largest coalition agrees to prefer Jesus to Hitler, I think (and I need to do far more to back this up) defectors can be effectively reined in, the same way that Bitcoin works because the majority of the computers hooked up to it don't want to destroy faith in the Bitcoin protocol.
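As a toy illustration of the coalition point (with made-up numbers on my part, not anything from the formal writeup): once the fraction of agents willing to punish defection passes a fairly low threshold, defecting from the norm stops paying in expectation.

```python
# Toy illustration with invented payoffs: an agent deciding whether to defect
# from the norm held by the largest coalition. If a fraction p of agents punish
# defection, the expected gain from defecting falls below compliance once p is
# large enough -- loosely analogous to Bitcoin's honest-majority assumption.
def expected_defection_payoff(p_punish, gain=2.0, punishment=5.0):
    return gain - p_punish * punishment

def expected_compliance_payoff():
    return 1.0  # baseline payoff for staying at the Schelling point

for p in (0.1, 0.2, 0.5, 0.9):
    defect = expected_defection_payoff(p)
    comply = expected_compliance_payoff()
    print(f"p_punish={p:.1f}: defect={defect:+.1f}, comply={comply:+.1f}, "
          f"defection pays={'yes' if defect > comply else 'no'}")
```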
In this post, I will try to lay out my theories of computational ethics in as simple, skeptic-friendly, non-pompous language as I am able to do. Hopefully this will be sufficient to help skeptical readers engage with my work.
The ethicophysics is a set of computable algorithms that suggest (but do not require) specific decisions at ethical choice points in a multi-player reinforcement learning problem.
The design goal that the various equations need to satisfy is that they should select a...
They really suck. The old paradigm of Alzheimer's research is very weak and, as I understand it, no drug has an effect size sufficient to offset even a minimal side effect profile, to the point where I think only one real drug has been approved by the FDA in the old paradigm, and that approval was super controversial. That's my understanding, anyway. I welcome correction from anyone who knows better.
So maybe we should define the effect size in terms of cognitive QALYs? Say, an effective treatment should at least halve the rate of decline of the experiment...
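Concretely, a rough sketch of what that "halve the rate of decline" criterion could look like, with invented scores rather than real trial data:

```python
# Rough sketch of the "halve the rate of decline" criterion with invented numbers.
# Scores stand in for some standard cognitive test; none of these values come
# from real trial data.
def annual_decline(baseline_score, followup_score, years):
    return (baseline_score - followup_score) / years

control_decline = annual_decline(baseline_score=26.0, followup_score=20.0, years=2.0)  # 3.00 pts/yr
treated_decline = annual_decline(baseline_score=26.0, followup_score=23.5, years=2.0)  # 1.25 pts/yr

print(f"control decline: {control_decline:.2f} points/year")
print(f"treated decline: {treated_decline:.2f} points/year")
print(f"meets 'halve the decline' bar: {treated_decline <= 0.5 * control_decline}")
```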
Here is the best I could muster on short notice: https://bittertruths.substack.com/p/ethicophysics-for-skeptics
Since I'm currently rate-limited, I cannot post it officially.
How will we handle the file drawer effect, where insignificant results are quietly shelved? I guess if the trial is preregistered this won't happen...
https://chat.openai.com/share/068f5311-f11a-43fe-a2da-cbfc2227de8e
Here are ChatGPT's speculations on how much it would cost to run this study. I invite any interested reader to work on designing this study. I can also write up my theories as to why this etiology is plausible in arbitrary detail if that is decision-relevant to someone with either grant money or interest in helping to code up the smartphone app we would need to collect the relevant measurements cheaply. (Intuitively, it would be something like a Dual N-Back app, but more user-friendly for Alzheimer's patients.)
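For a sense of the kind of measurement such an app might collect, here is a minimal sketch of an n-back accuracy score. Everything here is hypothetical design on my part, not a spec for any existing app.

```python
# Minimal sketch of the kind of measurement such an app might collect: a single
# n-back accuracy score per session. Purely hypothetical design.
def n_back_accuracy(stimuli, responses, n=2):
    """stimuli: list of presented items; responses: list of booleans, True if the
    user said 'match' at that position. Returns the fraction of correct judgments."""
    correct, trials = 0, 0
    for i in range(n, len(stimuli)):
        is_match = stimuli[i] == stimuli[i - n]
        correct += (responses[i] == is_match)
        trials += 1
    return correct / trials if trials else 0.0

# Example session: letters shown to the user and their match/no-match responses.
stimuli = ["A", "B", "A", "C", "A", "C"]
responses = [False, False, True, False, True, False]
print(f"2-back accuracy: {n_back_accuracy(stimuli, responses):.2f}")  # 0.75
```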
I can put together some sort of proposal tonight, I suppose.
OK, let's do it. Your nickel against my $100.
What resolution criteria should we use? Perhaps the first RCT that studies a treatment I deem sufficiently similar has to find a statistically significant effect with a publishable effect size? Or should we require that the first RCT that studies a similar treatment is halted halfway through because it would be unethical for the control group not to receive the treatment? (We could have a side bet on the latter, perhaps.)
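For context on what "a publishable effect size" would demand of such a trial, here is a back-of-envelope power calculation; the effect size (Cohen's d = 0.5), alpha, and power are placeholder assumptions on my part, not numbers anyone has agreed to.

```python
# Back-of-envelope sketch of the sample size implied by a "publishable" effect,
# using a standard two-sample power calculation. All parameters are placeholders.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8, ratio=1.0)
print(f"Participants needed per arm: {n_per_arm:.0f}")  # roughly 64 per arm under these assumptions
```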
What would the study look like? Presumably scores on a standard cognitive test designed to m...
OK, sounds good! Consider it a bet.
Hey @MadHatter - Eliezer confirms that I've won our bet.
I ask that you donate my winnings to GiveWell's All Grants fund, here, via credit card or ACH (preferred due to lower fees). Please check the box for "I would like to dedicate this donation to someone" and include zac@zhd.dev as the notification email address so that I can confirm here that you've done so.
I wouldn't say I really do satire? My normal metier is more "the truth, with jokes". If I'm acting too crazy to be considered a proper rationalist, it's usually because I am angry or at least deeply annoyed.
OK. I can only personally afford to be wrong to the tune of about $10K, which would be what, $5 on your part? Did I do that math correctly?
OK, anybody who publicly bets on my predicted outcome to the RCT wins the right to engage me in a LessWrong dialogue on a topic of their choosing, in which I will politely set aside my habitual certainty and trollish demeanor.
Well that should be straightforward, and is predicted by my model of serotonin's function in the brain. It would require an understanding of the function of orexin, which I do not currently possess, beyond the standard intuition that it modulates hunger.
The evolutionary story would be this:
Well what's the appropriate way to act in the face of the fact that I AM sure I am right? I've been offering public bets of the nickel of some high-karma person versus my $100, which seems like a fair and attractive bet for anyone who doubts my credibility and ability to reason about the things I am talking about.
I will happily bet anyone with significant karma that Yudkowsky will find my work on the ethicophysics valuable a year from now, at the odds given above.
I have around 2K karma and will take that bet at those odds, for up to 1000 dollars on my side.
Resolution criteria are to ask EY about his views on this sequence as of December 1st, 2024, literally "which of Zac or MadHatter won this bet"; it resolves with no payment if he declines to respond or does not explicitly rule for any other reason.
I'm happy to pay my loss by eg Venmo, and would request winnings as a receipt for your donation to GiveWell's all-grants fund.
And now I am officially rate-limited to one post per week. Be sure to go to my substack if you are curious about what I am up to.
Well, I'll just have to continue being first out the door, then, won't I?
And if people refuse to take such an attractive bet for the reason that my proposed cure sounds like it couldn't possibly hurt anyone, and might indeed help, then I reiterate the point I made in The Alignment Agenda THEY Don't Want You to Know About: the problem is not that my claims are prima facie ridiculous, it is that I myself am prima facie ridiculous.
I will publicly wager $100 against a single nickel with the first 10 people with extremely high LessWrong karma who want to publicly bet against my predicted RCT outcome.
https://alzheimergut.org/research/ is the place to look for all the latest research from the gut microbiome hypothesis community.
Agreed, this is a crucial lesson of history.
Young people forget important stuff, get depressed, struggle to understand the world. That is the prediction of my model: that a bad gut microbiome would cause more neural pruning than is strictly optimal.
It is well documented that starving young people have lower IQs, I believe? Certainly the claim does not seem prima facie ridiculous to me.
The older you get, the more chances you have to develop a bad gut microbiome. Perhaps the actual etiology of bad gut microbiomes (which I do not claim to understand) is heavily age-correlated. Or maybe we simply do no...
Also, for interested readers, I am happy to post a more detailed mechanistic neuroscience explanation of my theory, but want to make sure I'm not breaking my company NDAs by sharing it first.
What's so bad about keeping a human in the loop forever? Do we really think we can safely abdicate our moral responsibilities?
I'm not trying to generate revenue for Wayne. I'm trying to spread his message to force the hand of the judicial system to not imprison him for longer than they already have.
Well, perhaps we can ask, what is reading about? Surely it involves reading through clearly presented arguments and trying to understand the process that generated them, and not presupposing any particular resolution to the question "is this person crazy" beyond the inevitable and unenviable limits imposed by our finite time on Earth.
That's fair. I just want Wayne to get out of jail soon because he's a personal friend of mine.
How to Train Your Shoggoth, Part 2
The Historical Teams Framework for Alignment Research
Eliezer Yudkowsky has famously advocated for embracing "security mindset" when thinking about AI safety. This is a mindset where you think about how to prevent things from going wrong, rather than how to make things go right. This seems obviously correct to me, so for the purposes of this post I'll just take this as a given.
But I think there's a piece missing from the AI Safety community's understanding of security mindset, that is a key part of the mindset that computer...