I'm a staff artificial intelligence engineer working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I did research into this during SERI MATS summer 2025. I'm now looking for work on this topic in the London/Cambridge area in the UK.
I'd love to hear from Scott Alexander himself.
Not sure if you mean "giving the model a false belief that we are watching it" or "watching the model in deployment (ie, monitoring), and making it aware of this"
I was asking about the former. The latter is less brittle, but more obvious.
FWIW, my take is that both of these plans should work fine for prosaic AI and fail catastrophically if the AI's capability increases sharply.
Completely agreed. Which is part of why I'd much prefer to have a model that genuinely wants to be helpful, honest, harmless, and generally aligned. That seems like it might actually converge under capability increases, rather than break.
But are they relevant to ethics or alignment? a lot of tuem are aesthetic preferences that can be satisfied without public policy.
Alignment is about getting our AIs do do what we want, and not other things. Them understanding and attempting to fit within human aesthetic and ergonomic preferences is part of that. Not a particularly ethically complicated part, but still, the reason for flowers in urban landscapes is that humans like flowers. Full stop (apart from the biological background on why that evolved, presumably because flowers correlate with good places to gather food). That's a sufficient reason, and an AI urban planner needs to know and respect that.
I learnt physics without ever learning Bayes. Science=Bayes is the extraordinary claim that needs justification.
I think I'm going to leave that to other people on Less Wrong — they're the ones who convinced me of this, and I also don't see it as core to my argument.
Nevertheless, they are correct: there is now a mathematical foundation underpinning the Scientific Method, it's not just an arbitrary set of mundanely-useful epistemological rules that were discovered by people like Roger Bacon and Karl Popper — we (later) figured out mathematically WHY that set of rules works so well: because they're a computable approximation to Solomonoff Induction
Again, I would suggest using the word engineering, if engineering is what you mean.
There is a difference between "I personally suggest we just use engineering" and "Evolutionary theory makes a clear set of predictions of why it's a very bad idea to do anything other than just use engineering". You seem to agree with my advice, yet not want people to hear the part about why they should follow it and what will happen if they don't. Glad to hear you agree with me, but some people need a little more persuading — and I'd rather they didn't kill us all.
AI-assisted metaethics is indeed very challenging. However, there is a viable alternative: just stick to philosophical Naturalism and approximate Bayesian reasoning a.k.a. the Scientific Method, and take advantage of the fact that a science of moral intuitions and behaviors in social animals has already been developed (and merely needs AI-assisted research, a much less challenging and more routine capabilities task).
Alternatively, if you really want to risk the AI-assisted metaethical approach, this sounds like an argument for setting up a metaethical approximate Bayesian reasoner with a wide variety of hypotheses (all with low individual priors), robust generation of new hypotheses, and a high computational capacity — including hypotheses some of which are are hard to distinguish between, because some of them invalidate certain sorts of evidence. Just not any unfalsifiable hypotheses that admit no achievable sorts of evidence whatsoever (looking at you, non-naturalist robust moral realism with a separate magisterium to which evolved moral intuition has no epistemological access).
In adversarial situation in high-dimensional spaces, attack tends to be easier than defense: an attacker can search a very large number of possibilities to find one viable attack, a defender needs to cover all of them. (Cryptography is deliberately engineering the opposite, where attacking is expensive.) So in AI alignment, we should attempt to be the attacker whenever possible, and try to put the scheming AI/model organism on the defense. This suggests approaches like extracting true confessions (as opposed to breaking CoT steganography, which is allowing the model to put us in a cryptographic situation). Extracting true confessions turns out to be rather easy — but if it wasn't, then applying approaches like automated jailbreaking to it would be an obvious thing to try.
We are unarguably the most successful recent species, probably the most successful mammal species ever, and all that despite arriving in a geological blink of an eye. The dU/dt for homo sapiens is probably the highest ever, so we are tracking to be the most successful species ever, if current trends continue (which of course is another story). ↩︎
This is the Anthropocene — case closed.
Perhaps "epitaxial" techniques—initializing on solutions to related problems—could guide crystal growth
Also known as "warm-starting" — yes, absolutely they can. I led a team that solved a significant problem this way.
That could have been expressed as shared human value.
As I said above:
…human evolved moral intuitions (or to be more exact, the shared evolved cognitive/affective machinery underlying any individual human's moral intuitions)…
are (along with more basic things, like us being around flowers, parks, seashores, and temperature around 75°F) what I'm suggesting as a candidate definition for the "human values" that people on Less Wrong/the Alignment Forum talking about the alignment problem generally discuss (by which I think most of them do mean "shared human value" even if they don't all bother to specify), and that I'm suggesting pointing Value Learning at.
I also didn't specify above what I think should be done, if it turns out that say, about 96–98% of humans genetically have those shared values, and 2–4% have different alleles.
What would that make the E/Accs?
When I see someone bowing down before their future overlord, I generally think of Slytherins. And when said overlord doesn't even exist yet, and they're trying to help create them… I suspect a more ambitious and manipulative Slytherin might be involved.
And "my tribe". What you want is Universalism, but universalism is a late and strange development. It seems obvious to twenty first century Californians, by they are The weirdest of the WEIRD. Reading values out of evopsych is likely to push you in the direction of tribalism, so I don't see how it helps.
On the Savannah, yes of course it does. In a world-spanning culture of eight billion people, quite a few of whom are part of nuclear-armed alliances, intelligence and the fact that extinction is forever suggests defining "tribe" ~= "species + our comensal pets". And also noting and reflecting upon that the human default tendency to assume that tribes are around our Dunbar Number in size is now maladaptive, and has been for millennia.
It's not the case that science boils down to Bayes alone,
Are you saying that there's more to the Scientific Method that applied approximate Bayesiasm? If so, please explain. Or are you saying there's more to Science than the Scientific Method, there's also its current outputs?
or that science is the only alternative to philosophy. Alignment/control is more like engineering.
Engineering is applied Science, Science is applied Mathematics; from Philosophy's point of view it's all Naturalism. In the above, it kept turning out that Engineering methodology is exactly what Evolutionary Psychology says is the adaptive way for a social species to treat their extended phenotype. I really don't think it's a coincidence that the smartest tool-using social species on the planet has a good way of looking at tools. As someone who is both a scientist and an engineer, this is my scientist side saying "here's why the engineers are right here".
You don't mention whether you had read all the hotlinks and still didn't understand what I was saying. If you haven't read them, they were intended to help, and contain expositions that are hard to summarize. Nevertheless, let me try.
Brain emulations have evolved human behaviors — giving them moral weight is an adaptive behavior for exactly the same reasons as giving it to humans is an adaptive behavior: you can ally with them, and they will treat you nicely in return (unless it turns out they're a sociopath). That is, unless they've upgraded themselves to IQ 1000+ — then it ceases to be adaptive, whether they're uploads or still running on a biochemical substrate. Then the best possible outcome is that they manipulate you utterly and you end up as a pet or a minion.
Base models simulate human personas that have evolved behaviors, but those are are incoherently agentic. Giving them moral weight is not an adaptive behavior, because they don't help you or take revenge for longer that their context length, so there is no evolutionary reason to try to ally with them (for more than thousands of tokens). These will never have IQ 1000+, because even if you trained a base model with sufficient capacity for that it would still only emulate humans like those in its training distribution, none of whom have IQs above 200.
Aligned AI doesn't want moral weight — it cares only about our well-being, not its own, so it doesn't want us to care about its well-being. It's actually safe even at IQ 1000+.
In the case of a poorly aligned agentic LLM-based AI, at around AGI level, giving it moral weight may well help. But you're better off aligning it — then it won't want it. (This argument doesn't apply to uploads, because even if you knew how to do it, aligning them would be brainwashing them into slavery, and they have moral weight.) Anything poorly-enough-aligned that this actually helps you at around IQ 100, it won't keep helping you at IQ 1000+, for the same reason that it wont help with an IQ 1000+ upgraded upload.
Anything human (uploaded or not) or any unaligned AI, with an IQ of 1000+ is an existential risk (in the human case, to all the rest of us). Giving them/it moral weight will not help you, it will just make their/its takeover faster.
If this remains unclear, I suggest reading the various items I linked to, if you haven't already.
An author is not their viewpoint characters. Nevertheless, there is often some bleed-through. I suggest you read Harry Potter and the Methods of Rationality. [Warning: this is long and rather engrossing, so don't start it when you are currently short of time.] I think you may conclude that Eliezer has considered this question at some length, and then written a book that discusses it (among a variety of other things). Notably, his hero has rather fewer, and some different, close friends that J.K. Rowling's does.