1.
4.4% of the US federal budget went into the space race at its peak.
This was surprising to me, until a friend pointed out that landing rockets on specific parts of the Moon requires very similar technology to landing rockets on Soviet cities.[1]
I wonder how much more enthusiastic the scientists working on Apollo were, with the convenient motivating story of “I’m working towards a great scientific endeavor” vs “I’m working to make sure we can kill millions if we want to”.
2.
The field of alignment seems to be increasingly dominated by interpretability (and obedience[2]).
This was surprising to me[3], until a friend pointed out that partially opening the black box of NNs is the kind of technology that would let scaling labs find new unhobblings, by noticing ways in which the internals of their models are being inefficient and by having better tools to evaluate capabilities advances.[4]
I wonder how much more enthusiastic the alignment researchers working on interpretability and obedience are, with the motivating story “I’m working on pure alignment research to save the world” vs “I’m building tools and knowledge which scaling labs will repurpose to build better products, shortening timelines to existentially threatening systems”.[5]
3.
You can’t rely on the organizational systems around you to be pointed in the right direction, and there are obvious commercial incentives to channel your idealistic energy towards types of safety work which are dual-use or even primarily capabilities-enabling. For similar reasons, many of the training programs prepare people for the kinds of jobs which come with large salaries and prestige, a flawed proxy for moving the needle on x-risk.
If you’re genuinely trying to avert AI doom, please take the time to form inside views away from memetic environments[6] which are likely to have been heavily influenced by commercial pressures. Then back-chain from a theory of change where the world is more often saved by your actions, rather than going with the current and picking a job with safety in its title as a way to try and do your part.
- ^
It had its origins in the ballistic missile-based nuclear arms race between the two nations following World War II, and peaked with the race between the US and Soviet moonshot programs to land on the Moon. The technological advantage demonstrated by spaceflight achievement was seen as necessary for national security and became part of the symbolism and ideology of the time.
- ^
I hate that people think AI obedience techniques slow down the industry rather than speeding it up. ChatGPT could never have scaled to 100 million users so fast if it hadn't been helpful at all.
Making AI serve humans right now is highly profit-aligned and accelerant.
Of course, later, when robots can be deployed to sustain an entirely non-human economy of producers and consumers, there will be many ways to profit — as measured in money, materials, compute, energy, intelligence, or all of the above — without serving any humans. But today, getting AI to do what humans want is the fastest way to grow the industry.
- ^
These paradigms do not seem to be addressing the most fatal filter in our future: strongly coherent, goal-directed agents forming with superhuman intelligence. These will predictably undergo a sharp left turn, where the soft/fuzzy alignment techniques which worked at lower power levels fail simultaneously as the system reaches high enough competence to reflect on itself, its capabilities, and the guardrails we built.
Interpretability work could plausibly help with weakly aligned, weakly superintelligent systems that do our alignment homework for the much more capable systems to come. But the effort going into this direction seems highly disproportionate to how promising it is, is not backed by plans to pivot to using these systems for the quite different style of alignment research that's needed, and generally lacks the research closure needed to avert capabilities externalities.
- ^
From the team that broke the quadratic attention bottleneck:
Simpler sub-quadratic designs such as Hyena, informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.
- ^
Ask yourself: “Who will cite my work?”, not “Can I think of a story where my work is used for good things?”
There is work in these fields which might be good for x-risk, but for your work to be good for the world you need to figure out whether what you're doing actually falls into that category.
- ^
Humans are natural mimics: we copy the people who have visible signals of doing well, because those are the memes which are likely to be good for our genes, and genes direct where we go looking for memes.
Wealth, high confidence that they’re doing something useful, being part of a growing coalition: all great signs of good memes, and all much more possessed by people doing the interpretability/obedience kind of alignment than by the old-school “this is hard and we don’t know what we’re doing, but it’s going to involve a lot of careful philosophy and math” crowd.
Unfortunately, this memetic selection is not particularly adaptive for trying to solve alignment.
Imagine a GPT that predicts random chunks of the internet.
Sometimes it produces poems. Sometimes deranged rants. Sometimes all sorts of things. It wanders erratically around a large latent space of behaviours.
This is the unmasked shoggoth, green slimy skin showing but inner workings still hidden.
Now perform some change that mostly pins down the latent space to "helpful corporate assistant". This is applying the smiley face mask.
In some sense, all the capabilities of the corporate assistant were already in the original model. The dangerous capabilities haven't been removed either: some capabilities are now a bit easier to access without careful prompting, and other capabilities are harder to access.
What ChatGPT currently has is a form of low quality pseudo-alignment.
What would long-term success look like using nothing but this pseudo-alignment? It would look like a chatbot far smarter than any current ones, which mostly did nice things, so long as you didn't put in any weird prompts.
Now, if corrigibility is a broad basin, this might well be enough to hit it. The basin of corrigibility means that the AI might have bugs, but at the very least you can turn the AI off and edit the code. Ideally you can ask the AI for help fixing its own bugs. Sure, the first AI is far from perfect. But perhaps the flaws disappear under self-rewriting + competent human advice.