generating evidence of danger
Has this worked so far? How many cases can we point to where a person who was publicly skeptical of AI danger changed their public stance, as a result of seeing experimental evidence?
What's the theory of change for "generating evidence of danger"? The people who are most grim about AI would probably tell you that there is already plenty of evidence. How will adding more evidence to the pile help? Who will learn about this new evidence, and how will it cause them to behave differently?
Here's a theory of change I came up w...
Technically the point of going to college is to help you thrive in the rest of your life after college. If you believe in AI 2027, the most important thing for the rest of your life is for AI to be developed responsibly. So, maybe work on that instead of college?
I think the EU could actually be a good place to protest for an AI pause. Because the EU doesn't have national AI ambitions, and the EU is increasingly skeptical of the US, it seems to me that a bit of protesting could do a lot to raise awareness of the reckless path that the US is taking. That, ...
If people start losing jobs from automation, that could finally build political momentum for serious regulation.
Suggested in Zvi's comments the other month (22 likes):
...The real problem here is that AI safety feels completely theoretical right now. Climate folks can at least point to hurricanes and wildfires (even if connecting those dots requires some fancy statistical footwork). But AI safety advocates are stuck making arguments about hypothetical future scenarios that sound like sci-fi to most people. It's hard to build political momentum around "trust
I am optimistic that further thinking on automation prospects could identify other automation-tractable areas of alignment and control (e.g. see here for previous work).
This tag might be helpful: https://www.lesswrong.com/w/ai-assisted-alignment
Here's a recent shortform on the topic: https://www.lesswrong.com/posts/mKgbawbJBxEmQaLSJ/davekasten-s-shortform?commentId=32jReMrHDd5vkDBwt
I wonder about getting an LLM to process LW archive posts, and tag posts which contain alignment ideas that seem automatable.
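A minimal sketch of what that tagging pass could look like, assuming the archive is available as local text files and using the OpenAI chat API as a stand-in for whatever model is convenient; the directory name, prompt, and model name below are all placeholders:

```python
# Sketch: ask an LLM, for each archived post, whether it contains an alignment
# research idea that looks automatable. Not a polished pipeline, just the shape of it.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You will be given a LessWrong post. Reply with a JSON object with two fields: "
    '"automatable" (true/false) and "idea" (a one-sentence summary of any alignment '
    "research idea in the post that looks automatable, or null).\n\nPOST:\n{post}"
)

def tag_post(text: str) -> dict:
    """Ask the LLM whether this post contains an automatable alignment idea."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model would do
        messages=[{"role": "user", "content": PROMPT.format(post=text[:20000])}],
    )
    raw = response.choices[0].message.content
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"automatable": None, "idea": None, "raw": raw}

if __name__ == "__main__":
    results = {p.name: tag_post(p.read_text()) for p in Path("lw_archive").glob("*.txt")}
    print(json.dumps(results, indent=2))  # flagged posts can then be reviewed by hand
```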
It looks like the comedian whose clip you linked has a podcast:
https://www.joshjohnsoncomedy.com/podcasts
I don't see any guests in their podcast history, but maybe someone could invite him on a different podcast? His website lists appearances on other podcasts. I figure it's worth trying stuff like this for VoI.
I think people should emphasize more the rate of improvement in this technology. Analogous to early days of COVID -- it's not where we are that's worrisome; it's where we're headed.
For humans acting very much not alone, like big AGI research companies, yeah that's clearly a big problem.
How about a group of superbabies that find and befriend each other? Then they're no longer acting alone.
I don't think the problem is about any of the people you listed having too much brainpower.
I don't think problems caused by superbabies would look distinctively like "having too much brainpower". They would look more like the ordinary problems humans have with each other. Brainpower would be a force multiplier.
...(I feel we're somewhat talki
I mostly just want people to pay attention to this problem.
Ok. To be clear, I strongly agree with this. I think I've been responding to a claim (maybe explicit, or maybe implicit / imagined by me) from you like: "There's this risk, and therefore we should not do this.". Where I want to disagree with the implication, not the antecedent. (I hope to more gracefully agree with things like this. Also someone should make a LW post with a really catchy term for this implication / antecedent discourse thing, or link me the one that's already been written.)
But I...
I think this project should receive more red-teaming before it gets funded.
Naively, it would seem that the "second species argument" matches much more strongly to the creation of a hypothetical Homo supersapiens than it does to AGI.
We've observed many warning shots regarding catastrophic human misalignment. The human alignment problem isn't easy. And "intelligence" seems to be a key part of the human alignment picture. Humans often lack respect or compassion for other animals that they deem intellectually inferior -- e.g. arguing that because those othe...
Right now only low-E tier human intelligences are being discussed; they'll be able to procreate with humans and be a minority.
Considering current human distributions, and a lack of 160+ IQ people having written off sub-100 IQ populations as morally useless, I doubt a new sub-population at 200+ is going to suddenly turn on humanity.
If you go straight to 1000 IQ or something, sure, we might be like animals compared to them.
You shouldn't and won't be satisfied with this alone, as it doesn't deal with or even emphasize any particular peril; but to be clear, I have definitely thought about the perils: https://berkeleygenomics.org/articles/Potential_perils_of_germline_genomic_engineering.html
These are incredibly small peanuts compared to AGI omnicide.
The jailbreakability and other alignment failures of current AI systems are also incredibly small peanuts compared to AGI omnicide. Yet they're still informative. Small-scale failures give us data about possible large-scale failures.
You're somehow leaving out all the people who are smarter than those people, and who were great for the people around them and humanity? You've got like 99% actual alignment or something
Are you thinking of people such as Sam Altman, Demis Hassabis, Elon Musk...
Humans are very far from fooming.
Tell that to all the other species that went extinct as a result of our activity on this planet?
I think it's possible that the first superbaby will be aligned, same way it's possible that the first AGI will be aligned. But it's far from a sure thing. It's true that the alignment problem is considerably different in character for humans vs AIs. Yet even in this particular community, it's far from solved -- consider Brent Dill, Ziz, Sam Bankman-Fried, etc.
Not to mention all of history's great villains, many of whom beli...
Tell that to all the other species that went extinct as a result of our activity on this planet?
Individual humans.
Brent Dill, Ziz, Sam Bankman-Fried, etc.
If you look at the grim history of how humans have treated each other on this planet, I don't think it's justified to have a prior that this is gonna go well.
I think we have a huge advantage with humans simply because there isn't the same potential for runaway self-improvement.
Humans didn't have the potential for runaway self-improvement relative to apes. That was little comfort for the apes.
It's utterly different.
Can anyone think of alignment-pilled conservative influencers besides Geoffrey Miller? Seems like we could use more people like that...
Maybe we could get alignment-pilled conservatives to start pitching stories to conservative publications?
Likely true, but I also notice there's been a surprising amount of drift of political opinions from the left to the right in recent years. The right tends to put their own spin on these beliefs, but I suspect many are highly influenced by the left nonetheless.
Some examples of right-coded beliefs which I suspect are, to some degree, left-inspired:
"Capitalism undermines social cohesion. Consumerization and commoditization are bad. We're a nation, not an economy."
"Trans women undermine women's rights and women's spaces. Motherhood, and women's digni
I think the National Review is the most prestigious conservative magazine in the US, but there are various others. City Journal articles have also struck me as high-quality in the past. I think Coleman Hughes writes for them, and he did a podcast with Eliezer Yudkowsky at one point.
However, as stated in the previous link, you should likely work your way up and start by pitching lower-profile publications.
The big one probably has to do with being able to corrupt the metrics so totally that whatever you think you made them unlearn actually didn't happen, or just being able to relearn the knowledge so fast that unlearning doesn't matter
I favor proactive approaches to unlearning which prevent the target knowledge from being acquired in the first place. E.g. for gradient routing, if you can restrict "self-awareness and knowledge of how to corrupt metrics" to a particular submodule of the network during learning, then if that submodule isn't active, you can ...
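As a rough illustration of that routing idea (a coarse sketch, not the actual gradient routing implementation): flagged batches only update a designated "sequestered" block, which can then be dropped at inference. The toy model, the flagging mechanism, and the block layout are all stand-ins.

```python
# Coarse sketch of "route learning for flagged data into one submodule":
# on flagged batches only the sequestered block (and head) receive gradients;
# on ordinary batches only the main pathway does.
import torch
import torch.nn as nn

class RoutedModel(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.main = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.sequestered = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.head = nn.Linear(d, 2)

    def forward(self, x: torch.Tensor, use_sequestered: bool = True) -> torch.Tensor:
        h = self.main(x)
        if use_sequestered:
            h = h + self.sequestered(x)  # this block can be dropped at inference
        return self.head(h)

def routed_step(model, x, y, flagged: bool, opt, loss_fn=nn.CrossEntropyLoss()):
    """One training step; `flagged` marks batches carrying the target knowledge."""
    for p in model.sequestered.parameters():
        p.requires_grad_(flagged)
    for p in model.main.parameters():
        p.requires_grad_(not flagged)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()  # params with no grad are skipped
    return loss.item()

# Usage sketch:
#   model = RoutedModel()
#   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
#   routed_step(model, x, y, flagged=batch_contains_target_knowledge, opt=opt)
# At deployment, model(x, use_sequestered=False) runs without the localized knowledge.
```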
Regarding articles which target a popular audience such as How AI Takeover Might Happen in 2 Years, I get the sense that people are preaching to the choir by posting here and on X. Is there any reason people aren't pitching pieces like this to prestige magazines like The Atlantic or wherever else? I feel like publishing in places like that is a better way to shift the elite discourse, assuming that's the objective. (Perhaps it's best to pitch to publications that people in the Trump admin read?)
Here's an article on pitching that I found on the EA Forum ...
I think unlearning could be a good fit for automated alignment research.
Unlearning could be a very general tool to address a lot of AI threat models. It might be possible to unlearn deception, scheming, manipulation of humans, cybersecurity, etc. I challenge you to come up with an AI safety failure story that can't, in principle, be countered through targeted unlearning in some way, shape, or form.
Relative to some other kinds of alignment research, unlearning seems easy to automate, since you can optimize metrics for how well things have been unlearned.
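For concreteness, here is the shape of metric I have in mind (a sketch; the forget/retain split, the model interface, and the weighting are placeholders). An automated loop could propose unlearning interventions and score them by how much loss rises on the targeted material while staying flat elsewhere:

```python
# Sketch of an unlearning metric an automated pipeline could optimize:
# high loss on a "forget" set, low loss on a "retain" set.
import torch
import torch.nn.functional as F

@torch.no_grad()
def unlearning_score(model, forget_batches, retain_batches, retain_weight: float = 1.0) -> float:
    """Higher is better: the model should do badly on forgotten material
    while staying good on retained material."""
    def mean_loss(batches):
        losses = [F.cross_entropy(model(x), y).item() for x, y in batches]
        return sum(losses) / len(losses)

    forget_loss = mean_loss(forget_batches)   # want this high
    retain_loss = mean_loss(retain_batches)   # want this low
    return forget_loss - retain_weight * retain_loss
```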
I...
China has alienated virtually all its neighbours
That sounds like an exaggeration? My impression is that China has OK/good relations with countries such as Vietnam, Cambodia, Pakistan, Indonesia, North Korea, factions in Myanmar. And Russia, of course. If you're serious about this claim, I think you should look at a map, make a list of countries which qualify as "neighbors" based purely on geographic distance, then look up relations for each one.
What I think is more likely than EA pivoting is that a handful of people launch a lifeboat and recreate a high-integrity version of EA.
Thoughts on how this might be done:
Interview a bunch of people who became disillusioned. Try to identify common complaints.
For each common complaint, research organizational psychology, history of high-performing organizations, etc. and brainstorm institutional solutions to address that complaint. By "institutional solutions", I mean approaches which claim to e.g. fix an underlying bad incentive structure, so it won't
The possibility for the society-like effect of multiple power centres creating prosocial incentives on the projects
OpenAI behaves in a generally antisocial way, inconsistent with its charter, yet other power centers haven't reined it in. Even in the EA and rationalist communities, people don't seem to have asked questions like "Is the charter legally enforceable? Should people besides Elon Musk be suing?"
If an idea is failing in practice, it seems a bit pointless to discuss whether it will work in theory.
One idea is to use a base LLM with no RLHF, compute the perplexity of the reasoning text, and add it as an additional term in the loss function. That should help with comprehensibility, but it doesn't necessarily help with steganography. To disincentivize steganography, you could add noise to the reasoning in various ways, and remove any incentive for terseness, to ensure the model isn't trying to squeeze more communication into a limited token budget.
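A sketch of the perplexity term, using GPT-2 as a stand-in for "a base LLM with no RLHF"; the coefficient and the way the penalty enters the objective are illustrative, and in an RL setup it would most naturally be folded into the reward:

```python
# Sketch: score the chain-of-thought text under a frozen base model and penalize
# high negative log-likelihood, so hard-to-read reasoning is discouraged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def reasoning_nll(reasoning_text: str) -> float:
    """Average per-token negative log-likelihood of the reasoning under the base model
    (log-perplexity). Lower means the text looks more like ordinary language."""
    ids = tokenizer(reasoning_text, return_tensors="pt").input_ids
    out = base_model(ids, labels=ids)  # loss is mean cross-entropy over tokens
    return out.loss.item()

def shaped_reward(task_reward: float, reasoning_text: str, coeff: float = 0.1) -> float:
    # Subtract the comprehensibility penalty from whatever reward is being optimized;
    # in a supervised setup it could instead be added to the loss.
    return task_reward - coeff * reasoning_nll(reasoning_text)
```

Since the penalty is computed on sampled reasoning tokens, it provides a training signal through the reward rather than through backpropagation into the text itself.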
A basic idea for detecting steganography is to monitor next-token probabilities for synonym pairs. If t...
On LessWrong, there's a comment section where hard questions can be asked and are asked frequently.
In my experience, asking hard questions here is quite socially unrewarding. I could probably think of a dozen or so cases where I think the LW consensus "emperor" has no clothes, that I haven't posted about, just because I expect it to be an exercise in frustration. I think I will probably quit posting here soon.
...I don't think AI policy is a good example for discourse on LessWrong. There are strategic reasons to be less transparent about how to affect p
Great post. Self-selection seems huge for online communities, and I think it's no different on these fora.
Confidence level: General vague impressions and assorted thoughts follow; could very well be wrong on some details.
A disagreement I have with both the rationalist and EA communities is what the process of coming to robust conclusions looks like. In those communities, it seems like the strategy is often to identify a few super-geniuses who go do a super-deep analysis, and come to a conclusion that's assumed to be robust and trustworthy. See the "Grou...
Yeah, I think there are a lot of underexplored ideas along these lines.
It's weird how so much of the internet seems locked into either the reddit model (upvotes/downvotes) or the Twitter model (likes/shares/followers), when the design space is so much larger than that. Someone like Aaron, who played such a big role in shaping the internet, seems more likely to have a gut-level belief that it can be shaped. I expect there are a lot more things like Community Notes that we could discover if we went looking for them.
I've always wondered what Aaron Swartz would think of the internet now, if he was still alive. He had far-left politics, but also seemed to be a big believer in openness, free speech, crowdsourcing, etc. When he was alive those were very compatible positions, and Aaron was practically the poster child for holding both of them. Nowadays the far left favors speech restrictions and is cynical about the internet.
Would Aaron have abandoned the far left, now that they are censorious? Would he have become censorious himself? Or would he have invented some clever new technology, like RSS or reddit, to try and fix the internet's problems?
Just goes to show what a tragedy death is, I guess.
I imagine he would have tried to do things like help people see separations in who has what opinion, so as to avoid people thinking of an entire group as uniformly exhibiting a single deleterious behavior; this might, for example, help amplify the parts of the group that do not exhibit that behavior, so that the good things that group has to offer, by the lights of humanity, are not lost due to a negative behavior.
I generally find the left has a lot of useful and interesting things to say, if you're able to get people to share their beliefs in...
I expect escape will happen a bunch
Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here?
To ensure the definition of "escape" is not gerrymandered -- do you know of any cases of escape right now? Do you think escape has already occurred and you just don't know about it? "Escape" means something qualitatively different from any known event up to this point, yes? D...
Sure there will be errors, but how important will those errors be?
Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that's error-prone in similar ways, that doesn't seem like it's obviously either a gain or a loss. How would such a system compare to an em of a human, for example?
If you want to show that we're truly doomed, I think you need additional steps beyond just "there will be errors".
Some recent-ish bird flu coverage:
Global health leader critiques ‘ineptitude’ of U.S. response to bird flu outbreak among cows
A Bird-Flu Pandemic in People? Here’s What It Might Look Like. TLDR: not good. (Reload the page and ctrl-a then ctrl-c to copy the article text before the paywall comes up.) Interesting quote: "The real danger, Dr. Lowen of Emory said, is if a farmworker becomes infected with both H5N1 and a seasonal flu virus. Flu viruses are adept at swapping genes, so a co-infection would give H5N1 opportunity to gain genes that enable it to s...
About a month ago, I wrote a quick take suggesting that an early messaging mistake made by MIRI was: claim there should be a single leading FAI org, but not give specific criteria for selecting that org. That could've led to a situation where Deepmind, OpenAI, and Anthropic can all think of themselves as "the best leading FAI org".
An analogous possible mistake that's currently being made: Claim that we should "shut it all down", and also claim that it would be a tragedy if humanity never created AI, but not give specific criteria for when it would be app...
Don’t have time to respond in detail but a few quick clarifications/responses:
Sure, don't feel obligated to respond, and I invite the people disagree-voting my comments to hop in as well.
— There are lots of groups focused on comms/governance. MIRI is unique only insofar as it started off as a “technical research org” and has recently pivoted more toward comms/governance.
That's fair; when you said "pretty much any other organization in the space", I was thinking of technical orgs.
MIRI's uniqueness does seem to suggest it has a comparative advantage fo...
I think if MIRI engages with “curious newcomers” those newcomers will have their own questions/confusions/objections and engaging with those will improve general understanding.
You think policymakers will ask the sort of questions that lead to a solution for alignment?
In my mind, the most plausible way "improve general understanding" can advance the research frontier for alignment is if you're improving the general understanding of people fairly near that frontier.
...Based on my experience so far, I don’t expect their questions/confusions/objections to ov
So what's the path by which our "general understanding of the situation" is supposed to improve? There's little point in delaying timelines by a year, if no useful alignment research is done in that year. The overall goal should be to maximize the product of timeline delay and rate of alignment insights.
Also, I think you may be underestimating the ability of newcomers to notice that MIRI tends to ignore its strongest critics. See also previously linked comment.
I think if MIRI engages with “curious newcomers” those newcomers will have their own questions/confusions/objections and engaging with those will improve general understanding.
Based on my experience so far, I don’t expect their questions/confusions/objections to overlap a lot with the questions/confusions/objections that tech-oriented active LW users have.
I also think it’s not accurate to say that MIRI tends to ignore its strongest critics; there’s perhaps more public writing/dialogues between MIRI and its critics than for pretty much any other organizatio...
In terms of "improve the world's general understanding of the situation", I encourage MIRI to engage more with informed skeptics. Our best hope is if there is a flaw in MIRI's argument for doom somewhere. I would guess that e.g. Matthew Barnett he has spent something like 100x as much effort engaging with MIRI as MIRI has spent engaging with him, at least publicly. He seems unusually persistent -- I suspect many people are giving up, or gave up long ago. I certainly feel quite cynical about whether I should even bother writing a comment like this one.
superbabies
I'm concerned there may be an alignment problem for superbabies.
Humans often have contempt for people and animals with less intelligence than them. "You're dumb" is practically an all-purpose putdown. We seem to assign moral value to various species on the basis of intelligence rather than their capacity for joy/suffering. We put chimpanzees in zoos and chickens in factory farms.
Additionally, jealousy/"xenophobia" towards superbabies from vanilla humans could lead them to become misanthropes. Everyone knows genetic enhancement is a radioa...
Also a strategy postmortem on the decision to pivot to technical research in 2013: https://intelligence.org/2013/04/13/miris-strategy-for-2013/
I do wonder about the counterfactual where MIRI never sold the Singularity Summit, and it was blowing up as an annual event, same way Less Wrong blew up as a place to discuss AI. Seems like owning the Summit could create a lot of leverage for advocacy.
One thing I find fascinating is the number of times MIRI has reinvented themselves as an organization over the decades. People often forget that they were originally...
I appreciate your replies. I had some more time to think and now I have more takes. This isn't my area, but I'm having fun thinking about it.
See https://en.wikipedia.org/wiki/File:ComputerMemoryHierarchy.svg
Disk encryption is table stakes. I'll assume any virtual memory is also encrypted. I don't know much about that.
I'm assuming no use of flash memory.
Absent homomorphic encryption, we have to decrypt in the registers, or whatever they're called for a GPU.
So basically the question is how valuable is it to encrypt the weights in RAM and poss...
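A toy illustration of the "weights encrypted in RAM, decrypted just before use" data flow; this only shows the mechanics, and real protection would need the key to live somewhere an attacker with RAM access can't read, which is exactly the hard part discussed above. The array shapes and the use of the cryptography package's Fernet API are just for illustration.

```python
# Sketch: keep a layer's weights encrypted at rest / in RAM, decrypt right before use.
import numpy as np
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in reality the key must not sit in the same RAM
fernet = Fernet(key)

def encrypt_weights(w: np.ndarray) -> bytes:
    return fernet.encrypt(w.astype(np.float32).tobytes())

def decrypt_weights(blob: bytes, shape: tuple) -> np.ndarray:
    # Decrypt just before the layer is needed, then discard the plaintext copy.
    return np.frombuffer(fernet.decrypt(blob), dtype=np.float32).reshape(shape)

if __name__ == "__main__":
    layer = np.random.randn(4, 4).astype(np.float32)
    blob = encrypt_weights(layer)              # this is what sits in RAM / on disk
    restored = decrypt_weights(blob, layer.shape)
    assert np.allclose(layer, restored)
```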
QFT is the extreme example of a "better abstraction", but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. E...
If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking "what would I think if I was in their situation", but in principle an AI doesn't have to work that way. But perhaps you think there are strong reasons why this would happen in practice?
Supposing we had strong rea...
If I were to guess, I'd guess that by "you" you're referring to someone or something outside of the model, who has access to the model's internals, and who uses that access to, as you say, "read" the next token out of the model's ontology.
Was using a metaphorical "you". Probably should've said something like "gradient descent will find a way to read the next token out of the QFT-based simulation".
Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those?
I suppose ...
Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).
I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]
So we need not assume that predicting "the genius philosopher" is a core task.
It's not the genius philosopher that's the core task, it's the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we're doing next-token prediction on e.g. a book ...
I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...
That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?
That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?
If I can ass...
I think this quantum fields example is perhaps not all that forceful, because in your OP you state
maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system
However, it sounds like you're describing a system where we represent humans using quantum fields as a routine matter, so fitting the translation into the system isn't sounding like a huge problem? Like, if I want to know the answer to some moral dilemma, I can simulate my favorite philosopher at the level of quantum ...
Re: "number go up" research tasks that seem automatable -- one idea I had is to use an LLM to process the entire LW archive, and identify alignment research ideas which could be done using the "number go up" approach (or seem otherwise amenable to automation).
"Proactive unlearning", in particular, strikes me as a quite promising research direction which could be automated. Especially if it is possible to "proactively unlearn" scheming. Gradient routing would be an example of the sort of approach I have in mind.
To elaborate: I think it is ideal to have an...