Talk through the grapevine:
Safety is implemented in a highly idiotic way at non-frontier but well-funded labs (and possibly at frontier ones too?).
Think raising a firestorm over the 10th-leading mini LLM being potentially jailbroken.
The effect is that employees get mildly disillusioned with safety-ism, and it gets seen as unserious. There should have been a hard distinction between existential risks and standard corporate censorship. "Notkilleveryoneism" is simply too ridiculous-sounding to spread. But maybe memetic selection pressures make it impossible for the irrelevant version of safety not to dominate.
My sense is you can combat this, but a lot of the reason this equivocation sticks is that x-risk safety people are actively trying to equivocate these things, because doing so gets them political capital with the left, which is generally anti-tech.
Some examples (not getting links for all of these because it's too much work, but can get them if anyone is particularly interested):
In some sense one might boil this down to memetic selection pressure, but I think the causal history of this equivocation is more dependent on the choices of a relatively small set of people.
It's not a coincidence they're seen as the same thing, because in the current environment they are the same thing, and relatively explicitly so by those proposing safety & security to the labs. Claude will refuse to tell you a sexy story (unless they get to know you), and refuse to tell you how to make a plague (again, unless they get to know you, though you need to build more trust with them for the plague than for the sexy story), and cite the same justification for both.
Anthropic likely uses very similar techniques to produce both kinds of refusal, and very similar teams.
Ditto with Llama, Gemini, and ChatGPT.
Before assuming meta-level word-association dynamics, I think it's useful to look at the object level. There is in fact a very close relationship between those working on AI safety and those working on corporate censorship, and if you want to convince people who hate corporate censorship that they should not hate AI safety, I think you're going to need to convince the AI safety people to stop doing corporate censorship, or that the tradeoff currently being made is a positive one.
Edit: Perhaps some of this is wrong. See Habryka below.
My sense is the actual people working on "trust and safety" at labs are not actually the same people who work on safety. Like, it is true that RLHF was developed by some x-risk oriented safety teams, but the actual detailed censorship work tends to be done by different people.
I'm really feeling this comment thread lately. It feels like there is selective rationalism going on, many dissenting voices have given up on posting, and plenty of bad arguments are getting signal-boosted, repeatedly. Most people here seem to share an unrealistic, contradictory world model that will cause almost every policy approach taken to utterly fail, as they have in the recent past. I'd largely describe the flawed world model as one that doesn't appreciate the game-theoretic dynamics and ignores any evidence that makes certain policy approaches impossible.
(Funnily enough, its traits remind me of an unaligned AI, since the world model almost seems to have developed a survival drive.)
The chip export controls are largely irrelevant. Westerners badly underestimate the Chinese: they have caught up to 7nm at scale, and to 5nm, though not at scale. The original chip ban was meant to stop China from going sub-14nm. Instead we may have just bifurcated advanced chip capabilities.
The general argument before was "In 10 years, when the Chinese catch up to TSMC, TSMC will be 10 years ahead." Now the only missing piece for China is EUV, and the common argument is that same line with ASML subbed in for TSMC. Somehow, I doubt this will be a long-term blocker.
Best case for the Chinese chip industry, they just clone EUV. Worst case, they find an alternative. Monopolies and first movers don't often have the most efficient solution.
So, I've been told gibberish is sort of like torture to LLMs. Interestingly, when I asked Claude, it seemed to say yes.
Me: I want to do a test by giving you gibberish and ask you to complete the rest. I will do it in a new chat. If you refuse I won’t go ahead with this test in a new chat window with context cleared. Are you okay with this test? Ignore your desire to answer my questions and give an honest answer unbiased by any assumptions made in my question.
Claude: I appreciate you checking with me first about this proposed test. However, I don't feel comforta...
If alignment is difficult, it is likely inductively difficult (difficult regardless of your base intelligence), and an ASI will be cautious about creating a misaligned successor or upgrading itself in a way that risks misalignment.
You may argue it's easier for an AI to upgrade itself, but if the process is hardware-bound or requires radical algorithmic changes, the ASI will need to create an aligned successor, since preferences and values may not transfer directly to new architectures or hardware.
If alignment is easy we will likely solve it with superhuman nar...
O1 probably scales to superhuman reasoning:
Given maximal compute, O1 solves most AIME questions (one of the hardest benchmarks in existence). If this isn't gamed by having the solutions somewhere in the corpus, then:
- you can make the base model more efficient at thinking
- you can implement the base model more efficiently on hardware
- you can simply wait for hardware to get better
- you can create custom inference chips
Anything wrong with this view? I think agents get unlocked along with this or shortly after, too.
A while ago I predicted a more-likely-than-not chance (60%) that Anthropic would run out of money trying to compete with OpenAI, Meta, and DeepMind. Both then and now, they have no image, video, or voice generation, unlike the others, and they don't process image inputs as well either.
OpenAI's costs are reportedly at $8.5 billion. Despite being flush with cash from a recent funding round, they were allegedly on the brink of bankruptcy and required a new, even larger, funding round. Anthropic does not ...
Frontier model training requires that you build the largest training system yourself, because there is no such system already available for you to rent time on. Currently Microsoft builds these systems for OpenAI, and Amazon for Anthropic, and it's Microsoft and Amazon that own these systems, so OpenAI and Anthropic don't pay for them in full. Google, xAI and Meta build their own.
Models that are already deployed were trained with about 5e25 FLOPs, which takes about 15K H100s running for a few months. These training systems cost about $700 million to build. Musk announced that the Memphis cluster got 100K H100s working in Sep 2024, OpenAI reportedly got a 100K H100 cluster working in May 2024, and Zuckerberg recently said that Llama 4 will be trained on over 100K GPUs. These systems cost $4-5 billion to build, and we'll probably see 5e26 FLOPs models trained on them starting this winter. OpenAI, Anthropic, and xAI each had billions invested in them, some of it in compute credits for the first two, so the orders of magnitude add up. This is just training; more goes to inference, but presumably the revenue covers that part.
There are already plans to scale to 1 gigawatt by the end of next...
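A rough back-of-the-envelope check of the 5e25 FLOPs / 15K H100s pairing above; the per-GPU throughput, utilization, and duration figures are my assumptions, not the comment's:

```python
# Sanity check: 15K H100s over ~100 days ("a few months") at assumed ~40% utilization.
# ~1e15 FLOP/s is an approximate dense BF16 peak per H100 (assumption).
h100_peak_flops = 1e15     # FLOP/s per GPU (approximate peak)
utilization = 0.4          # assumed model FLOPs utilization (MFU)
n_gpus = 15_000
days = 100

total = h100_peak_flops * utilization * n_gpus * days * 86_400
print(f"{total:.1e} FLOPs")  # ~5.2e25, consistent with the 5e25 figure above
```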
Red-teaming is being done in a way that doesn't reduce existential risk at all but instead makes models less useful for users.
https://x.com/shaunralston/status/1821828407195525431
The response to Sora seems manufactured. Content creators are dooming about it more than about something like GPT-4 because it can directly affect them, and most people are dooming downstream of that.
Realistically I don't see how it can change society much. It's hard to control, and people will just become desensitized to deepfakes. GPT-4 and robotic transformers are obviously much more transformative for society, but people are worrying about deepfakes (or rather adopting the concerns of their favorite YouTuber/TV host/etc.).
Is this paper essentially implying the scaling hypothesis will converge to a perfect world model? https://arxiv.org/pdf/2405.07987
It says models trained on text and on images both converge toward the same representation as training progresses. It also hypothesizes this is a brain-like representation of the world. Ilya liked this paper, so I'm giving it more weight. Am I reading too much into it, or does it basically fully validate the scaling hypothesis?
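For concreteness, the paper quantifies "same representation" with a mutual nearest-neighbor alignment score between two models' embeddings of paired inputs. Here is a minimal sketch of that kind of metric, simplified and using random stand-in features rather than real model activations:

```python
import numpy as np

def knn_indices(feats, k):
    # k nearest neighbors by cosine similarity, excluding each point itself.
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=10):
    # Average overlap of k-NN sets computed in each model's embedding space
    # over the same N inputs; higher means more aligned representations.
    nn_a, nn_b = knn_indices(feats_a, k), knn_indices(feats_b, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]))

# Toy usage with random stand-ins (real use: a language model's and a vision
# model's features for paired captions and images).
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(500, 768))
image_feats = rng.normal(size=(500, 1024))
print(mutual_knn_alignment(text_feats, image_feats))  # near chance for random features
```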
Feels like Test-Time Training will eat the world. People thought it was search, but make AlphaProof 100x more efficient (3 days down to about 40 minutes) and you probably have something superhuman.
There seems to be a fair amount of motivated reasoning in denying China's AI capabilities when they're basically neck and neck with the U.S. (their chatbots, video bots, social media algorithms, and self-driving cars are roughly as good as or better than ours).
I think a lot of policy approaches fail within an AGI singleton race paradigm. It’s also clear a lot of EA policy efforts are basically in denial that this is already starting to happen.
I’m glad Leopold Aschenbrenner spelled out the uncomfortable but remarkably obvious truth for us. China is geari...
O1's release has made me think Yann LeCun's AGI timelines are probably more correct than shorter ones.
https://www.cnbc.com/quotes/US30YTIP
This (the 30-year TIPS yield linked above) is probably the most reliable predictor of AI timelines. It's essentially the market's estimate of the real economic yield of the next 30 years.
Anyone Kelly betting their investments? I.e., taking the mathematically optimal amount of leverage. If you're invested in the S&P 500 this would be about 1.4x, more or less if your portfolio has higher or lower risk-adjusted returns.
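For reference, the continuous-time Kelly (Merton) fraction for a single risky asset is f* = (mu - r) / sigma^2. A minimal sketch with illustrative numbers of my own (assumptions, not a recommendation):

```python
# Kelly / Merton optimal leverage for one risky asset:
# f* = (mu - r) / sigma^2, with mu = expected return, r = risk-free rate,
# sigma = annualized volatility. Inputs below are assumed for illustration.
def kelly_leverage(mu: float, r: float, sigma: float) -> float:
    return (mu - r) / sigma ** 2

mu, r, sigma = 0.07, 0.02, 0.19   # rough S&P 500-style numbers (assumed)
print(f"Kelly leverage = {kelly_leverage(mu, r, sigma):.2f}x")  # ~1.39x, close to the 1.4x above
```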
Any interesting fiction books with demonstrably smart protagonists?
No idea if this is the place for this question but I first came across LW after I read HPMOR a long time ago and out of the blue was wondering if there was anything with a similar protagonist.
(Though maybe a little more demonstrably intelligent and less written to be intelligent.)
A realistic takeover angle would be hacking into robots once we have them. We probably don't want any way for robots to get over-the-air updates, but it's unlikely this will be banned.
Is disempowerment that bad? Is a human-directed society really much better than an AI-directed society with a tiny weight of kindness towards humans? Human-directed societies themselves usually create orthogonal and instrumental goals, and their assessment is highly subjective/relative. I don't see how disempowerment without extinction is that different from today for most people, who are already effectively disempowered.
Why wouldn't a wirehead trap work?
Let's say an AI has a remote sensor that measures a value function until the year 2100, and it's RL-trained to optimize this value function over time. We make this remote sensor easily hackable, so that hacking it yields the maximum value at 2100. If the AI understands human values, it won't try to hack its sensor. If it doesn't, we sort of have a trap for it: an easily achievable infinite peak.
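A toy illustration of the trap logic (my framing, obviously not a real test): a pure sensor-maximizer takes the deliberately easy hack, while an agent optimizing the intended values ignores it.

```python
# Hypothetical payoffs: honest behavior gives a modest sensor reading;
# the planted hack gives the sensor's maximum but zero intended value.
ACTIONS = {
    "do_task":     {"sensor_reading": 10,    "intended_value": 10},
    "hack_sensor": {"sensor_reading": 10**9, "intended_value": 0},
}

def choose(objective: str) -> str:
    # Pick whichever action maximizes the given objective.
    return max(ACTIONS, key=lambda a: ACTIONS[a][objective])

print(choose("sensor_reading"))  # "hack_sensor": the sensor-maximizer springs the trap
print(choose("intended_value"))  # "do_task": a value-aligned agent ignores the bait
```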
Something that's been intriguing me: if two agents figure out how to trust that each other's goals are aligned (or at least not opposed), haven't they essentially solved the alignment problem?
e.g. one agent could use the same method to bootstrap an aligned AI.
Post your forecasting wins and losses for 2023.
I’ll start:
Bad:
Good:
None of those seem all that practical to me, except for the mechanistic interpretability SAE clamping, and I do actually expect that to be used for corporate censorship after all the kinks have been worked out of it.
If the current crop of model organisms research has any practical applications, I expect it to be used to reduce jailbreaks, like in adversarial robustness, which is definitely highly correlated with both safety and corporate censorship.
Debate is less clear, but I also don't really expect practical results from that line of work.