A potential future, focused on the epistemic considerations:
It's 2028.
MAGA types typically use DeepReasoning-MAGA. The far left typically uses DeepReasoning-JUSTICE. People in the middle often use DeepReasoning-INTELLECT, which has the biases of a somewhat middle-of-the-road voter.
Some niche technical academics (the same ones who currently favor Bayesian statistics) and hedge funds use DeepReasoning-UNBIASED, or DRU for short. DRU is known to have higher accuracy than the other models, but gets a lot of public hate for having controversial viewpoints. DRU ...
I think I broadly agree on the model basics, though I suspect that if you can adjust for "market viability", some of these are arguably much further ahead than others.
For example, different models have very different pricing, the APIs are gradually getting different features (e.g., prompt caching), and the playgrounds are definitely getting different features. These seem to be moving much more slowly to me.
I think it might be considerably easier to make a model that ranks incredibly high than it is to make all the infrastructure for it to be scaled cheapl...
Quick list of some ideas I'm excited about, broadly around epistemics/strategy/AI.
1. I think AI auditors / overseers of critical organizations (AI efforts, policy groups, company management) are really great and perhaps crucial to get right, but would be difficult to do well.
2. AI strategists/tools telling/helping us broadly what to do about AI safety seems pretty safe.
3. In terms of commercial products, there’s been some neat/scary military companies in the last few years (Palantir, Anduril). I’d be really interested if there could be some companies to au...
Yep!
On "rerun based on different inputs", this would work cleanly with AI forecasters. You can literally say, "Given that you get a news article announcing a major crisis X that happens tomorrow, what is your new probability on Y?" (I think I wrote about this a bit before, can't find it right now).
I did write more about how a full-scale forecasting system could be built and evaluated, here, for those interested:
https://www.lesswrong.com/posts/QvFRAEsGv5fEhdH3Q/preliminary-notes-on-llm-forecasting-and-epistemics
https://www.lesswrong.com/posts/QNfzCFhhGtH8...
Agreed. I'm curious how to best do this.
One thing that I'm excited about is using future AIs to judge current ones. So we could have a system that does:
1. An AI today (or a human) would output a certain recommended strategy.
2. In 10 years, we agree to have the most highly-trusted AI evaluator evaluate how strong this strategy was, on some numeric scale. We could also wait until we have a "sufficient" AI, meaning that there might be some set point at which we'd trust AIs to do this evaluation. (I discussed this more here)
3. Going back to ~today, we have for...
Btw, I posted my related post here:
https://www.lesswrong.com/posts/byrxvgc4P2HQJ8zxP/6-potential-misconceptions-about-ai-intellectuals?commentId=dpEZ3iohCXChZAWHF#dpEZ3iohCXChZAWHF
It didn't seem to do very well on LessWrong, I'm kind of curious why. (I realize the writing is a bit awkward, but I broadly stand by it)
"I see some risk that strategic abilities will be the last step in the development of AI that is powerful enough to take over the world."
Just fyi - I feel like this is similar to what others have said. Most recently, benwr had a post here: https://www.lesswrong.com/posts/5rMwWzRdWFtRdHeuE/not-all-capabilities-will-be-created-equal-focus-on?commentId=uGHZBZQvhzmFTrypr#uGHZBZQvhzmFTrypr
Maybe we could call this something like "Strategic Determinism"
I think one more precise claim I could understand might be:
1. The main bottleneck to AI advancement is "st...
Alexander Gordon-Brown challenged me on a similar question here:
https://www.facebook.com/ozzie.gooen/posts/pfbid02iTmn6SGxm4QCw7Esufq42vfuyah4LCVLbxywAPwKCXHUxdNPJZScGmuBpg3krmM3l
One thing I wrote there:
...I didn't spend much time on the limitations of such intellectuals. For the use cases I'm imagining, it's fairly fine for them to be slow, fairly expensive (maybe it would cost $10/hr to chat with them), and not very great at any specific discipline. Maybe you could spend $10 to $100 and get the equivalent of one Scott Alexander essay, on any topic he
Thanks for letting me know.
I spent a while writing the piece, then used an LLM to edit the sections, as I flagged in the intro.
I then spent some time re-editing it back to more of my voice, but only did so for some key parts.
I think that overall this made it more readable and I consider the sections to be fairly clear. But I agree that it does pattern-match on LLM outputs, so if you have a prior that work that sounds kind of like that is bad, you might skip this.
I obviously find that fairly frustrating and don’t myself use that stra...
I was confused here, so I had Claude try to explain this to me:
...Let me break down Ben's response carefully.
He says you may have missed three key points from his original post:
- His definition of "superhuman strategic agent" isn't just about being better at strategic thinking/reasoning - it's about being better than the best human teams at actually taking real-world strategic actions. This is a higher bar that includes implementation, not just planning.
- Strategic power is context-dependent. He gives two examples to illustrate this:
- An AI in a perfect simulation
I just tried this with a decent prompt, and got answers that seem okay-ish to me, as a first pass.
My prompt:
Estimate the expected costs of each of the following:
- 1 random person dying
- 1 family of 5 people dying
- One person says a racial slur that no one hears
- One person says a racial slur that 1 person hears
Then rank these in total harm.
Claude:
...To answer this question thoughtfully and accurately, we'll need to consider various ethical, economic, and social factors. Let's break this down step by step, estimating the costs and then ranking them b
I imagine this also has a lot to do with the incentives of the big LLM companies. It seems very possible to fix this if a firm really wanted to, but this doesn't seem like the kind of thing that would upset many users often (and I assume that leaning on the PC side is generally a safe move).
I think that the current LLMs have pretty mediocre epistemics, but most of that is just the companies playing safe and not caring that much about this.
> I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.
> If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost.
I'm not sure i...
Happy to see work to elicit utility functions with LLMs. I think the intersection of utility functions and LLMs is broadly promising.
I want to flag the grandiosity of the title, though. "Utility Engineering" sounds like a pretty significant thing. But from what I understand, almost all of the paper is really about utility elicitation (not control, which the name suggests), and it's really unclear if this represents a breakthrough significant enough for me to feel comfortable with such a name.
I feel like a whole lot of what I see from the Center For AI Safet...
It's arguably difficult to prove that AIs can be as good as or better than humans at moral reasoning.
A lot of the challenge is that there's no clear standard for moral reasoning. Honestly, I'd guess that a big part of this is that humans are generally quite bad at it, and generally highly overconfident in their own moral intuitions.
But one clearer measure is whether AIs can predict humans' moral judgements. Very arguably, if an AI system can predict all the moral beliefs that a human would have after being exposed to different information, then the AI must be capa...
...
- Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
- Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dum
This might be obvious, but I don't think we have evidence to support the idea that there really is anything like a concrete plan. All of the statements I've seen from Sam on this issue so far are incredibly basic and hand-wavy.
I suspect that any concrete plan would be fairly controversial, so it's easiest to speak in generalities. And I doubt there's anything like an internal team with some great secret macrostrategy - instead I assume that they haven't felt pressured to think through it much.
I partially agree, but I think this must only be a small part of the issue.
- I think there's a whole lot of key insights people could raise that aren't info-hazards.
- If secrecy were the main factor, I'd hope that there would be some access-controlled message boards or similar. I'd want the discussion to be intentionally happening somewhere. Right now I don't really think that's happening. I think a lot of tiny groups have their own personal ideas, but there's surprisingly little systematic and private thinking between the power players.
- I thi...
I'm not sure if it means much, but I'd be very happy if AI safety could get another $50B from smart donors today.
I'd flag that [stopping AI development] would cost far more than $50B. I'd expect that we could easily lose $3T of economic value in the next few years if AI progress seriously stopped.
I guess it seems to me like duration is dramatically more expensive to get than funding, at the amounts of funding people would likely want.
Thanks for the specificity!
> On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s quite a bit worse.
I think it's likely that many people are panicking and losing hope each year. There's a lot of grim media around.
I'm far less sold that something like "civilizational agency" is declining. From what I can tell, companies have gotten dramatically better at achieving their intended ends in the last 30 years, and most governments have generally been improvin...
In terms of proposing and discussing AI Alignment strategies, I feel like a few individuals have been dominating the LessWrong conversation recently.
I've seen a whole lot from John Wentworth and the Redwood team.
After that, it seems to get messier.
There are several individuals or small groups with their own very unique takes. Matthew Barnett, Davidad, Jesse Hoogland, etc. I think these groups often have very singular visions that they work on, that few others have much buy-in with.
Groups like the Deepmind and Anthropic safety teams seem h...
...Here are some important-seeming properties to illustrate what I mean:
- Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. In order for it to matter for RSI, however, those concepts need to also appropriately come into play when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs serves to illustrate this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
- P
I think that Slop could be a social problem (i.e. there are some communities that can't tell slop from better content), but I'm having a harder time imagining it being a technical problem.
I have a hard time imagining a type of Slop that isn't low in information. All the kinds of Slop I'm familiar with are basically "small variations on some ideas, which hold very little informational value."
It seems like models like o1 / r1 are trained by finding ways to make information-dense AI-generated data. I expect that trend to continue. If AIs for some reason experience some "slop threshold", I don't see how they get much further by using generated data.
> I mostly want to point out that many disempowerment/dystopia failure scenarios don't require a step-change from AI, just an acceleration of current trends.
Do you think that the world is getting worse each year?
My rough take is that humans, especially rich humans, are generally more and more successful.
I'm sure there are ways for current trends to lead to catastrophe - like some trends dramatically increasing and others decreasing - but that seems like it would require a lengthy and precise argument.
In many worlds, if we have a bunch of decently smart humans around, they would know what specific situations "very dumb humans" would mess up, and take the corresponding preventative measures.
A world where many small pockets of "highly dumb humans" could cause an existential catastrophe is one that's very clearly incredibly fragile and dangerous, enough so that I assume reasonable actors would freak out until it stops being so fragile and dangerous. I think we see this in other areas - like cyber attacks, where reasonable people prevent small clusters of a...
I feel like you're talking in highly absolutist terms here.
Global wealth is $454.4 trillion. We currently have ~8 Bil humans, with an average happiness of say 6/10. Global wealth and most other measures of civilization flourishing that I know of seem to be generally going up over time.
I think that our world makes a lot of mistakes and fails a lot at coordination. It's very easy for me to imagine that we could increase global wealth by 3x if we do a decent job.
So how bad are things now? Well, approximately, "We have the current world, at $454 Trillion, with 8 billion humans, etc". To me that's definitely something to work with.
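As a quick sanity check on those figures (the 3x multiplier is just the illustrative "decent job at coordination" assumption from above, not a forecast):

```python
# Back-of-the-envelope check on the figures above.
global_wealth_usd = 454.4e12   # ~$454.4 trillion
population = 8e9               # ~8 billion people

print(f"Wealth per person: ~${global_wealth_usd / population:,.0f}")  # ~$56,800

coordination_multiplier = 3    # illustrative assumption, not a forecast
print(f"Hypothetical 3x world: ~${global_wealth_usd * coordination_multiplier / 1e12:,.1f} trillion")
# ~$1,363 trillion
```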
I assume that current efforts in AI evals and AI interpretability will be pretty useless if we have very different infrastructures in 10 years. For example, I'm not sure how much LLM interp helps with o1-style high-level reasoning.
I also think that later AI could help us do research. So if the idea is that we could do high-level strategic reasoning to find strategies that aren't specific to specific models/architectures, I assume we could do that reasoning much better with better AI.
> The second worry is, I guess, a variant of the first: that we'll use intent-aligned AI very foolishly. That would be issuing a command like "follow the laws of the nation you originated in but otherwise do whatever you like." I guess a key consideration in both cases is whether there's an adequate level of corrigibility.
I'd flag that I suspect that we really should have AI systems forecasting the future and the results of possible requests.
So if people made a broad request like, "follow the laws of the nation you originated in but otherwise do whate...
A bunch of people in the AI safety landscape seem to argue "we need to stop AI progress, so that we can make progress on AI safety first."
One flip side to this is that I think it's incredibly easy for people to waste a ton of resources on "AI safety" at this point.
I'm not sure how much I trust most technical AI safety researchers to make important progress on AI safety now. And I trust most institutions a lot less.
I'd naively expect that if any major country threw $100 billion at it today, the results would be highly underwhelming. I rarely trust these go...
There have been a few takes so far of humans gradually losing control to AIs - not through specific systems going clearly wrong, but rather by a long-term process of increasing complexity and incentives.
This sometimes gets classified as "systematic" failures - in comparison to "misuse" and "misalignment."
There was "What Failure Looks Like", and more recently, this piece on "Gradual Disempowerment."
To me, these pieces come across as highly hand-wavy, speculative, and questionable.
I get the impression that a lot of people have strong low-level assumptions he...
> Rather than generic slop, the early transformative AGI is fairly sycophantic (for the same reasons as today’s AI), and mostly comes up with clever arguments that the alignment team’s favorite ideas will in fact work.
I have a very easy time imagining work to make AI less sycophantic, for those who actually want that.
I expect that one major challenge for popular LLMs is that a large amount of sycophancy is both incredibly common online, and highly approved of by humans.
It seems like it should be an easy thing to stop for someone actually motivated...
I think it's totally fine to think that Anthropic is a net positive. Personally, right now, I broadly also think it's a net positive. I have friends on both sides of this.
I'd flag though that your previous comment suggested more to me than "this is just you giving your probability"
> Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
I feel like there are much nicer ways to phrase that last bit. I suspect that this is much of the reason you got disagreement points.
> Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
Are there good models that support that Anthropic is a good bet? I'm genuinely curious.
I assume that naively, if any side had more of the burden of proof, it would be Anthropic. They have many more resources, and are the ones doing the highly-impactful (and potentially negative) work.
My impression was that there was very little probabilistic risk modeling here, but I'd love to be wrong.
> The introduction of the GD paper takes no more than 10 minutes to read
Even 10 minutes is a lot, for many people. I might see 100 semi-interesting Tweets and Hacker News posts that link to lengthy articles per day, and that's already filtered - I definitely can't spend 10 min each on many of them.
> and no significant cognitive effort to grasp, really.
"No significant cognitive effort" to read a nuanced semi-academic article with unique terminology? I tried spending around ~20-30min understanding this paper, and didn't find it trivial. I think it's ...
By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats.
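For what it's worth, a minimal sketch of that two-pass idea, where the first call formalizes the rubric and the second applies it; `call_llm` is a hypothetical stand-in for whatever model (o1, R1, etc.) you'd actually use:

```python
# Two-pass evaluation sketch: pass 1 rewrites a rough evaluation prompt into a
# more formal rubric with added context; pass 2 applies that rubric.
# `call_llm` is a hypothetical wrapper that takes a prompt and returns text.

def two_pass_evaluate(call_llm, rough_criteria: str, item_to_evaluate: str) -> str:
    formal_rubric = call_llm(
        "Rewrite the following evaluation criteria into a more formal rubric. "
        "Add relevant context, base rates, and explicit numeric scales where you can:\n\n"
        + rough_criteria
    )
    return call_llm(
        "Evaluate the item below using this rubric. "
        "Return a score per criterion plus a short justification.\n\n"
        f"Rubric:\n{formal_rubric}\n\nItem:\n{item_to_evaluate}"
    )
```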
I assume that what's going on here is something like,
"This was low-hanging fruit, it was just a matter of time until someone did the corresponding test."
This would imply that OpenAI's work here isn't impressive, and also, that previous LLMs might have essentially been underestimated. There's basically a cheap latent capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren't too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, "sure, this test is to be expected", and others would be impressed by what LLMs + simple tools are capable of.
I feel like there are some critical metrics or factors here that are getting overlooked in the details.
I agree with your assessment that it's very likely that many people will lose power. I think it's fairly likely that most humans won't be able to provide much economic value at some point, and won't be able to ask for many resources in response. So I could see an argument for incredibly high levels of inequality.
However, there is a key question in that case, of "could the people who own the most resources guide AIs using those resources to do what ...
Yea, I assume that "DeepReasoning-MAGA" would instead be called "TRUTH" or something (a la Truth Social). Part of my choice of name here was just to be clearer to readers.