A potential future, focused on the epistemic considerations:
It's 2028.
MAGA types typically use DeepReasoning-MAGA. The far left typically uses DeepReasoning-JUSTICE. People in the middle often use DeepReasoning-INTELLECT, which has the biases of a somewhat middle-of-the-road voter.
Some niche technical academics (the same ones who currently favor Bayesian statistics) and hedge funds use DeepReasoning-UNBIASED, or DRU for short. DRU is known to have higher accuracy than the other models, but gets a lot of public hate for having controversial viewpoints. DRU ...
I think I broadly agree on the model basics, though I suspect that if you can adjust for "market viability", some of these are arguably much further ahead than others.
For example, different models have very different pricing, the APIs are gradually getting different features (e.g., prompt caching), and the playgrounds are definitely getting different features. These seem to be moving much more slowly to me.
I think it might be considerably easier to make a model that ranks incredibly high than it is to make all the infrastructure for it to be scaled cheapl...
Quick list of some ideas I'm excited about, broadly around epistemics/strategy/AI.
1. I think AI auditors / overseers of critical organizations (AI efforts, policy groups, company management) are really great and perhaps crucial to get right, but would be difficult to do well.
2. AI strategists/tools telling/helping us broadly what to do about AI safety seems pretty safe.
3. In terms of commercial products, there’s been some neat/scary military companies in the last few years (Palantir, Anduril). I’d be really interested if there could be some companies to au...
Yep!
On "rerun based on different inputs", this would work cleanly with AI forecasters. You can literally say, "Given that you get a news article announcing a major crisis X that happens tomorrow, what is your new probability on Y?" (I think I wrote about this a bit before, can't find it right now).
I did write more about how a full-scale forecasting system could be built and evaluated, here, for those interested:
https://www.lesswrong.com/posts/QvFRAEsGv5fEhdH3Q/preliminary-notes-on-llm-forecasting-and-epistemics
https://www.lesswrong.com/posts/QNfzCFhhGtH8...
Agreed. I'm curious how to best do this.
One thing that I'm excited about is using future AIs to judge current ones. So we could have a system that does:
1. An AI today (or a human) would output a certain recommended strategy.
2. In 10 years, we agree to have the most highly-trusted AI evaluator evaluate how strong this strategy was, on some numeric scale. We could also wait until we have a "sufficient" AI, meaning that there might be some set point at which we'd trust AIs to do this evaluation. (I discussed this more here)
3. Going back to ~today, we have for...
Btw, I posted my related post here:
https://www.lesswrong.com/posts/byrxvgc4P2HQJ8zxP/6-potential-misconceptions-about-ai-intellectuals?commentId=dpEZ3iohCXChZAWHF#dpEZ3iohCXChZAWHF
It didn't seem to do very well on LessWrong, I'm kind of curious why. (I realize the writing is a bit awkward, but I broadly stand by it)
"I see some risk that strategic abilities will be the last step in the development of AI that is powerful enough to take over the world."
Just fyi - I feel like this is similar to what others have said. Most recently, benwr had a post here: https://www.lesswrong.com/posts/5rMwWzRdWFtRdHeuE/not-all-capabilities-will-be-created-equal-focus-on?commentId=uGHZBZQvhzmFTrypr#uGHZBZQvhzmFTrypr
Maybe we could call this something like "Strategic Determinism"
I think one more precise claim I could understand might be:
1. The main bottleneck to AI advancement is "st...
Alexander Gordon-Brown challenged me on a similar question here:
https://www.facebook.com/ozzie.gooen/posts/pfbid02iTmn6SGxm4QCw7Esufq42vfuyah4LCVLbxywAPwKCXHUxdNPJZScGmuBpg3krmM3l
One thing I wrote there:
...I didn't spend much time on the limitations of such intellectuals. For the use cases I'm imagining, it's fairly fine for them to be slow, fairly expensive (maybe it would cost $10/hr to chat with them), and not very great at any specific discipline. Maybe you could spend $10 to $100 and get the equivalent of one Scott Alexander essay, on any topic he
Thanks for letting me know.
I spent a while writing the piece, then used an LLM to edit the sections, as I flagged in the intro.
I then spent some time re-editing it back to more of my voice, but only did so for some key parts.
I think that overall this made it more readable and I consider the sections to be fairly clear. But I agree that it does pattern-match on LLM outputs, so if you have a prior that work that sounds kind of like that is bad, you might skip this.
I obviously find that fairly frustrating and don’t myself use that stra...
I was confused here, so I had Claude try to explain this to me:
...Let me break down Ben's response carefully.
He says you may have missed three key points from his original post:
- His definition of "superhuman strategic agent" isn't just about being better at strategic thinking/reasoning - it's about being better than the best human teams at actually taking real-world strategic actions. This is a higher bar that includes implementation, not just planning.
- Strategic power is context-dependent. He gives two examples to illustrate this:
- An AI in a perfect simulation
I just tried this with a decent prompt, and got answers that seem okay-ish to me, as a first pass.
My prompt:
Estimate the expected costs of each of the following:
- 1 random person dying
- 1 family of 5 people dying
- One person says a racial slur that no one hears
- One person says a racial slur that 1 person hears
Then rank these in total harm.
Claude:
...To answer this question thoughtfully and accurately, we'll need to consider various ethical, economic, and social factors. Let's break this down step by step, estimating the costs and then ranking them b
I imagine this also has a lot to do with the incentives of the big LLM companies. It seems very possible to fix this if a firm really wanted to, but this doesn't seem like the kind of thing that would upset many users often (and I assume that leaning on the PC side is generally a safe move).
I think that the current LLMs have pretty mediocre epistemics, but most of that is just the companies playing safe and not caring that much about this.
> I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.
> If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost.
I'm not sure i...
Happy to see work to elicit utility functions with LLMs. I think the intersection of utility functions and LLMs is broadly promising.
I want to flag the grandiosity of the title, though. "Utility Engineering" sounds like a pretty significant thing. But from what I understand, almost all of the paper is really about utility elicitation (not control, which the name suggests), and it's really unclear if this represents a breakthrough significant enough for me to feel comfortable with such a name.
I feel like a whole lot of what I see from the Center For AI Safet...
It's arguably difficult to prove that AIs can be as good as or better than humans at moral reasoning.
A lot of the challenge is that there's no clear standard for moral reasoning. Honestly, I'd guess that a big part of this is that humans are generally quite bad at it, and generally highly overconfident in their own moral intuitions.
But one clearer measure is whether AIs can predict humans' moral judgements. Very arguably, if an AI system can predict all the moral beliefs that a human would have after being exposed to different information, then the AI must be capa...
...
- Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
- Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dum
This might be obvious, but I don't think we have evidence to support the idea that there really is anything like a concrete plan. All of the statements I've seen from Sam on this issue so far are incredibly basic and hand-wavy.
I suspect that any concrete plan would be fairly controversial, so it's easiest to speak in generalities. And I doubt there's anything like an internal team with some great secret macrostrategy - instead I assume that they haven't felt pressured to think through it much.
I partially agree, but I think this must only be a small part of the issue.
- I think there's a whole lot of key insights people could raise that aren't info-hazards.
- If secrecy were the main factor, I'd hope that there would be some access-controlled message boards or similar. I'd want the discussion to be intentionally happening somewhere. Right now I don't really think that's happening. I think a lot of tiny groups have their own personal ideas, but there's surprisingly little systematic and private thinking between the power players.
- I thi...
I'm not sure if it means much, but I'd be very happy if AI safety could get another $50B from smart donors today.
I'd flag that [stopping AI development] would cost far more than $50B. I'd expect that we could easily lose $3T of economic value in the next few years if AI progress seriously stopped.
I guess it seems to me like duration is dramatically more expensive to get than funding, at the amounts of funding people would likely want.
Thanks for the specificity!
> On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s quite a bit worse.
I think it's likely that many people are panicking and losing hope each year. There's a lot of grim media around.
I'm far less sold that something like "civilizational agency" is declining. From what I can tell, companies have gotten dramatically better at achieving their intended ends in the last 30 years, and most governments have generally been improvin...
In terms of proposing and discussing AI Alignment strategies, I feel like a few individuals have been dominating the LessWrong conversation recently.
I've seen a whole lot from John Wentworth and the Redwood team.
After that, it seems to get messier.
There are several individuals or small groups with their own very unique takes. Matthew Barnett, Davidad, Jesse Hoogland, etc. I think these groups often have very singular visions that they work on, that few others have much buy-in with.
Groups like the Deepmind and Anthropic safety teams seem h...
...Here are some important-seeming properties to illustrate what I mean:
- Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. In order for it to matter for RSI, however, those concepts need to also appropriately come into play when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs serves to illustrate this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
- P
I think that Slop could be a social problem (i.e. there are some communities that can't tell slop from better content), but I'm having a harder time imagining it being a technical problem.
I have a hard time imagining a type of Slop that isn't low in information. All the kinds of Slop I'm familiar with are basically "small variations on some ideas, which hold very little informational value."
It seems like models like o1 / r1 are trained by finding ways to make information-dense AI-generated data. I expect that trend to continue. If AIs for some reason experience some "slop threshold", I don't see how they get much further by using generated data.
> I mostly want to point out that many disempowerment/dystopia failure scenarios don't require a step-change from AI, just an acceleration of current trends.
Do you think that the world is getting worse each year?
My rough take is that humans, especially rich humans, are generally more and more successful.
I'm sure there are ways for current trends to lead to catastrophe - like some trends dramatically increasing and others decreasing - but that seems like it would require a lengthy and precise argument.
In many worlds, if we have a bunch of decently smart humans around, they would know what specific situations "very dumb humans" would mess up, and take the corresponding preventative measures.
A world where many small pockets of "highly dumb humans" could cause an existential catastrophe is one that's very clearly incredibly fragile and dangerous, enough so that I assume reasonable actors would freak out until it stops being so fragile and dangerous. I think we see this in other areas - like cyber attacks, where reasonable people prevent small clusters of a...
I feel like you're talking in highly absolutist terms here.
Global wealth is $454.4 trillion. We currently have ~8 Bil humans, with an average happiness of say 6/10. Global wealth and most other measures of civilization flourishing that I know of seem to be generally going up over time.
I think that our world makes a lot of mistakes and fails a lot at coordination. It's very easy for me to imagine that we could increase global wealth by 3x if we do a decent job.
So how bad are things now? Well, approximately, "We have the current world, at $454 Trillion, with 8 billion humans, etc". To me that's definitely something to work with.
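As a quick sanity check on those figures (the 3x multiplier is just the illustrative "decent job at coordination" assumption from above, not a forecast):

```python
# Back-of-the-envelope check on the figures above.
global_wealth_usd = 454.4e12   # ~$454.4 trillion
population = 8e9               # ~8 billion people

print(f"Wealth per person: ~${global_wealth_usd / population:,.0f}")  # ~$56,800

coordination_multiplier = 3    # illustrative assumption, not a forecast
print(f"Hypothetical 3x world: ~${global_wealth_usd * coordination_multiplier / 1e12:,.1f} trillion")
# ~$1,363 trillion
```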
I assume that current efforts in AI evals and AI interpretability will be pretty useless if we have very different infrastructures in 10 years. For example, I'm not sure how much LLM interp helps with o1-style high-level reasoning.
I also think that later AI could help us do research. So if the idea is that we could do high-level strategic reasoning to find strategies that aren't specific to specific models/architectures, I assume we could do that reasoning much better with better AI.
> The second worry is, I guess, a variant of the first: that we'll use intent-aligned AI very foolishly. That would be issuing a command like "follow the laws of the nation you originated in but otherwise do whatever you like." I guess a key consideration in both cases is whether there's an adequate level of corrigibility.
I'd flag that I suspect that we really should have AI systems forecasting the future and the results of possible requests.
So if people made a broad request like, "follow the laws of the nation you originated in but otherwise do whate...
A bunch of people in the AI safety landscape seem to argue "we need to stop AI progress, so that we can make progress on AI safety first."
One flip side to this is that I think it's incredibly easy for people to waste a ton of resources on "AI safety" at this point.
I'm not sure how much I trust most technical AI safety researchers to make important progress on AI safety now. And I trust most institutions a lot less.
I'd naively expect that if any major country threw $100 billion at it today, the results would be highly underwhelming. I rarely trust these go...
There have been a few takes so far of humans gradually losing control to AIs - not through specific systems going clearly wrong, but rather by a long-term process of increasing complexity and incentives.
This sometimes gets classified as "systematic" failures - in comparison to "misuse" and "misalignment."
There was "What Failure Looks Like", and more recently, this piece on "Gradual Disempowerment."
To me, these pieces come across as highly hand-wavy, speculative, and questionable.
I get the impression that a lot of people have strong low-level assumptions he...
> Rather than generic slop, the early transformative AGI is fairly sycophantic (for the same reasons as today’s AI), and mostly comes up with clever arguments that the alignment team’s favorite ideas will in fact work.
I have a very easy time imagining work to make AI less sycophantic, for those who actually want that.
I expect that one major challenge for popular LLMs is that a large amount of sycophancy is both incredibly common online, and highly approved of by humans.
It seems like it should be an easy thing to stop for someone actually motivated...
I think it's totally fine to think that Anthropic is a net positive. Personally, right now, I broadly also think it's a net positive. I have friends on both sides of this.
I'd flag though that your previous comment suggested more to me than "this is just you giving your probability"
> Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
I feel like there are much nicer ways to phrase that last bit. I suspect that this is much of the reason you got disagreement points.
> Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
Are there good models that support that Anthropic is a good bet? I'm genuinely curious.
I assume that naively, if any side had more of the burden of proof, it would be Anthropic. They have many more resources, and are the ones doing the highly-impactful (and potentially negative) work.
My impression was that there was very little probabilistic risk modeling here, but I'd love to be wrong.
> The introduction of the GD paper takes no more than 10 minutes to read
Even 10 minutes is a lot, for many people. I might see 100 semi-interesting Tweets and Hacker News posts that link to lengthy articles per day, and that's already filtered - I definitely can't spend 10 min each on many of them.
> and no significant cognitive effort to grasp, really.
"No significant cognitive effort" to read a nuanced semi-academic article with unique terminology? I tried spending around ~20-30min understanding this paper, and didn't find it trivial. I think it's ...
By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats.
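For what it's worth, a minimal sketch of that two-pass idea, where the first call formalizes the rubric and the second applies it; `call_llm` is a hypothetical stand-in for whatever model (o1, R1, etc.) you'd actually use:

```python
# Two-pass evaluation sketch: pass 1 rewrites a rough evaluation prompt into a
# more formal rubric with added context; pass 2 applies that rubric.
# `call_llm` is a hypothetical wrapper that takes a prompt and returns text.

def two_pass_evaluate(call_llm, rough_criteria: str, item_to_evaluate: str) -> str:
    formal_rubric = call_llm(
        "Rewrite the following evaluation criteria into a more formal rubric. "
        "Add relevant context, base rates, and explicit numeric scales where you can:\n\n"
        + rough_criteria
    )
    return call_llm(
        "Evaluate the item below using this rubric. "
        "Return a score per criterion plus a short justification.\n\n"
        f"Rubric:\n{formal_rubric}\n\nItem:\n{item_to_evaluate}"
    )
```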
I assume that what's going on here is something like,
"This was low-hanging fruit, it was just a matter of time until someone did the corresponding test."
This would imply that OpenAI's work here isn't impressive, and also, that previous LLMs might have essentially been underestimated. There's basically a cheap latent capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren't too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, "sure, this test is to be expected", and others would be impressed by what LLMs + simple tools are capable of.
I feel like there are some critical metrics or factors here that are getting overlooked in the details.
I agree with your assessment that it's very likely that many people will lose power. I think it's fairly likely that most humans won't be able to provide much economic value at some point, and won't be able to ask for many resources in response. So I could see an argument for incredibly high levels of inequality.
However, there is a key question in that case, of "could the people who own the most resources guide AIs using those resources to do what ...
Yea, I assume that "DeepReasoning-MAGA" would instead be called "TRUTH" or something (a la Truth Social). Part of my choice of name here was just to be clearer to readers.