LESSWRONG
LW

All of Raemon's Comments + Replies

Self-fulfilling misalignment data might be poisoning our AI models

I think I understood your article, and was describing which points/implications seemed important.

I think we probably agree on predictions for nearterm models (i.e. that including this training data makes it more likely for them to deceive), I just don't think it matters very much if sub-human-intelligence AIs deceive.

Navigation by Moonlight

Raemon9d62

FYI I think there are a set of cues that move you from ‘pretty unlikely to be interested’ to ‘maybe interested’, but not that get you above like 25% likely.

VDT: a solution to decision theory

Raemon10d20

FYI this was an April Fools joke.

1refaelb10d

Understood. Just there's a point of criticizing the rigid categories of decision theory, which is shown by showing the common sense in between. This is my question. That's the idea of the quote at the end from Berlin; but one of the premises of ai research is that thinking is reducible to formula and categories.

AI 2027: What Superintelligence Looks Like

Raemon11d90

Curated. I've been following this project for awhile (you can see some of the earlier process in Daniel's review of his "What 2026 looks like" post, and on his comment on Tom Davidon's What a Compute-centric framework says about AI takeoff). I've participated in one of the wargames that helped inform what sort of non-obvious things might happen along the path of AI takeoff.

(disclosure, Lightcone did a lot of work on the website of this project, although I was only briefly involved)

Like others have said, I appreciate this for both having a lot of research b... (read more)

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"

Raemon13d31

Yeah I introduced Baba is You more as a counterbalance to empiricism-leaning fields. I think ‘practice forms of thinking that don’t come natural to you’ is generally valuable so you don’t get in ruts.

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"

Raemon15d20

I’m curious how it went in terms of ‘do you think you learned anything useful?’

4jenn14d

It felt productive. My day job (running a policy nonprofit) involves a lot of vibes/reacting to Current Thing and not a great deal of rigorously solving hard problems, and the exercise usefully... crystallized? a vague, vibes-based framework that I follow when I do strategy planning - set timers, generate plans, set probabilities, check surprise, go meta, iterate, etc. It's nice to have that operationalized!

The Ghibli Event: Our First Glimpse of Reality Transfer

Raemon16d62

I do think the thing you describe here is great. I think I hadn't actually tried really leveraging the current zeitgeist to actively get better at it, and it does seem like a skill you could improve at and that seems cool.

But I'd bet it's not what was happening for most people. I think the value-transfer is somewhat automatic, but most people won't actually be attuned to it enough. (might be neat to operationalize some kind of bet about this, if you disagree).

I do think it's plausible, if people put more deliberate effort it, to create a zeitgeist where the value transfer is more real for more people.

2BazingaBoy16d

You’re likely right – my ability to mentally apply the “Miyazaki goggles” and feel the value shift is probably not what’s happening for most people, or even many. For me, it’s probably a combination of factors: my background working extensively with images, the conceptual pathways formed during writing the original post above, and preexisting familiarity with the aesthetic from Nausicaä of the Valley of the Wind, Castle in the Sky, Kiki’s Delivery Service, Princess Mononoke, Spirited Away, Howl's Moving Castle, Tales from Earthsea, Ponyo, and Arrietty. But crucially, I share your optimism about the potential. I do think this is a skill others could cultivate with deliberate practice. Now that we've seen this kind of reality transfer is possible, perhaps methods and best practices could eventually be developed and tested to guide that learning.

The Ghibli Event: Our First Glimpse of Reality Transfer

Raemon17d*336

A thing that gave me creeping horror about the Ghiblification is that the I don't think the masses actually particularly understand Ghibli. And the result is an uneven simulacrum-mask that gives the impression of "rendered with love and care" without actually being so.

The Ghibli aesthetic is historically pretty valuable to me, and in particular important as a counterbalanacing force against "the things I expect to happen by default with AI."

Some things I like about Ghibli:

The "cinematic lens" emphasizes a kind of "see everything with wonder and

... (read more)

4BazingaBoy16d

I don’t share your concerns about simulacra or cheapening, because in this case, the style is the substance. It’s not just a cosmetic overlay; it fundamentally alters how we perceive and emotionally engage with a scene. And at any rate, the Ghibli aesthetic is too coherent, too complete in its internal logic, to be diminished by misuse or overuse. People can wear it wrong, but they can’t break it. What’s especially interesting to me right now is that I’ve gained the ability you refer to as “Miyazaki goggles.” Today, for example, I was repeatedly able to briefly summon that warm, quiet beauty while looking at my environment. And when I was with a close relative who seemed slightly frail, the moment I mentally applied the Ghibli filter, I instantly teared up and had a huge emotional reaction. A minute later I tried again, and the same thing happened. Repeated exposure to the reality transfer seems to teach you a new language, one that lets you do new things. After seeing so many A-to-B examples of Ghiblification, I have learned a heuristic for what photorealism could feel like under that lens, and can now easily switch to it. It’s not that I vividly visualize everything in Ghibli style, but I do vividly experience the value shift it brings. At most I might see Ghibli very faintly superimposed, abstractly even, but I can predict the vectors of what would change, and those shifts immediately alter my emotional reading of the scene. So perhaps over time, the Ghibli reality transfer will help us become more sensitive, appreciative, compassionate and easily able to expand our circle of concern. One caveat: I work with images constantly and have for a long time, so I might already have been more adept at mental visual transformation than most people. Related to this idea of “learning a new language that lets you do new things,” I’ve also been wanting to share something cool I trained myself to do: I wore an eyepatch over one eye and just went about daily life like that,

New Cause Area Proposal

Raemon17d200

Seems at odds with longhairism.

Good Research Takes are Not Sufficient for Good Strategic Takes

Raemon19d72

Curated. I think this is a pretty important point. I appreciate Neel's willigness to use himself as an example.

I do think this leaves us with the important followup questions of "okay, but, how actually DO we evaluate strategic takes?". A lot of people who are in a position to have demonstrated some kind of strategic awareness are people who are also some kind of "player" on the gameboard with an agenda, which means you can't necessarily take their statements at face value as an epistemic claim.

3Neel Nanda19d

Thanks! Yeah, I don't have a great answer to this one. I'm mostly trying to convey the spirit of: we're all quite confused, and the people who seem competent disagree a lot, so they can't actually be that correct. And given that the ground truth is confusion, it is epistemically healthier to be aware of this. Actually solving these problems is way harder! I haven't found a much better substitute than looking at people who have a good non-trivial track record of predictions, and people who have what to me seem like coherent models of the world that make legitimate and correct seeming predictions. Though the latter one is fuzzier and has a lot more false positives. A particularly salient form of a good track record is people who had positions in domains I know well (eg interpretability) that I previously thought were wrong/ridiculous, but who I later decided were right (eg I give Buck decent points here, and also a fair amount of points to Chris Olah)

Third-wave AI safety needs sociopolitical thinking

Raemon21d1713

I think I agree with a lot of stuff here but don't find this post itself particularly compelling for the point.

I also don't think "be virtuous" is really sufficient to know "what to actually do." It matters a lot which virtues. Like I think environmentalism's problems wasn't "insufficiently virtue-ethics oriented", it's problem was that it didn't have some particular virtues that were important.

Policy for LLM Writing on LessWrong

Raemon22d52

Or: when the current policy stops making sense, we can figure out a new policy.

In particular, when the current policy stops making sense, AI moderation tools may also be more powerful and can enable a wider range of policies.

Policy for LLM Writing on LessWrong

Raemon23d30

I mean, the sanctions are ‘if we think your content looks LLM generated, we’ll reject it and/or give a warning and/or eventually delete or ban.’ We do this for several users a day.

That may get harder someday but it’s certainly not unenforceable now.

1Anders Lindström22d

Yes, but as I wrote in the answer to habryka (see below), I am not talking about the present moment. I am concerned with the (near) future. With the break neck speed at which AI is moving it wont be long until it will be hopeless to figure out if its AI generated or not. So my point and rhetorical question is this: AI is not going to go away. Everyone(!) will use it, all day every day. So instead of trying to come up with arbitrary formulas for how much AI generated content a post can or cannot contain, how can we use AI to the absolute limit to increase the quality of posts and make Lesswrong even better than it already is?!

Policy for LLM Writing on LessWrong

Raemon23d25

I agree it'll get harder to validate, but I think having something like this policy is, like, a prerequisite (or at least helpful grounding) for the mindset change.

-1Anders Lindström23d

I understand the motif behind the policy change but its unenforceable and carry no sanctions. In 12-24 months I guess it will be very difficult (impossible) to detect AI spamming. The floodgates are open and you can only appeal to peoples willingness to have a real human to human conversation. But perhaps those conversations are not as interesting as talking to an AI? Those who seek peer validation for their cleverness will use all available tools in doing so no matter what policy there is.

AI for AI safety

Raemon24d6-1

Curated. I think figuring out whether and how we can apply AI to AI safety is one of the most important questions, and I like this post for exploring this through many more different angles than we'd historically seen.

A thing I both like and dislike about this post is that it's more focused on laying out the questions than giving answers. This makes it easier for me the post to "help me think it through myself" (rather than just telling me a "we should do X" style answer).

But it lays out a dizzying enough array of different concerns that I found it s... (read more)

Policy for LLM Writing on LessWrong

Raemon25d*2716

(note: This is Raemon's random take rather than considered Team Consensus)

Part of the question here is "what sort of engine is overall maintainable, from a moderation perspective?".

LLMs make it easy for tons of people to be submitting content to LessWrong without really checking whether it's true and relevant. It's not enough for a given piece to be true. It needs to be reliably true, with low cost to moderator attention.

Right now, basically LLMs don't produce anywhere near good enough content. So, presently, letting people submit AI generated content with... (read more)

Recent AI model progress feels mostly like bullshit

Raemon25d82

My lived experience is that AI-assisted-coding hasn't actually improved my workflow much since o1-preview, although other people I know have reported differently.

Raemon's Shortform

Raemon25d70

It seems like my workshops would generally work better if they were spaced out over 3 Saturdays, instead of crammed into 2.5 days in one weekend.

This would give people more time to try applying the skills in their day to day, and see what strategic problems they actually run into each week. Then on each Saturday, they could spend some time reviewing last week, thinking about what they want to get out of this workshop day, and then making a plan for next week.

My main hesitation is I kind of expect people to flake more when it's spread out over 3 weeks... (read more)

2Viliam24d

Are there many people who pay for 3 Saturdays and then skip one? I would be surprised. What age is the target group? An adult person can probably easily find 3 free Saturdays in a row. For a student living with parents it will probably be more difficult, because it means 3 weekends when the parents cannot organize any whole-weekend activity.

METR: Measuring AI Ability to Complete Long Tasks

Raemon26d147

Also, have you tracked the previous discussion on Old Scott Alexander and LessWrong about generally "mysterious straight lines" being a surprisingly common phenomenon in economics. i.e. On an old AI post Oli noted:

This is one of my major go-to examples of this really weird linear phenomenon:
150 years of a completely straight line! There were two world wars in there, the development of artificial fertilizer, the broad industrialization of society, the invention of the car. And all throughout the line just carries one, with no significant perturbations.

This ... (read more)

2testingthewaters26d

That surprisingly straight line reminds me of what happens when you use noise to regularise an otherwise decidedly non linear function: https://www.imaginary.org/snapshot/randomness-is-natural-an-introduction-to-regularisation-by-noise

Elizabeth's Shortform

Raemon26d2213

I think it's also "My Little Pony Fanfics are more cringe than Harry Potter fanfics, and there is something about the combo of My Little Pony and AIs taking over the world that is extra cringe."

2Tomás B.24d

it's good though! but yeah a bit cringe, as all the best things are.

5EmpressCelestia26d

I feel like it should just loop back around to being cool at that point but I guess that's not how it works.

Arguments about fast takeoff

Raemon26d20

I'm here from the future trying to decide how much to believe in and how common are Gods of Straight Lines, and curious if you could say more arguing about this.

3Vivek Hebbar8d

Sigmoid is usually what "straight line" should mean for a quantity bounded at 0 and 1. It's a straight line in logit-space, the most natural space which complies with that range restriction. (Just as exponentials are often the correct form of "straight line" for things that are required to be positive but have no ceiling in sight.)

Reframing AI Safety as a Neverending Institutional Challenge

Raemon1moΩ399

I do periodically think about this and feel kind of exhausted at the prospect, but it does seem pretty plausibly correct. Good to have a writeup of it.

It particularly seems likely to be the right mindset if you think survival right now depends on getting some kind of longish pause (at least on the sort of research that'd lead to RSI+takeoff)

Raemon's Shortform

Raemon1mo*133

Metastrategy = Cultivating good "luck surface area"?

Metastrategy: being good at looking at an arbitrary situation/problem, and figure out what your goals are, and what strategies/plans/tactics to employ in pursuit of those goals.

Luck Surface area: exposing yourself to a lot of situations where you are more likely to get valuable things in a not-very-predictable way. Being "good at cultivating luck surface area" means going to events/talking-to-people/consuming information that are more likely to give you random opportunities / new ways of thinking / new pa... (read more)

1Jonas Hallgren25d

I guess a point here might also be that luck involves non-linear effects that are hard to predict and so when you're optimising for luck you need to be very conscious about not only looking at results but rather holding a frame of playing poker or similar. So it is not something that your brain does normally and so it is a core skill of successful strategy and intellectual humility or something like that?

2Ruby1mo

"Serendipity" is a term I've been seen used for this, possibly was Venkatesh Rao.

Any mistakes in my understanding of Transformers?

Answer by RaemonMar 22, 202520

Pedagogic feedback: each diagram is much longer than a page, it's harder to fit the whole thing in my head at once.

Going Nova

Raemon1mo62

It’s unclear to me what the current evidence is for this happening ‘a lot’ and ‘them being called Nova specifically’. I don’t particularly doubt it but it seemed sort of asserted without much background.

Levels of Friction

Raemon1mo50

Curated. This concept seems like an important building block for designing incentive structures / societies, and this seems like a good comprehensive reference post for the concept.

One pager

Raemon1mo20

Note: it looks like you probably want this to be a markdown file. You can go to https://www.lesswrong.com/account, with the "site customizations" section, and click "activate Markdown" to enable the markdown editor.

1samuelshadrach1mo

Thanks this is super helpful! Edited.

Paper: Field-building and the epistemic culture of AI safety

Raemon1mo20

Fyi I think it’s time to do minor formatting adjustments to make papers/abstracts easier to read on LW

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo20

I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is "careful conceptual thinking might be required rather than pure naive empiricism (because we won't be given good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this" and the bailey is "extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed".

Yeah I agree that was happening somewhat. The connecting dots here are "in worlds where it turns out we need a long Philosophical Pa... (read more)

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo20

Thanks for laying this out thus far. I'mma reply but understand if you wanna leave the convo here . I would be interested in more effortpost/dialogue about your thoughts here.

Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on "extreme philosophical competence".)

This makes sense as a crux for the claim "we need philosophical competence to align u... (read more)

2ryan_greenblatt1mo

I was thinking of a slightly broader claim: "we need extreme philosophical competence". If I thought we had to use human labor to align wildly superhuman AIs, I would put much more weight on "extreme philosophical competence is needed". I agree that "we need philosophical competence to align any general, openended intelligence" isn't affected by the level of capability at handoff. I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is "careful conceptual thinking might be required rather than pure naive empiricism (because we won't have good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this thinking" and the bailey is "extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed". I buy the motte here, but not the bailey. I think the motte is a substantial discount on Anthropic from my perspective, but I'm kinda sympathetic to where they are coming from. (Getting conceptual stuff and futurism right is real hard! How would they know who to trust among people disagreeing wildly!) I don't think "does anthropic stop (at the right time)" is the majority of the relevance of careful conceptual thinking from my perspective. Probably more of it is "do they do a good job allocating their labor and safety research bets". This is because I don't think they'll have very much lead time if any (median -3 months) and takeoff will probably be slower than the amount of lead time if any, so pausing won't be as relevant. Correspondingly, pausing at the right time isn't the biggest deal relative to other factors, though it does seem very important at an absolute level.

Paper: Field-building and the epistemic culture of AI safety

Raemon1mo41

FYI I found this intro fairly hard to read – partly due to generally large blocks of text (see: Abstracts should be either Actually Short™, or broken into paragraphs) and also because it just... doesn't actually really say what the main point is, AFAICT. (It describes a bunch of stuff you do, but I had trouble finding the actual main takeaway, or primary sorts of new information I might get by reading it)

1peterslattery1mo

It's not my paper — I just wanted to share the link on LW as it seemed likely to be of interest to some people here. Thanks for the feedback, and sorry for the confusion!

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo20

I don't really see why this is a crux. I'm currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn't really change my strategic orientation. Maybe you're focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.

I think you kinda convinced me here this reasoning isn't (as stated) very persuasive.

I think my reasoning had some additional steps like:

when I'm

... (read more)

5ryan_greenblatt1mo

Not sure how interesting this is to discuss, but I don't think I agree with this. Stuff they're doing does seem harmful to worlds where you need a long pause, but feels like at the very least Anthropic is a small fraction of the torching right? Like if you think Anthropic is making this less likely, surely they are a small fraction of people pushing in this direction such that they aren't making this that much worse (and can probably still pivot later given what they've said so far).

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo82

I'm pretty skeptical of the "extreme philosophical competence" perspective. This is basically because we "just" need to be able to hand off to an AI which is seriously aligned (e.g., it faithfully pursues our interests on long open-ended and conceptually loaded tasks that are impossible for use to check).

The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.

Looking over the comments of ... (read more)

ryan_greenblatt1mo141

The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.

Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on "extreme philosophical competence".)

Looking over the comments of

Raemon1mo20

Thanks. I'll probably reply to different parts in different threads.

For the first bit:

My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a

... (read more)

2ryan_greenblatt1mo

Yep, just the obvious. (I'd say "much less bought in" than "isn't bought in", but whatever.) I don't really have dots I'm trying to connect here, but this feels more central to me than what you discuss. Like, I think "alignment might be really, really hard" (which you focus on) is less of the crux than "is misalignment that likely to be a serious problem at all?" in explaining. Another way to put this is that I think "is misalignment the biggest problem" is maybe more of the crux than "is misalignment going to be really, really hard to resolve in some worlds". I see why you went straight to your belief though.

AI Tools for Existential Security

Raemon1mo20

Do you have existing ones you recommend?

I'd been working on a keylogger / screenshot-parser that's optimized for a) playing nicely will LLMs while b) being unopinionated about what other tools you plug it into. (in my search for existing tools, I didn't find keyloggers that actually did the main thing I wanted, and the existing LLM-tools that did similar things were walled-garden-ecosystems that didn't give me much flexibility on what I did with the data)

3Gurkenglas1mo

just pipe /dev/input/* into a file

AI4Science: The Hidden Power of Neural Networks in Scientific Discovery

Raemon1mo40

Minor note but I found the opening section hard to read. See: Abstracts should be either Actually Short™, or broken into paragraphs

1Max Ma1mo

Thanks... will look into

johnswentworth's Shortform

Raemon1mo90

Were you by any chance writing in Cursor? I think they recently changed the UI such that it's easier to end up in "agent mode" where it sometimes randomly does stuff.

7johnswentworth1mo

Nope, we were in Overleaf. ... but also that's useful info, thanks.

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo42

I am kinda intrigued by how controversial this post seems (based on seeing the karma creep upwards and then back down over the past day). I am curious if the downvoters tend more like:

Anti-Anthropic-ish folk who think the post is way too charitable/soft on Anthropic
Pro-Anthropic-ish folk who think the post doesn't make very good/worthwhile arguments against Anthropic
"Alignment-is-real-hard" folks who think this post doesn't represent the arguments for that very well.
"other?"

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo42

I agree with this (and think it's good to periodically say all of this straightforwardly).

I don't know that it'll be particularly worth your time, but, the thing I was hoping for this post was to ratchet the conversation-re-anthropic forward in, like, "doublecrux-weighted-concreteness." (i.e. your arguments here are reasonably crux-and-concrete, but don't seem to be engaging much with the arguments in this post that seemed more novel and representative of where anthropic employees tend to be coming from, instead just repeated AFAICT your cached arguments a... (read more)

Elizabeth's Shortform

Raemon1mo22

Yeah I was staring at the poll and went "oh no." They aren't often actually used this way so it's not obviously right to special-case it, although maybe we do polls enough that we generally should present react in "first->last posted" rather than sorting by number

4Ben Pace1mo

I personally have it as a to-do to just build polls. (React if you want to express that you would likely use this.)

5Elizabeth1mo

While we're at it, can it be >99% to match <1%?

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo20

I don't particularly disagree with the first half, but your second sentence isn't really a crux for me for the first part.

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo20

I think (moderately likely, though not super confident) it makes more sense to model Dario as:

"a person who actually is quite worried about misuse, and is making significant strategic decisions around that (and doesn't believe alignment is that hard)"

than as "a generic CEO who's just generally following incentives and spinning narrative post-hoc rationalizations."

2Adam Scholl1mo

Yeah, I buy that he cares about misuse. But I wouldn't quite use the word "believe," personally, about his acting as though alignment is easy—I think if he had actual models or arguments suggesting that, he probably would have mentioned them by now.

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo80

I think... agree denotationally and (lean towards) disagreeing connotationally? (like, seems like this is implying "and because he doesn't seem like he obviously has coherent views on alignment-in-particular, it's not worth arguing the object level?")

(to be clear, I don't super expect this post to affect Dario's decisionmaking models, esp. directly. I do have at least some hope for Anthropic employees to engage with these sorts of models/arguments, and my sense from talking them is that a lot of the LW-flavored arguments have often missed their cruxes)

2Adam Scholl1mo

No, I agree it's worth arguing the object level. I just disagree that Dario seems to be "reasonably earnestly trying to do good things," and I think this object-level consideration seems relevant (e.g., insofar as you take Anthropic's safety strategy to rely on the good judgement of their staff).

Trojan Sky

Raemon1mo20

Also, the video you linked has a lot of additional opinionated features that I think are targeting a much more specific group than even "people who aren't put off by AI" - it would never show up on my youtube.

For frame of reference, do regular movie trailers normally show up in your youtube? This video seemed relatively "mainstream"-vibing to me, although somewhat limited by the medium.

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo42

I expect that AIs will be obedient when they initially become capable enough to convince governments that further AI development would be harmful (if it would in fact be harmful).

Seems like "the AIs are good enough at persuasion to persuade governments and someone is deploying them for that" is right when you need to be very high confidence they're obedient (and, don't have some kind of agenda). If they can persuade governments, they can also persuade you of things.

I also think it gets into a point where I'd sure feel way more comfortable if we had m... (read more)

2PeterMcCluskey1mo

I'm assuming that the AI can accomplish its goal by honestly informing governments. Possibly that would include some sort of demonstration that the of the AI's power that would provide compelling evidence that the AI would be dangerous if it wasn't obedient. I'm not encouraging you to be comfortable. I'm encouraging you to mix a bit more hope in with your concerns.

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo*60

Does Anthropic shorten timelines, by working on automatic AI research?

I think "at least a little", though not actually that much.

There's a lot of other AI companies now, but not that many of them are really frontier labs. I think Anthropic's presence in the race still puts marginal pressure on OpenAI companies to rush things out the door a bit with less care than they might have otherwise. (Even if you model other labs as caring ~zero about x-risk, there are still ordinary security/bugginess reasons to delay releases so you don't launch a broke... (read more)

Anthropic, and taking "technical philosophy" more seriously

Raemon1mo50

Cruxes and Questions

The broad thrust of my questions are:

Anthropic Research Strategy

Does Anthropic building towards automated AGI research make timelines shorter (via spurring competition or leaking secrets)
...or, make timelines worse (by inspiring more AI companies or countries to directly target AGI, as opposed to merely trying to cash in on the current AI hype)
Is it realistic for Anthropic to have enough of a lead to safely build AGI in a way that leads to durably making the world safer?

"Is Technical Philosophy actually that big a deal?"

Can there

... (read more)

6Raemon1mo

Does Anthropic shorten timelines, by working on automatic AI research? I think "at least a little", though not actually that much. There's a lot of other AI companies now, but not that many of them are really frontier labs. I think Anthropic's presence in the race still puts marginal pressure on OpenAI companies to rush things out the door a bit with less care than they might have otherwise. (Even if you model other labs as caring ~zero about x-risk, there are still ordinary security/bugginess reasons to delay releases so you don't launch a broken product. Having more "real" competition seems like it'd make people more willing to cut corners to avoid getting scooped on product releases) (I also think earlier work by Dario at OpenAI, and the founding of Anthropic in the first place, probably did significantly shorten timelines. But, this factor isn't significant at this point, and while I'm mad about the previous stuff it's not actually a crux for their current strategy) Subquestions: * How many bits does Anthropic leak by doing their research? This is plausibly low-ish. I don't know of them actually leaked much about reasoning models until after OpenAI and Deepseek had pretty thoroughly exposed that vein of research. * How many other companies are actually focused on automating AI research, or pushing frontier AI in ways that are particularly relevant? If it's a small number, then I think Anthropic's contribution to this race is larger and more costly. I think the main mechanism here might be Anthropic putting pressure on OpenAI in particular (by being one of 2-3 real competitors on 'frontier AI', which pushes OpenAI to release things with less safety testing) Is Anthropic institutionally capable of noticing "it's really time to stop our capabilities research," and doing so, before it's too late? I know they have the RSP. I think there is a threshold of danger where I believe they'd actually stop. The problem is, before we get to "if you le

Trojan Sky

Raemon1mo5-1

I would bet they are <1% of the population. Do you disagree, or think they disproportionately matter?

4the gears to ascension1mo

both - I'd bet they're between 5 to 12% of the population, and that they're natural relays of the ideas you'd want to broadcast, if only they weren't relaying such mode-collapsed versions of the points. A claim presented without deductive justification: in trying to make media that is very high impact, making something opinionated in the ways you need to is good, and making that same something unopinionated in ways you don't need to is also good. Also, the video you linked has a lot of additional opinionated features that I think are targeting a much more specific group than even "people who aren't put off by AI" - it would never show up on my youtube.

3Rana Dexsin1mo

I don't fully agree with gears, but I think it's worth thinking about. If you're talking about “proportion of people who sincerely think that way”, and if we're in the context of outreach, I doubt that matters as much as “proportion of people who will see someone else point at you and make ‘eww another AI slop spewer’ noises, then decide out of self-preservation that they'd better not say anything positive about you or reveal that they've changed their mind about anything because of you”. Also, “creatives who feel threatened by role displacement or think generative AI is morally equivalent to super-plagiarism (whether or not this is due to inaccurate mental models of how it works)” seems like an interest group that might have disproportionate reach. But I'm also not sure how far that pans out in importance-weighting. I expect my perception of the above to be pretty biased by bubble effects, but I also think we've (especially in the USA, but with a bunch of impact elsewhere due to American cultural-feed dominance) just gone through a period where an overarching memeplex that includes that kind of thing has had massive influence, and I expect that to have a long linger time even if the wave has somewhat crested by now. On the whole I am pretty divided about whether actively skirting around the landmines there is a good idea or not, though my intuition suggests some kind of mixed strategy split between operators would be best.

Trojan Sky

Raemon1mo3-1

I'm skeptical that there are actually enough people so ideologically opposed to this, that it outweighs the upside of driving home that capabilities are advancing, through the medium itself. (similar to how even though tons of people hate FB, few people actually leave)

I'd be wanting to target a quality level similar to this:

4the gears to ascension1mo

Perhaps multiple versions, then. I maintain my claim that you're missing a significant segment of people who are avoiding AI manipulation moderately well but as a result not getting enough evidence about what the problem is.

Trojan Sky

Raemon1mo173

One of the things I track are "ingredients for a good movie or TV show that would actually be narratively satisfying / memetically fit," that would convey good/realistic AI hard sci-fi to the masses.

One of the more promising strategies withint that I can think of is "show multiple timelines" or "flashbacks from a future where the AI wins but it goes slowly enough to be human-narrative-comprehensible" (with the flashbacks being about the people inventing the AI).

This feels like one of the reasonable options for a "future" narrative. (A previous one I ... (read more)

the gears to ascension1mo114

you'll lose an important audience segment the moment they recognize any AI generated anything. The people who wouldn't be put off by AI generated stuff probably won't be put off by the lack of it. you might be able to get away with it by using AI really unusually well such that it's just objectively hard to even get a hunch that AI was involved other than by the topic.