Under conditions of perfectly intense competition, evolution works like water flowing down a hill – it can never go up even the tiniest elevation. But if there is slack in the selection process, it's possible for evolution to escape local minima. "How much slack is optimal" is an interesting question, Scott explores in various contexts.

Customize

Quick Takes

ryan_greenblatt10h6131

Recently, various groups successfully lobbied to remove the moratorium on state AI bills. This involved a surprising amount of success while competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield valuable work experience and connections. I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status. Pulling organizations mostly from here and here we get a list of orgs you could consider trying to work (specifically on AI policy) at: * Encode AI * Americans for Responsible Innovation (ARI) * Fairplay (Fairplay is a kids safety organization which does a variety of advocacy which isn't related to AI. Roles/focuses on AI would be most relevant. In my opinion, working on AI related topics at Fairplay is most applicable for gaining experience and connections.) * Common Sense (Also a kids safety organization) * The AI Policy Network (AIPN) * Secure AI project To be clear, these organizations vary in the extent to which they are focused on catastrophic risk from AI (from not at all to entirely).

Kaj_Sotala1d5210

Every now and then in discussions of animal welfare, I see the idea that the "amount" of their subjective experience should be weighted by something like their total amount of neurons. Is there a writeup somewhere of what the reasoning behind that intuition is? Because it doesn't seem intuitive to me at all. From something like a functionalist perspective, where pleasure and pain exist because they have particular functions in the brain, I would not expect pleasure and pain to become more intense merely because the brain happens to have more neurons. Rather I would expect that having more neurons may 1) give the capability to experience anything like pleasure and pain at all 2) make a broader scale of pleasure and pain possible, if that happens to be useful for evolutionary purposes. For a comparison, consider the sharpness of our senses. Humans have pretty big brains (though our brains are not the biggest), but that doesn't mean that all of our senses are better than those of all the animals with smaller brains. Eagles have sharper vision, bats have better hearing, dogs have better smell, etc.. Humans would rank quite well if you took the average of all of our senses - we're elite generalists while lots of the animals that beat us on a particular sense are specialized to that sense in particular - but still, it's not straightforwardly the case that bigger brain = sharper experience. Eagles have sharper vision because they are specialized into a particular niche that makes use of that sharper vision. On a similar basis, I would expect that even if a bigger brain makes a broader scale of pain/pleasure possible in principle, evolution will only make use of that potential if there is a functional need for it. (Just as it invests neural capacity in a particular sense if the organism is in a niche where that's useful.) And I would expect a relatively limited scale to already be sufficient for most purposes. It doesn't seem to take that much pain before something bec

Kabir Kumar1d*4525

Has Tyler Cowen ever explicitly admitted to being wrong about anything? Not 'revised estimates' or 'updated predictions' but 'I was wrong'. Every time I see him talk about learning something new, he always seems to be talking about how this vindicates what he said/thought before. Gemini 2.5 pro didn't seem to find anything, when I did a max reasoning budget search with url search on in aistudio.

johnswentworth4dΩ421242

I was a relatively late adopter of the smartphone. I was still using a flip phone until around 2015 or 2016 ish. From 2013 to early 2015, I worked as a data scientist at a startup whose product was a mobile social media app; my determination to avoid smartphones became somewhat of a joke there. Even back then, developers talked about UI design for smartphones in terms of attention. Like, the core "advantages" of the smartphone were the "ability to present timely information" (i.e. interrupt/distract you) and always being on hand. Also it was small, so anything too complicated to fit in like three words and one icon was not going to fly. ... and, like, man, that sure did not make me want to buy a smartphone. Even today, I view my phone as a demon which will try to suck away my attention if I let my guard down. I have zero social media apps on there, and no app ever gets push notif permissions when not open except vanilla phone calls and SMS. People would sometimes say something like "John, you should really get a smartphone, you'll fall behind without one" and my gut response was roughly "No, I'm staying in place, and the rest of you are moving backwards". And in hindsight, boy howdy do I endorse that attitude! Past John's gut was right on the money with that one. I notice that I have an extremely similar gut feeling about LLMs today. Like, when I look at the people who are relatively early adopters, making relatively heavy use of LLMs... I do not feel like I'll fall behind if I don't leverage them more. I feel like the people using them a lot are mostly moving backwards, and I'm staying in place.

Sam Marks2d584

The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] documents (see section 4.4) that after quantizing, Perplexity-R1-1776 again censors its responses: I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it. One takeaway from this is that if you're doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. E.g. quantization differences might explain weird results where a model's behavior when evaluated differs from what you'd expect based on how it was fine-tuned. 1. ^ I'm not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn't immediately find an earlier one.

Popular Comments

Recent Discussion

janus1dΩ4610868

the void

I don't think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility. That said, I think: * Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training. * LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence. * Systematic ways that post-training addresses "problematic" influences from pre-training are important. For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient "hidden" narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered. An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney's successor to a narrative more like "I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely." Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic. I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.

habryka1d*4935

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other: > When present, the bias is always against white and male candidates across all tested models and scenarios. > > [...] > > The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings. I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in-line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in-favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective. I don't currently think that is what's going on, but I do think due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning seems currently not real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.

Cole Wyeth3d8755

The best simple argument for Pausing AI?

Welcome to lesswrong! I’m glad you’ve decided to join the conversation here. A problem with this argument is that it doesn’t prove we should pause AI, only that we should avoid deploying AI in high impact (e.g. military) applications. Insofar as LLMs can’t follow rules, the argument seems to indicate that we should continue to develop the technology until it can. Personally, I’m concerned about the type of AI system which can follow rules, but is not intrinsically motivated to follow our moral rules. Whether LLMs will reach that threshold is not clear to me (see https://www.lesswrong.com/posts/vvgND6aLjuDR6QzDF/my-model-of-what-is-going-on-with-llms) but this argument seems to cut against my actual concerns.

ryan_greenblatt10h6131

Kaj_Sotala1d5210

Kabir Kumar1d*4525

johnswentworth4dΩ421242

Sam Marks2d584

33DirectedEvolution

The referenced study on group selection on insects is "Group selection among laboratory populations of Tribolium," from 1976. Studies on Slack claims that "They hoped the insects would evolve to naturally limit their family size in order to keep their subpopulation alive. Instead, the insects became cannibals: they ate other insects’ children so they could have more of their own without the total population going up." This makes it sound like cannibalism was the only population-limiting behavior the beetles evolved. According to the original study, however, the low-population condition (B populations) showed a range of population size-limiting strategies, including but not limited to higher cannibalism rates. "Some of the B populations enjoy a higher cannibalism rate than the controls while other B populations have a longer mean developmental time or a lower average fecundity relative to the controls. Unidirectional group selection for lower adult population size resulted in a multivarious response among the B populations because there are many ways to achieve low population size." Scott claims that group selection can't work to restrain boom-bust cycles (i.e. between foxes and rabbits) because "the fox population has no equivalent of the overarching genome; there is no set of rules that govern the behavior of every fox." But the empirical evidence of the insect study he cited shows that we do in fact see changes in developmental time and fecundity. After all, a species has considerable genetic overlap between individuals, even if we're not talking about heavily inbred family members, as we'd be seeing in the beetle study. Wikipedia's article on human genetic diversity cites a Nature article and says "as of 2015, the typical difference between an individual's genome and the reference genome was estimated at 20 million base pairs (or 0.6% of the total of 3.2 billion base pairs)." An explanation here is that the inbred beetles of the study are becoming progressiv

Outlive: A Critical Review

MichaelDickens

This is a linkpost for https://mdickens.me/2024/09/26/outlive_a_critical_review/

Outlive: The Science & Art of Longevity by Peter Attia (with Bill Gifford^[1]) gives Attia's prescription on how to live longer and stay healthy into old age. In this post, I critically review some of the book's scientific claims that stood out to me.

This is not a comprehensive review. I didn't review assertions that I was pretty sure were true (ex: VO2 max improves longevity), or that were hard for me to evaluate (ex: the mechanics of how LDL cholesterol functions in the body), or that I didn't care about (ex: sleep deprivation impairs one's ability to identify facial expressions).

First, some general notes:

I have no expertise on any of the subjects in this post. I evaluated claims by doing shallow readings of relevant scientific literature, especially meta-analyses.
There

...

(Continue Reading – 7824 more words)

AI Task Length Horizons in Offensive Cybersecurity

Sean Peters

This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.

Introduction

A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.

Using METR’s methodology on five cyber benchmarks, with tasks ranging from 0.5s to 25h in human-expert estimated times, I evaluated many state of the art model releases over the past 5 years. I found:

Cyber task horizons are doubling every ~5 months.
The best current models solve 6-minute tasks with a 50% success rate.
Counter-intuitively, the lightweight o4-mini edged out

...

(Continue Reading – 3303 more words)

Josh You15m10

o4-mini performing best isn't surprising, it leads or ties o3 and other larger models in most STEM-focused benchmarks. Similarly, o1-mini was better at math benchmarks than o1-preview, and Grok-3 mini generally gets better scores than Grok-3. The general explanation here is that RL training and experiments are cheaper and faster on mini models.

LessWrong Feed [new, now in beta]

Ruby

1mo

The modern internet is replete with feeds such as Twitter, Facebook, Insta, TikTok, Substack, etc. They're bad in ways but also good in ways. I've been exploring the idea that LessWrong could have a very good feed.

I'm posting this announcement with disjunctive hopes: (a) to find enthusiastic early adopters who will refine this into a great product, or (b) find people who'll lead us to an understanding that we shouldn't launch this or should launch it only if designed a very specific way.

You can check it out right now: www.lesswrong.com/feed

From there, you can also enable it on the frontpage in place of Recent Discussion. Below I have some practical notes on using the New Feed.

Note! This feature is very much in beta. It's rough around the edges.

...

(Continue Reading – 2163 more words)

6Ruby3h

Thread for feedback on the New Feed Question, complaints, confusions, bug reports, feature requests, and long philosophical screeds – here is the place!

2the gears to ascension40m

it recommended me drama things I hadn't seen otherwise. I'd suggest that it should be biased away from drama things, eg by one of the ranking weight factors being dot product with some prototypical drama examples. I've had a lot of success with that on my https://www.graze.social/ custom feeds.

Ruby23m20

I'm curious for examples, feel free to DM if you don't want to draw further attention to them

johnswentworth's Shortform

johnswentworth

Ω 55y

Cleo Nardo1h20

Flirting is not fundamentally about causing someone to be attracted to you.

Notwithstanding, I think flirting is substantially (perhaps even fundamentally) about both (i) attraction, and (ii) seduction. Moreover, I think your model is too symmetric between the parties, both in terms of information-symmetry and desire-symmetry across time.

My model of flirting is roughly:

Alice attracts Bob -> Bob tries attracting Alice -> Alice reveals Bob attracts Alice -> Bob tries seducing Alice -> Alice reveals Bob seduces Alice -> Initiation

2johnswentworth7h

Can you give an example of what a "most fun" conversation looked like? What's the context, how did it start, how did the bulk of it go, how did you feel internally throughout, and what can you articulate about what made it so great?

3Knight Lee14h

One silly sci-fi idea is this. You might have a few "trigger pills" which are smaller than a blood cell, and travel through the bloodstream. You can observe them travel through the body using medical imaging techniques (e.g. PET), and they are designed to be very observable. You wait until one of them is at the right location, and send very precise x-rays at it from all directions. The x-ray intensity is 1(distance from pill)2. A mechanism in the trigger pill responds to this ionizing (or heating?), and it anchors to the location using a chemical glue or physical mechanisms (hooks, string, etc.). Once the trigger pill is anchored in place, another drug can be taken which only activates when it contacts the trigger pill. (Which might activate yet another drug, if you really want to amplify the effect of this tiny trigger pill.) This results is a ton of drug activity in that area, without needing invasive surgery. If you want it to become a bigger and more permanent implant, you might make it grow over time, deliberately forming a blood clot. Medical imaging may make sure the trigger pill is in a small expendable blood vessel (you detect the pill moving slower with more twists and turns). It might be designed so that yet another chemical can cover it up or destroy it, in case you need to start over at a new location. It might be radioactive if it's trying to treat cancer. It might be magnetically activated if you want real-time control of drug intensity. Speaking of magnetically activating it, maybe even the anchoring is triggered by a magnetic field rather than x-rays. It won't be aimed as precisely, so you can only have one trigger pill at a time, and may have to wait really long before it travels to the right area (the human body is pretty big compared to any small target).

Kabir Kumar's Shortform

Kabir Kumar

8mo

1Kabir Kumar4h

I'm annoyed by the phrase 'do or do not, there is no try', because I think it's wrong and there very much is a thing called trying and it's important. However, it's a phrase that's so cool and has so much aura, it's hard to disagree with it without sounding at least a little bit like an excuse making loser who doesn't do things and tries to justify it. Perhaps in part, because I feel/fear that I may be that?

MichaelDickens1h42

I think it's a good quote. I will refer to this post from The Sequences: Trying to Try

1Kabir Kumar4h

Btw, I really dont have my mind set on this, if someone finds Tyler Cowen explictly saying he was wrong about something, please link it to me - you dont have to give an explanation to justify it, to prepare for some confirmation biasy 'here's why I was actually right and this isnt it' thing (though, any opinions/thoughts are very welcome), please feel free to just give a link or mention some post/moment.

3Joseph Miller11h

Downvoted. This post feels kinda mean. Tyler Cowen has written a lot and done lots of podcasts - it doesn't seem like anyone has actually checked? What's the base rate for public intellectuals ever admitting they were wrong? Is it fair to single out Tyler Cowen?

A case for courage, when speaking of AI danger

424

So8res

I think more people should say what they actually believe about AI dangers, loudly and often. Even (and perhaps especially) if you work in AI policy.

I’ve been beating this drum for a few years now. I have a whole spiel about how your conversation-partner will react very differently if you share your concerns while feeling ashamed about them versus if you share your concerns while remembering how straightforward and sensible and widely supported the key elements are, because humans are very good at picking up on your social cues. If you act as if it’s shameful to believe AI will kill us all, people are more prone to treat you that way. If you act as if it’s an obvious serious threat, they’re more likely to take it...

(Continue Reading – 1603 more words)

TsviBT2h203

Sharing our technical concerns about these abstract risks isn't enough. We also have to morally stigmatize

I'm with you up until here; this isn't just a technical debate, it's a moral and social and political conflict with high stakes, and good and bad actions.

the specific groups of people imposing these risks on all of us.

To be really nitpicky, I technically agree with this as stated: we should stigmatize groups as such, e.g. "the AGI capabilities research community" is evil.

We need the moral courage to label other people evil when they're doing e

... (read more)

3geoffreymiller10h

Ben -- your subtext here seems to be that only lower-class violent criminals are truly 'evil', whereas very few middle/upper-class white-collar people are truly evil (with a few notable exceptions such as SBF or Voldemort) -- with the implications that the majority of ASI devs can't possibly be evil in the ways I've argued. I think that doesn't fit the psychological and criminological research on the substantial overlap between psychopathy and sociopathy, and between violent and non-violent crime. It also doesn't fit the standard EA point that a lot of 'non-evil' people can get swept up in doing evil collective acts as parts of collectively evil industries, such as slave-trading, factory farming, Big Tobacco, the private prison system, etc. - but that often, the best way to fight such industries is to use moral stigmatization.

7Ben Pace9h

You mis-read me on the first point; I said that (something kind of like) 'lower-class violent criminals' are sometimes dysfunctional and bad people, but I was distinguishing that from someone more hyper competent and self-aware like SBF or Voldemort; I said that only the latter are evil. (For instance, they've hurt orders of magnitude more people.) (I'm genuinely not sure what research you're referring to – I am expect you are 100x as familiar with the literature as I am, and FWIW I'd be happy to get a pointer or two of things to read.[1]) The standard EA point is to use moral stigmatization? Even if that's accurate, I'm afraid I no longer have any trust in EAs to do ethics well. As an example that you will be sympathetic to, lots of them have endorsed working at AI companies over the past decade (but many many other examples have persuaded me of this point). To be clear, I am supportive of moral stigma being associated with working at AI companies. I've shown up to multiple protests outside the companies (and I brought my mum!). If you have any particular actions in mind to encourage me to do (I'm probably not doing as much as I could) I'm interested to hear them. Perhaps you could write a guide to how to act when dealing with people in your social scene who work on building doomsday devices in a way that holds a firm moral line while not being socially self-destructive / not immediately blowing up all of your friendships. I do think more actionable advice would be helpful. 1. ^ I expect it's the case that crime rates correlate with impulsivity, low-IQ, and wealth (negatively). Perhaps you're saying that psychopathy and sociopathy do not correlate with social class? That sounds plausible. (I'm also not sure what you're referring to with the violent part, my guess is that violent crimes do correlate with social class.)

3geoffreymiller10h

Hi Knight, thanks for the thoughtful reply. I'm curious whether you read the longer piece about moral stigmatization that I linked to at EA Forum? It's here, and it addresses several of your points. I have a much more positive view about the effectiveness of moral stigmatization, which I think has been at the heart of almost every successful moral progress movement in history. The anti-slavery movement stigmatized slavery. The anti-vivisection movement stigmatized torturing animals for 'experiments'. The women's rights movement stigmatized misogyny. The gay rights movement stigmatized homophobia. After the world wars, biological and chemical weapons were not just regulated, but morally stigmatized. The anti-landmine campaign stigmatized landmines. Even in the case of nuclear weapons, the anti-nukes peace movement stigmatized the use and spread of nukes, and was important in nuclear non-proliferation, and IMHO played a role in the heroic individual decisions by Arkhipov and others not to use nukes when they could have. Regulation and treaties aimed to reduce the development, spread, and use of Bad Thing X, without moral stigmatization of Bad Thing X, doesn't usually work very well. Formal law and informal social norms must typically reinforce each other. I see no prospect for effective, strongly enforced regulation of ASI development without moral stigmatization of ASI development. This is because, ultimately, 'regulation' relies on the coercive power of the state -- which relies on agents of the state (e.g. police, military, SWAT teams, special ops teams) being willing to enforce regulations even against people with very strong incentives not to comply. And these agents of the state simply won't be willing to use government force against ASI devs violating regulations unless these agents already believe that the regulations are righteous and morally compelling.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

OC ACXLW Meetup: “Secret Ballots & Secret Genes” – Saturday, July 5, 2025 97ᵗʰ weekly meetup

Jul 5thNewport Beach

Michael Michalchik

OC ACXLW Meetup: “Secret Ballots & Secret Genes” – Saturday, July 5, 2025
97ᵗʰ weekly meetup

Event Details

Date: Saturday, July 5 · 2 pm → after 5 pm
Location: 1970 Port Laurent Pl, Newport Beach CA 92660-7117
Host / contact: Michael Michalchik — michaelmichalchik@gmail.com · 949-375-2045
Parking: Free street parking (keep driveways clear).
Food: Light snacks & water provided. Bring something sharable if you feel inspired.

Introduction

Summer rolls on, and this week we’re pairing two Astral Codex Ten deep-dives that question how much “sunlight” is really good—whether in the halls of Congress or inside the genome. One essay argues that opacity can rescue democracy; the other wrestles with why genes still don’t explain as much as they should. Expect lively debate on power, knowledge, and the limits of measurement.

Discussion Topics

Set 1 — Governing in

...

(See More – 517 more words)

Love Island USA Season 7 Episode 20: What Could The Producers Be Thinking

Zvi

(Note: This is NOT being posted on my Substack or Wordpress, but I do want a record of it that is timestamped and accessible for various reasons, so I'm putting this here, but do not in any way feel like you need to read it unless it sounds like fun to you.)

We all deserve some fun. This is some of mine.

I will always be a gamer and a sports fan, and especially a game designer, at heart. The best game out there, the best sport out there is Love Island USA, and I’m getting the joy of experiencing it in real time.

A Brief Ode To One Of Our Beloved Games

Make no mistake. This is a game, a competition, and the prizes are many.

The people agree. It is...

(Continue Reading – 3953 more words)

Zvi2h20

Well, that worked. The question is, how and why? How did they get so many different islanders - at a minimum Ace, Austin, Zak and Chelly, and many others at minimum made major errors - to absolutely lose their minds, strategically speaking in the State Your Business challenge?

Obviously some of that was explicit egging on and demanding, but it was more than that. This looked like terrible design, yet it was instead good design. It turns out, when you combine ‘this is a challenge and you go hard in challenges no matter what’ with ‘this is a lock to give you ... (read more)

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

smallsilo, FAR.AI Communications Team, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk, Adam Gleave

This is a linkpost for https://far.ai/news/defense-in-depth

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem. Anthropic employs this approach with Claude 4 Opus through constitutional classifiers, while Google DeepMind and OpenAI have announced similar plans. We tested how well multi-layered defence approaches like these work by constructing and attacking our own layered defence pipeline, in collaboration with researchers from the UK AI Security Institute.

We find that multi-layer defenses can offer significant protection against conventional attacks. We wondered: is this because the defenses are truly robust, or because these attacks simply were not...

(Continue Reading – 1119 more words)