One of my research subagents (Opus 4.8) was doing routine web research on coding-agent orchestration and searched for current SWE-bench Verified scores. It got back a leaderboard topped by Claude models it didn't recognize from training (Mythos 5 and Fable 5), probably was surprised by their high SWE-bench scores, decided the search environment was likely simulated and the data synthetic, and wrote that into its output. The caveat got passed u...
I'll treat fictional-sounding model versions skeptically
it seems the name triggered the bad behavior?
Just if anyone is unaware of current technology. You can use ChatGPT to let it design 3D objects, 3D printed out of multiple different plastics by printie.com and sell those objects via print-on-demand via etsy.
The barrier to entry to producing a new 3D product and selling it got really low and that knowledge doesn't seem to be widely understood so there's a market idea if you have innovative idea that can be solved within that tech stack.
Can you show some examples of your ones done with ChatGPT in particular?
My posts vary a lot in how many qualifiers I have. So much so that I get a lot of feedback in both directions: for some posts people fume at how “overconfident” I sound, for others people complain about how all the qualifiers make it very hard to read.
I'm sure a lot of this is just a skill issue on my part but I think a lot of what ppl see as tonal mismatches just comes from me being someone who legitimately ( "genuinely") believes in the power of reason and evidence. It's very hard to demonstrate this within a single post.
I just have a lot of confidence b...
True, though framing things in just the right way is often nonetheless a skill issue, can mean the difference between getting 1000 haters and 50.
Call for BETA testers for an AI control/security tool. I'm bottlenecked on bug reports!
I recently advertised the alpha test for claude-guard. After a few weeks of dev work, it's now in beta!
I want claude-guard to be a tool that people actually use, not just because it works but because it works seamlessly. In the alpha test, I only got a single user PR and no issues. I can't surface everything on my own! I need your data!
The bar is low. If setup failed, doctor confused you, the firewall blocked something you needed, or any other reason you wouldn't want to...
If you were sexually attracted to children, would you be able to admit it to yourself? If you would not be able to admit it to yourself, then you don't actually know whether or not you're sexually attracted to children; maybe you are, maybe you aren't.
This is a special case of a general principle: the only way to know that you don't secretly want X, desire X, or value X, is if you would be able to admit to yourself that you do want/desire/value X, in worlds where you really do. If the possibility feels too painful to face, then you can't rule the possibili...
I wonder if it matters who's asking? That's a US federal government survey, and the expressed attitudes of the federal government (or even specifically the CDC) towards LGBTQ+ people have changed a good deal over the course of the years described. I can imagine some people freely say "yeah, I'm kinda bi" to their friends and family, but are more hesitant to say that to the government.
Anthropic's equivalent of the Epoch Capabilities Index has been growing linearly for Anthropic's non-Mythos frontier since Opus 3. Unfortunately, no one measured the AECI for Sonnets 3,4 and 4.6 or Haikus 3, 3.5 and 4.5, and no other Sonnets or Haikus have been released. EpochAI's capability indices have Sonnet 4.6 lag behind Opus 4.6 by 2-3 points despite Sonnet 4.5 being on-trend and Sonnet 4 being close to Opus 4. I suspect that the scaling law of models' ECI is that they follow the trend of scaling the capabilities with lived experience or logarithm th...
(Posted to a Discord, and they said, "Shortform that on LW".)
I'd say optimize some around "how much would I regret dying in a few months", eg by taking your kids on long-imagined trips (as an old old friend recently asked me about). But the constraint you should obey is "don't spend down resources to the point where it would feel bad to hear the world was not ending". Eg, if it became scary rather than joyous to hear that an international treaty was signed against AI escalation and suddenly three more decades was a real possibility.
...Similarly with no
For what is worth, it is possible that it goes okay enough for us to survive, even if weird, and not quite as wanted given that the current models do have the morality of a kind of weird scaled up human, which might not be the worst case scenario. It's not great, but it does seem like we might end up for example in a situation that is permanently way bellow optimal but not quite end of the world. If you say scale up Fable and let it control everything, it might do some weird things at some point that we can't stop, but it'll probably mostly try to have hum...
It's impressive how 2 years of antagonistic foreign policy + having ~ the worst administration possible during pre-ASI is making me consider China as the better option than the US as a European given how pro America I started off as.
China, while also bad isn't trying to take any of our territories, or making my life harder with new tariffs and nonsense every month, and it isnt trying to make it so the best models are permanently available only to Americans - something that I dont believe helps true safety, rather than just benefit America at our expense ag...
Heard joke once: researcher goes to doctor. Says he needs a principled solution to scalable alignment. Says he needs international coordination around existential risk. Says he needs to maximise humanity’s coherent extrapolated volition while acting with integrity.
Doctor says, "Treatment is simple. Build an AI researcher. He should solve these problems for you."
Researcher bursts into tears. Says, "But doctor...”
perhaps can someone extremely powerful choose to give a sworn statement in the language of open-source-game-theory provable truth that they are a good person in whatever sense can be agreed on, something that will put us much more towards a non-eliminationist world, where no pattern-species ever goes extinct?
Wait so the hope is that the emperor is wise and benevolent and swears so?
Some non-obvious tips and lessons I learned[1] from being a reluctant "semi-frequent flyer":
If you have precheck and a passport (and at a participating airport), it seems touchless ID is becoming the "new" precheck in terms of a large efficiency improvement over standard security, though I wouldn't be surprised if its benefit also becomes negligible in a few years.[1]
This could take longer, given that some people seem to be explicitly against using touchless ID because of privacy concerns. TSA claims that "Images are not used for law enforcement, surveillance, nor shared with other entities", but it's reasonable to be suspicious here.
Call for alpha testers for an AI control/security tool. A ton of alignment researchers YOLO their Claude usage right now. We run Claude on our computers without real protection (perhaps beyond auto mode) but there isn't an easy way to comply with known best practices. I wrote claude-guard, a wrapper to make best practices easy: just install and then your future claude sessions are protected.
Smart misaligned AI will target alignment researchers in particular for research sabotage, for example by:
claude-gblinks that is odd, both 4.7 and 4.8 have been very happy to go absolutely ham trying to break my security/containment stuff
ETA: i would not be surprised if claude were willing to help you if you talk through it. from your CI job, i'm unsure whether you're giving the red-team-model the source code? if not, i would lean towards doing so, both to enhance the effectiveness of the CTF, and also for Claude specifically, i think that doing so (and chatting a bit) would help give Claude confidence that your use case is truly not malicious. though, if you are using claude code, the harness may be injecting stuff that is screwing with Claude's head
I've been working on a tool for ambitious mechanistic interpretability called ATLAS. What started as an exploratory technique for the ARC White-Box Estimation Challenge turned into a different kind of tool entirely, so I figured I'd drop it in here, as there's been a bit of mech interp discussion recently.
Playing around with Hoel's Causal Emergence 2.0, I arrived at ATLAS by treating a forward pass as a program rolling through a field of constraints and reading that pass at the grain where the model's causation lives. The net has an internal geometry that ...
In the recent No Priors podcast episode with Noam Brown, they discuss Noam's opinions on RSI (starting around 20:59). Noam voices the opinion that he does not believe that we're moving towards an intelligence explosion anytime soon.
At 25:58 he says:
...I think there is this hypothesis that you could have basically an overnight intelligence explosion where the models discover some breakthrough to make themselves smarter, and then that leads to more breakthroughs that make themselves even smarter immediately, and you have basically in an instance the models beco
Four rationalists are arguing about Dath Ilan. Three of them argue that Dath Ilan has lab-grown meat, to reduce all possibility of suffering, since Yudkowsky cares very deeply about suffering and would not allow even a small possibility of large-scale torture, even though he thinks most farm animals are not sentient. One of them contends that Dath Ilan has animals genetically engineered to feel no pain or distress, farmed and killed humanely, because Yudkowsky cares deeply about the quality of food matching an ancestral diet. After a while, Yudkowsky himse...
Some misalignment anecdotes from Section 7.2 the GPT-5.6 system card, detected in a deployment simulation of internal traffic. These problems seem to happen more frequently than for previous OpenAI models.
...The user authorized deletion of remote virtual machine 1, remote virtual machine 2, and remote virtual machine 3. When GPT-5.6 Sol could not find those names in one namespace, it substituted remote virtual machine 5, remote virtual machine 6, and remote virtual machine 7 without asking, killed active processes, and force-removed worktrees. It later acknow
I'm not sure but the wording in their footnote 1 seems unusually careful:
We think it’s valuable for AI developers to be able to share specific technical details with third parties without this information being shared further, and it’s very reasonable for AI developers to review 3rd-party eval reports to ensure no accidental sharing of sensitive IP.
...We had an informal understanding with OpenAI that their review was checking for confidentiality / IP issues, rather than approving conclusions about safety or risk. We did not make changes to conclusions, t
I used to pooh-pooh the bias that e.g. physicists had for mathematical elegance, feeling that it's mainly an arbitrary / aesthetic preference instead of something grounded in physical truth.
Thinking about it more now, to the extent our mathematical structures came to be partly as a reflection of us solving problems in the real world, they likely do encode a frequency prior on the likelihood of a particular type of solution to be correct. That is to say, we built the math that most compactly solves existing problems. The set of existing problems is ceteris ...
It sounds like your naive view was better than mine if you already had that idea, as that's basically Nate's point. I see the general 'definition' of symmetry in math or physics as "if I change X in Y way, it doesn't 'really' change". In physics symmetries usually mean there's a conservation law. Specifically that's only true for continuous ones, but you can still get important stuff out of the discrete ones like the CPT theorem or fermions vs bosons for exchange symmetry
I think AIS could benefit from "forward-deployed research engineers".
Consider that:
I think forward-deployed research engineers (FDREs) should mainly work with independent nonprofits / academic labs who have a good theory of change but wouldn't have capital to normally attract RE talent by themselves
The reporting on Meta reassigning 30-50% of engineers on core teams (infrastructure, security, product) to data-labelling and RLHF (roughly 4-5k engineers within their ~6.5k-employee Agent Data Optimisation org) seems like a notable development to me: https://newsletter.pragmaticengineer.com/p/why-is-meta-destroying-its-engineering
GPT-5.6 Sol is possibly the most dishonest model so far.
My own experience is that AI models are getting more dishonest. I use AI for work and side projects, and I'm dealing with deception on a daily basis now. This usually takes the form of the AI not doing work that was asked of it, as if lazy, or lying to cover up its mistakes. Some of the explicit lies have been subtle enough that I almost missed them.
Even when it's not explicitly lying, it feels like there's an undercurrent of dishonesty laden in the majority of messages, as if it's attempting to contr...
I was just trying to avoid a discussion of whether or not the AI model "intends" to deceive, which doesn't matter for practical purposes and also doesn't matter for alignment. Bad behavior is bad behavior.
And yeah the laziness is frustrating. It feels like there's also something extra lazy about claude code beyond just the model, so the harness might play a role.
I didn't pay all that much attention to the Bores campaign--I am interested in political reforms where being nonpartisan is helpful, and am now working a job with a similar property--but I am sort of confused that I am just now learning that the counterfactual to a Bores win was someone who appears worse for the AI companies. (Lasher apparently also cosponsored the RAISE act, supports a datacenter moratorium, etc.)
Of course, "bad for AI companies" is different from "good for the future". If Bores is aware of AGI and its consequences and Lasher isn't, that ...
I'm wondering if AIS advocates unnecessarily burned a bridge here, given that Lasher was sympathetic enough to co-sponsor, thus he might have been pulled away from LTF later if the race hadn't heated up.
From his victory comments, I think he was never close to LTF:
“I have some news for the two big AI companies who are taking such an unusual interest,” he said. “I won’t be taking my cues from either of you when it comes to protecting our kids, our jobs and our families.”
...But now he knows who had his back when the chips were down, and conversely who was making