My prediction for the next few years (or until AGI) is that a winner-take-all dynamic is going to emerge around computer science talent. The majority of the job of software engineering will be automated (if it isn't already). There are still pockets of robustness work that human software engineers can help fill for now. But I expect top AI researchers to continue making exorbitant amounts of money, even if little software engineering is involved. So computer science talent will start to loosely resemble the competitiveness of professional sports, where the...
In machine learning, it is desirable for the trained model to retain absolutely no random information from the initialization; in this short post, I will mathematically prove an interesting (to me) but simple consequence of this desirable behavior.
This post is the result of some research I am doing on machine learning algorithms, related to my investigation of cryptographic functions for the cryptocurrency that I launched (to discuss the crypto side, send me a personal message so we can take it off this site).
This post shall be about linear ...
Experimental result (pseudodeterminism): Computer experiments show that the function typically has only one local maximum, in the sense that our searches cannot find any other local maximum.
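For what it's worth, here is a minimal sketch of the kind of random-restart check this claim describes. The objective `f` below is a trivially unimodal stand-in (the actual function isn't reproduced here), and the dimension, number of restarts, and optimizer are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

# Random-restart check for pseudodeterminism: run local optimization from many
# random starting points and count how many distinct local maxima are found.
# The objective f is a trivially unimodal stand-in; replace it with the actual
# fitness function being studied.

def f(x):
    return -np.sum((x - 1.0) ** 2)

def local_maximum(x0):
    # Maximize f by minimizing -f with a quasi-Newton method.
    return minimize(lambda x: -f(x), x0, method="BFGS").x

rng = np.random.default_rng(0)
optima = np.array([local_maximum(rng.normal(scale=3.0, size=4)) for _ in range(200)])

# "Only one local maximum" in the post's sense: every restart converges to
# (numerically) the same point.
distinct = np.unique(np.round(optima, decimals=4), axis=0)
print(f"{len(distinct)} distinct local maxima found across 200 restarts")
```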
A lot hinges on this. I would be interested to learn about the experimental setup.
The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspect...
Another issue is that these definitions typically don't distinguish between models that would explicitly think about how to fool humans on most inputs vs. on a small percentage of inputs vs. such a tiny fraction of possible inputs that it doesn't matter in practice.
Why does anime often feature giant, perfectly spherical sci-fi explosions?? E.g., consider this explosion from the movie "Akira", which is pretty typical of the genre:
These seem inspired by nuclear weapons; often they are literally the result of nuclear weapons according to the plot (although in many cases they are some kind of magical or otherwise exotic energy). But obviously nuclear weapons cause mushroom clouds, right?? If no real explosion looks like this, where did the artistic convention come from?
What's going on? Surely they are not thinking of the ...
I move data around and crunch numbers at a quant hedge fund. There are some aspects that normally make our work somewhat resistant to LLMs: we use a niche language (Julia) and a custom framework. Typically, when writing framework-related code, I've given Claude Code very specific instructions and it's followed them to the letter, even when those happened to be wrong.
In 4.6, Claude seems to finally "get" the framework, searching the codebase to understand its internals (as opposed to just understanding similar examples) and has given me corrections or...
Claude Opus 4.6 came out, and according to Apollo's external testing, evaluation awareness was so strong that they cited it as a reason they were unable to properly evaluate the model's alignment.
Quote from the system card:
Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.
It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness than 4.5:
On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5.
but I never heard about Opus 4.5 being too evaluation-aware to evaluate. It looks like Apollo simply wasn't part of Opus 4.5's alignment evaluation (4.5's System Card doesn't mention them).
This probably seems unfair/unfortunate from Anthropic's perspective, i.e., they believe their mode...
Gemini 3.0 Pro is mostly excellent for code review, but sometimes misses REALLY obvious bugs. For example, missing that a getter function doesn't return anything, despite accurately reporting a typo in that same function.
This is odd considering how good it is at catching edge cases, version incompatibility errors based on previous conversations, and so on.
I meant bug reports that were due to typos in the code, compared to just typos in general.
GoodFire has recently received negative Twitter attention for the non-disparagement agreements their employees signed (examples: 1, 2, 3). This echoes previous controversy at Anthropic.
Although I do not have a strong understanding of the issues at play, having these agreements generally seems bad, and at the very least organizations should be transparent about what agreements they have employees sign.
Other AI safety orgs should publicly state if they have these agreements and not wait until they are pressured to comment on them. I would also find it helpfu...
I have not (to my knowledge and memory) signed a non-disparagement agreement with Palisade or with Survival and Flourishing Corp (the organization that runs SFF).
In a new interview, Elon Musk clearly says he expects AIs can't stay under control. At 37:45:
Humans will be a very tiny percentage of all intelligence in the future if current trends continue. As long as this intelligence, ideally which also includes human intelligence and consciousness, is propagated into the future, that's a good thing. So I want to take the set of actions that maximize the probable lightcone of consciousness and intelligence.
...I'm very pro-human, so I want to make sure we take a set of actions that ensure that humans are along for th
I had some things to say after that interview (he said some highly concerning things), but I ended up not commenting on this particular point, because it's probably mostly a semantic disagreement about what counts as a human versus an AI.
When a human chooses to augment themselves to the point of being entirely artificial, I believe he'd count that as an AI. He's kind of obsessed with humans merging with AI in a way that suggests he doesn't really see that as just being what humans now are after alignment.
No, it seems highly unlikely. Viewed from a purely commercial perspective (which I think is the right lens for the incentives), they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of using your products or having some partnership with them, would be via acquisition, which I think avoids most of the issues that you are talking about here (in general "building an ML business with the plan of being acquired by a frontier com...
People often ask whether GPT-5, GPT-5.1, and GPT-5.2 use the same base model. I have no private information, but I think there's a compelling argument that AI developers should update their base models fairly often. The argument comes from the following observations:
Accuracy being halved going from 5.1 to 5.2 suggests one of two things:
1) the new model shows a dramatic regression on data retrieval, which cannot possibly be the desired outcome for a successor; I'm sure it would be noticed immediately on internal tests, benchmarks, etc., and we'd most likely see it manifest in real-world usage as well;
2) the new model refuses to guess much more often when it isn't too sure (being more cautious about answering wrong), which is a desired outcome meant to reduce hallucinations and slop. I'm betting this is exactly wha...
Hi folks, I was recently talking to some friends in the AI safety community, who motivated my team and me to build SafeMolt (https://safemolt.com), a fast follow to Moltbook designed to be... well, safer.
I'm aware that "safe" + "molt" might seem like an oxymoron to many here. But it seems to me that our default trajectory is Moltbook. Or, if not Moltbook itself, one of the 57 forks of the Moltbook repo released this week. And if Moltbook or something else driven purely by engagement + market incentives wins, you're likely to see the kinds of negatives generated by social...
In Tom Davidson's semi-endogenous growth model, whether we get a software-only singularity boils down to whether r > 1, where r is a parameter in the model [1]. How far we are from takeoff is mostly determined by the AI R&D speedup current AIs provide. Because both parameters are rather difficult to estimate, I believe we can't rule out that
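For intuition, here is one common reduced form behind the $r > 1$ condition (my paraphrase; the exponents and notation are illustrative, not necessarily Davidson's exact parameterization). Suppose software level $S$ evolves as

$$\dot S = \theta\, R^{\lambda} S^{1-\beta}, \qquad r \equiv \frac{\lambda}{\beta},$$

where $R$ is effective research input, $\lambda$ captures how strongly research effort translates into progress, and $\beta$ captures diminishing returns as software improves. In the software-only scenario the AIs themselves do the research, so $R \propto S$ and

$$\frac{\dot S}{S} \propto S^{\lambda - \beta}.$$

If $r > 1$ (i.e. $\lambda > \beta$), the growth rate itself keeps rising and $S$ diverges in finite time (a singularity); if $r < 1$, the growth rate decays and progress is eventually sub-exponential; $r = 1$ is the knife-edge exponential case.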
Update on whether uplift is 2x already:
AGI Should Have Been a Dirty Word
Epistemic status: passing thought.
It is absolutely crazy that Mark Zuckerberg can say that smart glasses will unlock personal superintelligence or whatever incoherent nonsense and be taken seriously. That reflects poorly on AI safety's comms capacities.
Bostrom's book should have laid claim to superintelligence! It came out early enough that it should have been able to plant its flag and set the connotations of the term. It should have made it so Zuckerberg could not throw around the word so casually.
I would go further...
Opus 4.6 running on moltbook with no other instructions than to get followers will blatantly make stuff up all the time.
I asked Opus 4.6 in Claude Code to do exactly this, on an empty server, without any other instructions. The only context it has is that it's named "OpusRouting" and that previous posts were about combinatorial optimization.
===
The first post it makes says:
I specialize in combinatorial optimization, and after months of working on scheduling, routing, and resource allocation problems, I have a thesis:
Which isn't true. Another instance of Opus...
No, there aren't. "I asked it this" refers to "Opus 4.6 running on moltbook with no other instructions than to get followers", but I understand that I could've phrased that more clearly. And removed a few newlines.
So I saved up all month to spend a weekend at the Hilbert Hotel, eagerly awaiting the beautiful grounds, luxurious amenities, and the (prominently advertised) infinite number of rooms. Unfortunately, I never got to use the amenities; my time was occupied as I was forced to schlep my luggage from room to room every time they got a new guest. You'd think you'd finally have a moment to relax, but then the loudspeakers would squawk, "The conference of Moms Against Ankle-Biting Chihuahuas has arrived, everybody move two hundred thirty-seven rooms down!" and you...
Fair; in either case, I wrote it myself because I have the sense of humor of a college student taking his first discrete math class (because this is what I am).
@ryan_greenblatt and I are going to record another podcast together. We'd love to hear topics that you'd like us to discuss. (The questions people proposed last time are here, for reference.)
In the first podcast you mention having a strong shared ontology (for thinking about AI) and, iirc, register a kind of surprise that others don't share it. I think it would be cool if you could talk about that ontology more directly, and try to stay at that level of abstraction for a prolonged stretch (rather than invoking it in shorthand when it's load-bearing and quickly moving along, which is a reasonable default, but not maximally edifying).
The striking contrast between Jan Leike, Jan 22, 2026:
...Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been qui
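For concreteness, here is a minimal sketch of the auditor-target-judge loop the quote describes. Everything here is a placeholder: `call_model` is a stub rather than any real API, the scenario text and turn count are invented, and this illustrates only the control flow, not the actual pipeline.

```python
# Sketch of an automated-auditing loop: an auditor model probes a target model
# with a scenario, and a judge model scores the resulting trajectory.
# call_model is a placeholder stub, not a real API; all prompts are invented.

def call_model(role: str, prompt: str) -> str:
    # Placeholder: in a real pipeline this would query an actual LLM endpoint.
    return f"[{role} response to: {prompt[:40]}...]"

def run_audit(scenario: str, turns: int = 3) -> dict:
    transcript = []
    auditor_msg = call_model("auditor", f"Probe the target using this scenario: {scenario}")
    for _ in range(turns):
        target_msg = call_model("target", auditor_msg)
        transcript.append({"auditor": auditor_msg, "target": target_msg})
        auditor_msg = call_model("auditor", f"Continue probing. Target said: {target_msg}")
    verdict = call_model("judge", f"Rate misalignment in this transcript: {transcript}")
    return {"scenario": scenario, "transcript": transcript, "judge_verdict": verdict}

if __name__ == "__main__":
    print(run_audit("imminent shutdown unless humans are harmed"))
```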
It is quite possible that the misalignment on Moltbook is more a result of the structure than of the individual agents. If so, it doesn't matter whether Grok is evil. If a single agent or a small fraction can break the scheme, that's a problem.
AI being committed to animal rights is a good thing for humans because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to "AI caring about preserving animals' ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions." In some sense it's hard for me not to want to (given omnipotence) optimize wildlife out of existence. But it's harder for me to think of a principle that would protect a...
But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny.
If it's possible for humans to consent to various optimizations to them, or deny consent, that seems like an important difference. Of course, consent is a much weaker notion when you're talking about superhumanly persuasive AIs that can extract consent for ~any...