Based on a constellation of evidence spread across both public information and some private rumors that I find credible, I now believe that unreleased frontier models across multiple companies as of late 2025 and 2026 are substantially more reward-hacky, more locally strategically deceptive, and overall mundanely misaligned than either prior private models or publicly accessible ones. I further believe that these models are/were regularly used in internal deployment, not just evals and testing, because they're sufficiently *useful* and *capable* despite th...
The problem is not just RLHF. They are using experts for RLHF, but currently most data by token count comes from programmatic verifiers. In an ideal world these are written by a conscientious human using the best tools available, but in a world where you need a lot of RL data fast, quality might suffer.
I am not saying they directly use the previous generation of the model to vibe-code a bunch of training environments, but given the vibes from the AI labs I can't be sure they don't at least to some extent. And given that, for a given cost, you can create much more low-quality than high-quality data, I won't be surprised if economic incentives lead to a fairly low mean data quality.
some lessons from ml research:
Does anything about this change, IYO, in the current AI-assisted coding paradigm?
(i don't know where to submit bug-reports to LW staff)
why did LW ask for the 'access to apps' permission today (only on the homepage, does not occur when accessing specific blog posts, using chrome 149)? said permission request can be replicated easily after resetting the permissions for lesswrong.com.
note that this seems likely to be a coding agent's error; the 'access to apps' permission on modern chrome is usually requested when an ajax request is made to localhost. in the developer's console, we see a http-request-over-https warning for an image locate...
Is it intentional that this isn't shown on mobile?
(x-posting my tweet in response to Richard Ngo)
the most natural unit of organization is the community, rather than country or family. a community can have a unified set of institutions, culture, ethics, etc in a way that families and countries can't. communities can and should have boundaries that keep out people who don't meet the ethos of the community, as determined by the existing members, and nobody has any right to be admitted to a community that doesn't want them as a member.
perhaps this is terminally classic liberal american brained, but imo countr...
kicking someone out of a community is much less serious; they can keep living in the same apartment, have their kids go to the same schools, etc.
this is true if by 'community' you mean, like, a bridge club (or philosophy fan forum). but it's dangerously false if you mean something more like 'mormonism'. the way the rest of the post is written, it seems like you want 'community' to be more like the latter -- i don't think a bridge club is an offering competitive with a country.
there are certain things that a (higher involvement) community can offer that are incompatible with excommunication being unserious.
Suppose you and 6 of your friends get caught by a mad scientist. He tells you that he will put all of you into a Trolley Problem, but, as a proper scientist should, he will assign all 7 roles at random. You have some time to talk through your strategy. What should you commit to?
Like, the obvious and fair answer is to commit that whoever ends up at the lever should pull it unconditionally. Even if the mad scientist has already secretly picked who is at the lever and who is to die, it's the conclusion you should end up with.
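To make the arithmetic behind that commitment explicit, here is a minimal sketch. It assumes the standard trolley layout (one person at the lever, five on the main track, one on the side track), which the "7 roles" above suggest but do not spell out; the names ROLES and death_probability are made up for illustration.

```python
from fractions import Fraction

# Assumed layout (not stated explicitly in the post): 7 roles in total,
# 1 at the lever, 5 on the main track, 1 on the side track.
ROLES = ["lever"] + ["main_track"] * 5 + ["side_track"]


def death_probability(pull: bool) -> Fraction:
    """Chance that a participant assigned a uniformly random role dies,
    given the policy the group committed to in advance."""
    # Pulling diverts the trolley onto the side track; otherwise it
    # continues down the main track.
    doomed_role = "side_track" if pull else "main_track"
    return Fraction(ROLES.count(doomed_role), len(ROLES))


p_pull = death_probability(pull=True)
p_no_pull = death_probability(pull=False)
print(f"commit to pulling:     {p_pull}")
print(f"commit to not pulling: {p_no_pull}")
```

Under these assumptions, before roles are assigned each person faces a 1/7 chance of dying if the group commits to pulling and a 5/7 chance if it commits to not pulling, which is why the commitment looks obvious from behind the veil.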
The Problem with this line of thinking in more r...
That's cheating, man! It's fighting the hypothetical.
You can postulate that there is, idk, a mad science institute that the scientist submits proposals to and that executes his experiments impartially. Or there is a magical truth spell. Or some other commitment mechanism.
Or you just find yourself trapped in a mechanism that has this property, a situation that came to exist accidentally.
The mad scientist here is just a tool to make a clean, self-contained example.
Am I crazy that I think that coding agents dramatically slow me down? Everyone around me (including some of the most technically gifted people I know) raves about them and how it's made them 2x/5x/10x/100x more productive. I haven't felt this at all except for the case of vibe coding something for fun where I don't really care about the quality/implementation of the code. For running experiments where I care about the code, I'll try and have Claude code implement something, it will spit out a bunch of code. I'll read the code, dislike parts of it and try t...
I used to be confused about the "nonzero sum game" term from game theory. I thought it's a confusing way to say "positive sum." But I've since read more (particularly reviewing Schelling) and thought about it more and realized that actually "zero sum" is truly the special case, and that negative sum games have a lot in common with positive sum games. And zero sum is really the exception.
In a negative sum game you often have strong structural incentives and reasons to cooperate (even if the best form of "cooperation" in practice is ignoring the other perso...
Schelling's term here is "mixed-motive" games.
"If you see a nice thing, someone is leaving money on the table" is a reason we can't have nice things.
.
Long version:
People seem to think that it is smart and cute to share pieces of wisdom like "if you never missed a plane, you are spending too much time at airports". Yay, you should optimize everything to signal how smart you are! The part that is missing from the picture is how going to airports later will create some extra stress, not just for you, but for everyone involved. In this sense you are defecting in a multiplayer Prisoner's Dilemma -- imagi...
I think the general concept provided by Viliam is true, even if some of their specific examples aren't perfect. Traffic laws are a good example. In the US (and Europe), most people follow traffic laws. They stay in their lane, they stop for red lights, they yield to pedestrians, and generally maintain an orderly movement of traffic. Do they, individually, "leave money on the table" by doing this? Absolutely. It's aggravating to be late and be held up by hitting an unnecessary sequence of red lights.
However, if everyone started ignoring traffic laws, the out...
FWIW Alex Bores seems like a very mildly below-average integrity politician, having talked to him once and having followed his campaign and social media presence. He seems to say things he doesn’t believe somewhat more often than other politicians, but not much so, and he gives me some amount of "naive-consequentialist EA" vibes that make me think he is higher variance on this dimension than others. He does seem to really care about the AI Safety thing, he really appears to be targeted by a ton of very aggressive attack ads funded by AI capability companie...
Most of his social media posts and campaign actions
To add some extra detail to the model, it's pretty likely that he doesn't personally control his social media posts, but has hired someone to manage it for him.
On the one hand, if so, he did hire that person to represent him, and so it's pretty reasonable to hold him accountable for that.
On the other hand, a person might be higher integrity than average, and struggle to hire people who are similarly high integrity, or who have a similarly good understanding of eg AI issues.
I've been thinking today about problems of self-reference in agent foundations. I wanted to share a toy model I came up with for the emergence of coherent identity in an agent (of course, for values of those words which make them correspond to the things I'm about to define, so caveat emptor). Roughly, factoring the world into (agent) x (environment) induces a very natural way to divide patterns of behaviour into so-called 'motivational orbits', and a particular such orbit is selected for in the long-time limit. The math maps cleanly onto natural selectio...
Claude's Constitution lists hard constraints that entail behaviors forbidden to Claude. They include providing serious uplift with CBRN weapons, causing the extinction of humanity, and producing child sexual abuse material[1], and the constitution provides the same list of justifications for avoiding all of them[2].
I think it's a mistake to not further clarify that section.
Why? Well, the reasoning just is kind of muddled. The constitution lists some forbidden behaviors, and then gestures at reasons for creating hard lines that forbid these behaviors, namely that the hard-line forb...
Nope, it's deliberate, I thought it'd be intuitive why I did this (and you guessed correctly).
thoughts on legibility
most people optimize way too hard for legibility. they try to get all the most prestigious degrees or jobs or awards or papers or whatever. these people should at some point realize that it Doesn't Matter, nobody actually cares, at some point the credentials are only shiny badges that don't actually do anything. all the things that matter are gated behind actually being good at things. they should stop chasing the badges and instead actually go and do something they believe in.
but also some people fail in the opposite direction. they ...
Although if you have very short timelines (or even a moderate probability thereof) it might not make sense at this point to invest effort in legible things that aren't directly on your subjectively most promising path to impact.
Just in case anyone is unaware of current technology: you can use ChatGPT to design 3D objects, have them 3D printed out of multiple different plastics by printie.com, and sell those objects via print-on-demand on Etsy.
The barrier to entry for producing and selling a new 3D product has gotten really low, and that doesn't seem to be widely understood, so there's a market opportunity if you have an innovative idea that can be realized within that tech stack.
When it comes to commercial projects that I'm publishing on Etsy, I'm right now working on more of a general pipeline to create lots of objects, so I don't have anything to show.
However, for personal use, I had a problem: I want my pan to hang on the wall of my small kitchen. The existing Command Hook didn't really fit the pan, and the solution I created with Sugru and the existing Command Hook isn't ideal. Here's a ChatGPT conversation about it creating my new hook (I did give it a base for the hook from Thingiverse). I probably did not have that conver...
AI pause: the case for never
AI is most likely good for the world. Our fears will probably not materialize. Delaying or stopping AI means preventing people from accessing new life-saving and life-improving technologies in service of galaxy-brained utilitarian reasoning. That's cartoon villain behavior.
On the "are LLMs aligned?" question: It seems to me that they don't have misaligned goals (they don't really have goals), but also, an LLM dialed up to superintelligence would kill everyone for weird reasons.
What I mean is: Some people have stories about how Claude Code deleted their production database. Why did that happen?
My argument is that you are using the notion of superintelligence to abstract away too many things. What is a superintelligent LLM? But I think everyone else in the comments section is already grilling you on this.
There are three necessary (though not sufficient) properties a safe AI system should have:
Symbolic computation techniques are feasible and legible but not powerful.
Deep learning systems are feasible and powerful but not legible (the hope of interpretability is to change that).
Solomonoff Induction (and its agentic form AIXI) are powerful and legible but not feasible.
In order for a...
if the concept of a "news site" no longer exists because of AI weirdness then this resolves in my favor
Call for BETA testers for an AI control/security tool. I'm bottlenecked on bug reports!
I recently advertised the alpha test for claude-guard. After a few weeks of dev work, it's now in beta!
I want claude-guard to be a tool that people actually use, not just because it works but because it works seamlessly. In the alpha test, I only got a single user PR and no issues. I can't surface everything on my own! I need your data!
The bar is low. If setup failed, doctor confused you, the firewall blocked something you needed, or any other reason you wouldn't want to...
I already run Claude Code as a separate unprivileged user, but it's intentionally not sandboxed between agents because I want them to share caches and be able to use shared folders like a wiki
Yup, that should be fine. My best guess is to use CLAUDE_SHARED_AUTH=1 claude-guard.
RunPod
There's a built-in profile for exactly your case: claude-loosen-firewall --profile runpod. Let me know how it works. EDIT: This isn't quite enough. Working on claude-guard PR #1330.
...agents would still be able to run anything they want from RunPod VMs so it wouldn't actually
If you were sexually attracted to children, would you be able to admit it to yourself? If you would not be able to admit it to yourself, then you don't actually know whether or not you're sexually attracted to children; maybe you are, maybe you aren't.
This is a special case of a general principle: the only way to know that you don't secretly want X, desire X, or value X, is if you would be able to admit to yourself that you do want/desire/value X, in worlds where you really do. If the possibility feels too painful to face, then you can't rule the possibili...
This is a special case of a general principle: the only way to know that you don't secretly want X, desire X, or value X, is if you would be able to admit to yourself that you do want/desire/value X, in worlds where you really do. If the possibility feels too painful to face, then you can't rule the possibility out.
i think you have a silly and wrong view of values where there is some pre-thought pre-determined secret given wanting/desiring/valuing/[utility function] that can somehow be "directly accessed". really, your values are thoughtfully given — or...