Is there anyone exploring how AI might be used to increase integrity and build trustworthiness?
For example, it could scan the behaviour of people, businesses, or AIs and check whether that behaviour is consistent with stated promises, flagging anything that is not.
It might also be used to train LLMs to be consistent if they are to be used as agents.
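A minimal sketch of what that could look like, assuming a hypothetical judge function (standing in for whatever LLM call you'd actually use) that decides whether an observed behaviour is consistent with a stated promise:

```python
# Toy sketch, not a product design: flag observed behaviour that a judge
# function deems inconsistent with a stated promise.
from typing import Callable

Judge = Callable[[str, str], bool]  # (promise, observed) -> is_consistent

def audit(promises: list[str], observations: list[str], judge: Judge) -> list[tuple[str, str]]:
    """Return (promise, observation) pairs the judge flags as inconsistent."""
    return [(p, o) for p in promises for o in observations if not judge(p, o)]

if __name__ == "__main__":
    # Trivial stand-in judge; in practice `judge` would wrap an LLM call.
    toy_judge: Judge = lambda promise, observed: "late" not in observed.lower()
    print(audit(["We ship every order within 24 hours"],
                ["Order #12 shipped on time", "Order #13 shipped 9 days late"],
                toy_judge))
```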
From my perspective as a reader, Inkhaven was probably bad. No shade to the authors: this level of output is a lot of work and there was plenty I enjoyed. But it shouldn't be a surprise that causing people to write a lot more posts, even when they're not inspired, leads to a lot more uninspired posts.
A lot of the uninspired posts were still upvoted on LW. I even did some of that upvoting myself, just automatically clicking upvote as I started reading a post with an interesting first paragraph by someone whose name I recognize. Mostly this is fine, but it dilutes ...
It wasn't clear to me from the Inkhaven website that you, Ben Pace, and John Wentworth were participating to that degree (though I did mention you three), and I missed aggliu and RobertM. So fair enough, I'll retract my comment. (ETA: I missed aggliu since I didn't know their name and they had only that one LW post in November, and I thought RobertM might be Rob Miles, but none of RobertM's November LW posts seem to be listed among Rob Miles's posts on the Inkhaven website. But obviously you were there and I was not so I defer to you.)
Anecdotal evidence only. I hope this might be useful for someone, especially since semaglutide is often considered a sort of miracle drug (and for good reason). TL;DR:
I've been taki...
My wife coaches teen athletes and the signs she's taught to look out for are not weight loss but withdrawal, depression, poor digestion, feeling cold etc. Not to say that you're doing something unhealthy, just that low mood is a known effect of being in an extended calorie deficit and not necessarily an effect of the semaglutide.
As you go further into a deficit you tend to downregulate other stuff (e.g. bone and tissue maintenance) before losing weight faster, so the amount of weight loss only gives you a lower bound on your deficit. It might be worth tr...
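To make the lower-bound point concrete, here's a back-of-the-envelope sketch (assuming the common ~7700 kcal per kg of body fat rule of thumb, which is itself only approximate):

```python
# Rough lower bound on the average daily calorie deficit implied by weight loss.
# Assumes ~7700 kcal per kg of fat (a rule of thumb, not exact) and ignores
# water weight, muscle loss, and downregulated maintenance, all of which mean
# the true deficit could be substantially larger.
KCAL_PER_KG_FAT = 7700

def min_daily_deficit(kg_lost: float, days: int) -> float:
    return kg_lost * KCAL_PER_KG_FAT / days

# e.g. losing 2 kg over 4 weeks implies at least ~550 kcal/day of deficit
print(round(min_daily_deficit(2.0, 28)))
```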
Error-correcting codes work by running some algorithm to decode potentially-corrupted data. But what if the algorithm might also have been corrupted? One approach to dealing with this is triple modular redundancy, in which three copies of the algorithm each perform the computation and the output is decided by majority vote. But this still creates a single point of failure—the part where the majority voting is implemented. Maybe this is fine if the corruption is random, because the voting algorithm can constitute a very small proportion of the total...
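For concreteness, here's a minimal sketch of triple modular redundancy; note that the `majority_vote` function is exactly the unprotected single point of failure described above:

```python
# Minimal triple modular redundancy (TMR) sketch: run three copies of a
# computation and take a bitwise majority vote over their outputs.
from typing import Callable

def majority_vote(a: int, b: int, c: int) -> int:
    # Each output bit is 1 iff at least two of the three inputs agree on 1.
    # This voter is itself the single point of failure discussed above.
    return (a & b) | (a & c) | (b & c)

def tmr(f: Callable[[int], int], x: int) -> int:
    # In real TMR the three copies would run on independent hardware;
    # here they run sequentially just for illustration.
    return majority_vote(f(x), f(x), f(x))

if __name__ == "__main__":
    decode = lambda word: word ^ 0b0000_0001  # stand-in for a real decoder
    print(bin(tmr(decode, 0b1010_1011)))
```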
But it feels like you'd need to demonstrate this with some construction that's actually adversarially robust, which seems difficult.
I agree it's kind of difficult.
Have you seen Nicholas Carlini's Game of Life series? It starts by building up logic gates, all the way to a microprocessor that factors 15 into 3 x 5.
Depending on the adversarial robustness model (e.g. every second the adversary can force one square to do the opposite of what the rules dictate), it might be possible to make robust logic gates and circuits. In fact the existing circuits are a little robust already -...
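As a toy version of that robustness model (and not Carlini's actual construction), here's a standard Game of Life step plus an adversary that forces one cell per tick to behave unlawfully:

```python
# Toy model: lawful Game of Life step, then an adversary flips one chosen cell
# to the opposite of its lawful value. A "robust" circuit would need to keep
# computing correctly despite this.
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    # Count the 8 neighbours of every cell (toroidal wrap-around for simplicity).
    n = sum(np.roll(np.roll(grid, dx, 0), dy, 1)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0))
    # A cell is alive next tick if it has 3 neighbours, or is alive with 2.
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(np.uint8)

def adversarial_step(grid: np.ndarray, cell: tuple[int, int]) -> np.ndarray:
    lawful = life_step(grid)
    lawful[cell] ^= 1  # the adversary makes this one cell behave unlawfully
    return lawful

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.integers(0, 2, size=(16, 16), dtype=np.uint8)
    g = adversarial_step(g, (3, 7))
    print(g.sum(), "live cells after one adversarial step")
```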
The big question about working memory (WM) training is whether it results in transfer -- better performance on tasks other than the WM task itself. Near transfer means tasks that are similar but not identical to the WM training; far transfer means tasks that are quite different from it. Typically, studies find that WM training strongly boosts performance on the trained WM task and on near-transfer tasks, but produces only weak far transfer.
I am curious about whether any gains in far transfer might be masked by test insensitivity, noise, overshadowing by learning effects, or int...
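One quick way to sanity-check the "masked by noise" worry is a power simulation: with a small true far-transfer effect and typical group sizes, most studies would come up empty. A sketch with made-up numbers (d = 0.15 and n = 40 per group are assumptions, not values from the literature):

```python
# Power simulation: how often does a small true far-transfer effect reach
# p < 0.05 with a noisy outcome measure and modest group sizes?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, trials = 0.15, 40, 5000   # assumed effect size and per-group sample size
hits = 0
for _ in range(trials):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(d, 1.0, n)
    _, p = stats.ttest_ind(treated, control)
    hits += p < 0.05
print(f"power ≈ {hits / trials:.2f}")  # roughly 0.1, so ~90% of such studies miss it
```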
Updates about LLM agency.
The AI 2027 forecast for mid-2025 scores on SWE-bench was not correct:
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
(From the footnotes here.)
As of December 2025, the SOTA is around 81% for Claude Opus 4.5, so this threshold probably will not be passed until 2026. Still, it does not seem far off.
Also, GPT-5.1-Codex-Max has a longer task length than I expected (perhaps because it is specifically for coding? But it seems there a...
Thanks for following up! Yeah, at some point (perhaps January?) we should do a blog post retrospective enumerating all the forecasts we made in AI 2027 and comparing them to what actually happened. My general sense right now is that progress has been somewhat slower than AI 2027 expected, and even slower than I personally expected (my median was 2028 at the time), but not dramatically slower. It would be good to quantify this.
Crossposted from Twitter:
This year I’ve been thinking a lot about how the western world got so dysfunctional. Here’s my rough, best-guess story:
1. WW2 gave rise to a strong taboo against ethnonationalism. While perhaps at first this taboo was valuable, over time it also contaminated discussions of race differences, nationalism, and even IQ itself, to the point where even truths that seemed totally obvious to WW2-era people also became taboo. There’s no mechanism for subsequent generations to create common knowledge that certain facts are true but usefully ...
I am not one of the tagged people but I certainly would not so agree. One reason I would not so agree is because I have talked to leftist people (prominence debatable) who celebrated the 10/7 attacks, and when I asked them whether they support Hamas, they were coherently able to answer "no, but I support armed resistance against Israel and don't generally condemn actions that fall in that category, even when I don't approve of or condone the group organizing those actions generally." One way to know what people believe and support is to ask them. (Of cours...
I'm writing a response to https://www.lesswrong.com/posts/FJJ9ff73adnantXiA/alignment-will-happen-by-default-what-s-next and https://www.lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem where I tried to measure how "sticky" the alignment of current LLMs is. I'm proofreading and editing that now. Spoiler: Models differ wildly in how committed they are to being aligned and alignment-by-default may not be a strong enough attractor to work out.
Would anyone want to proofread this?
The capability evaluations in the Opus 4.5 system card seem worrying. The evidence provided in the system card seems pretty weak (in terms of how much it supports Anthropic's claims). I plan to write more about this in the future; here are some of my more quickly written-up thoughts.
[This comment is based on this X/twitter thread I wrote]
I ultimately basically agree with their judgments about the capability thresholds they discuss. (I think the AI is very likely below the relevant AI R&D threshold, the CBRN-4 threshold, and the cyber thresholds.) But, i...
The AIs are obviously fully (or almost fully) automating AI R&D and we're trying to do control evaluations.
REASONS BESIDES JEALOUSY TO NOT BE POLYAMOROUS
Recently Amanda Bethlehem published a post comparing monogamous jealousy to kidney disease. Eneasz Brodski doubled down on this. I disagree with a lot of their implications, but today I'm going to focus on the implicit claim that jealousy is the only reason to be monogamous. Here is a list of other reasons you might choose monogamy:
Would you also approve of other costly signals? Like, I dunno, cutting off a pinky phalanx when entering a relationship.
Someone on the EA forum asked why I've updated away from public outreach as a valuable strategy. My response:
I used to not actually believe in heavy-tailed impact. On some gut level I thought that early rationalists (and to a lesser extent EAs) had "gotten lucky" in being way more right than academic consensus about AI progress. I also implicitly believed that e.g. Thiel and Musk and so on kept getting lucky, because I didn't want to picture a world in which they were actually just skillful enough to keep succeeding (due to various psychological blockers)....
“The people you need to soften/moderate your message to reach (or who need social proof in order to get involved) are seldom going to be the ones who can think clearly about this stuff. And we are very bottlenecked on high-quality thinking.”
I think this is true in only some contexts. If we are talking about AI alignment, skilled mathematicians or AI researchers could be a very good fit, at least in directions like interpretability. And that doesn't necessarily correlate with a desire to do societally unconventional work. Or why isn't that the case?
My colleagues and I are finding it difficult to replicate results from several well-received AI safety papers. Last week, I was working with a paper that has over 100 karma on LessWrong and discovered that its claims are mostly false: it only produces nice-looking statistics because of a very specific evaluation setup. Some other papers have even worse issues.
I know that this is a well-known problem that exists in other fields as well, but I can’t help but be extremely annoyed. The most frustrating part is that this problem should be solvable. If a junior-level p...
Thank you! I look forward to seeing your proposal!
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT), while OpenAI claims they don't[1].
I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job here than the other companies, I think all of them should provide more information.
It's particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on "Reasoning faithfulness" (kudo...
Looks like this again isn't specified in the Opus 4.5 system card, despite Anthropic clarifying it for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
Can't we lean into the spikes on the jagged frontier? It's clear that specialized models can transform many industries now. Wouldn't it be better for OpenAI to release best-in-class in 10 or so domains (medical, science, coding, engineering, defense, etc.)? Recoup the infra investment, revisit AGI later?
I think Evan's post on deep double descent looks really prescient (i.e. I think it's now widely accepted that larger models tend to generalize better than smaller models, conditioned on achieving the same training loss).
https://www.lesswrong.com/posts/nGqzNC6uNueum2w8T/inductive-biases-stick-around
the implications for scheming risk are a little less clear: reasoning models don't have strong speed priors (and do inherit simplicity priors from the NN), but don't seem to be schemers (perhaps due to output-to-thinking generalization). I don't think we should upda...
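As a toy illustration of the claim in the first paragraph (bigger model, same ~zero training loss, better test error), here's a minimal random-features sketch; it's a generic demo of the phenomenon, not a reproduction of the post's experiments:

```python
# Minimum-norm least squares on random features: all widths below interpolate
# the training data (train loss ~0), but wider models typically generalize better.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_input = 30, 500, 5
w_true = rng.normal(size=d_input)

def make_data(n):
    X = rng.normal(size=(n, d_input))
    return X, X @ w_true + 0.3 * rng.normal(size=n)

Xtr, ytr = make_data(n_train)
Xte, yte = make_data(n_test)

for width in (30, 60, 120, 480):  # all >= n_train, so training loss is ~0
    W = rng.normal(size=(d_input, width)) / np.sqrt(d_input)
    phi_tr, phi_te = np.tanh(Xtr @ W), np.tanh(Xte @ W)  # random nonlinear features
    theta = np.linalg.pinv(phi_tr) @ ytr                 # minimum-norm interpolator
    print(f"width={width:4d}"
          f"  train_mse={np.mean((phi_tr @ theta - ytr) ** 2):.3f}"
          f"  test_mse={np.mean((phi_te @ theta - yte) ** 2):.3f}")
```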
LW feature request (low on the importance scale):
It would be nice to be able to refresh the TTS for a post if it has been edited. I was reading this post, and it was a bit confusing to keep track of the audio since it had been edited.