Back when I was in school and something came up on a test that required knowledge or skills I didn't have, I would often find the closest thing I did know how to do and do that instead, even though it wasn't what the question asked for, in the hope of getting partial credit.
Looking back, I'm sure that the teachers grading my tests were fully aware of what I was doing, and yet the strategy did work out for me often enough to be worth doing.
Anyway, working with LLM coding agents gives me sympathy for my teachers back then. LLMs are capable of doing a sign...
With all the discussion of age restrictions on social media, I wrote down a rough idea of how we could do age restrictions much more effectively than the current proposals. It's not typical LessWrong content but I'd still love any feedback from the LW community.
https://kaleb.ruscitti.ca/2026/01/18/private-age-verification.html
As I posted as a reply to OP, the EU is specifically working on measures to enable age verification in a convenient, privacy-preserving and accurate manner (an open source app that checks the biometric data on your ID). More generally, parents being genuinely concerned about their children's safety seems so well established to me that it makes sense to assume they are being genuine in this discussion as well.
As another counter piece of evidence, Jonathan Haidt is probably the best known scientific advocate for social media restrictions for kids, and hi...
Follow-up to this post, wherein I was shocked to find that Claude Code failed to do a low-context task which took me 4 hours and involved some skills at which I expected it to have significant advantages [1].
I kept going to see if Claude Code could eventually succeed. What happened instead was that it built a very impressive-looking 4000 LOC system to extract type and dependency injection information for my entire codebase and dump it into a sqlite database.
To my shock, the tool Claude built [2] actually worked. I ended up playing with ...
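For readers curious what that kind of extractor looks like in the abstract, here is a minimal sketch, assuming a Python codebase and treating constructor parameters as a rough proxy for injected dependencies. It is purely illustrative: this is not the tool Claude built, and the table layout and function names are mine.

```python
# Illustrative sketch (not the actual tool): walk a Python codebase, record each
# class and its constructor parameters, and dump the results into SQLite.
import ast
import sqlite3
from pathlib import Path

def extract(root: str, db_path: str = "deps.sqlite") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deps "
        "(file TEXT, class TEXT, param TEXT, annotation TEXT)"
    )
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(), filename=str(path))
        except SyntaxError:
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if not isinstance(node, ast.ClassDef):
                continue
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == "__init__":
                    for arg in item.args.args[1:]:  # skip `self`
                        ann = ast.unparse(arg.annotation) if arg.annotation else None
                        conn.execute(
                            "INSERT INTO deps VALUES (?, ?, ?, ?)",
                            (str(path), node.name, arg.arg, ann),
                        )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    extract("src")
```

The real system was reportedly ~4000 LOC; the point of the sketch is only the overall shape: walk the codebase, pull out type/dependency facts, and persist them somewhere queryable.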
Plot idea: Isekaied from 2026 into some date in the past. Only goal: get cryogenically preserved into the glorious transhumanist singularity. How to influence the trajectory of history into a direction that would broadly enable this kind of future, while setting up the long-lasting infrastructure & incentives that would cryogenically preserve oneself for centuries to come?
I'm not going to write this, but I think this is a very interesting premise & problem (especially if thousands of years into the past) and would love to see someone build on it.
So...
If more people were egoistic in such a forward-looking way, the world would be better off for it.
Random note: Congressman Brad Sherman just held up If Anyone Builds It, Everyone Dies in a Congressional hearing and recommended it, saying (rough transcript, might be slight paraphrase): "they're [the AI companies] not really focused on the issue raised by this book, which I recommend, but the title tells it all, If Anyone Builds It Everyone Dies"
I think this is a clear and unambiguous example of the theory of change of the book having at least one success -- being an object that can literally be held and pointed to by someone in power.
This is awesome! It broadly aligns with my understanding of the situation, although it does miss some folks that are known to care a bunch about this from their public statements. Downloading the JSON to take a deeper look!
Strong upvoting the underlying post for Doing The Thing.
I agree that a human doesn't have cleanly defined goals and I agree with most of the additional nuances in your comment to the extent that I can understand them, but OP is talking about superintelligence and I think modelling a superintelligence as having a constant-across-time utility function is appropriate.
The dream is that prediction markets greatly outperform individual experts, but there's a limit on how much this can actually happen. The reason prediction markets aren't more useful is that you can only profit from one if gathering the information costs less than the money you'd make by trading on it.
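As a toy sketch of that condition before the concrete example below (the simple liquidity model and all numbers here are my own illustrative assumptions):

```python
# Toy model of the profit condition: research a question only if the money you
# can extract from correcting the market exceeds the cost of finding out.
def worth_researching(
    market_price: float,      # current market probability, e.g. 0.5
    liquidity_usd: float,     # dollars you can trade near the current price
    research_cost_usd: float, # cost of actually finding out the answer
) -> bool:
    # Simplification: research reveals the answer with certainty, and you can
    # deploy `liquidity_usd` on whichever side turns out to be right, at the
    # pre-research prices. A dollar of YES bought at p pays 1/p if YES; a
    # dollar of NO bought at (1 - p) pays 1/(1 - p) if NO.
    p = market_price
    profit_if_yes = liquidity_usd * (1 - p) / p
    profit_if_no = liquidity_usd * p / (1 - p)
    expected_profit = p * profit_if_yes + (1 - p) * profit_if_no
    return expected_profit > research_cost_usd

# A thin market with only $100 of tradeable liquidity at 50% can repay at most
# ~$100 in expectation, so any research costing more than that isn't worth doing.
print(worth_researching(0.5, 100, 200))  # False
```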
Let's imagine I write down either "horse" or "donkey" on a slip of paper, and put that paper in a drawer in my kitchen. I then create a prediction market based on what's written on the paper. The market would sit at around 50%. Maybe people would...
I agree that it's plausible there could be some benefit to creating an AI prediction market.
I mostly haven't taken any of the other AI benchmarks seriously, but I just looked into ForecastBench and surprisingly it seems to me to be worth taking seriously. (The other benchmarks are just like "hey, we promise there aren't similar problems in the LLM's training data! Trust us!") I notice their website suggests ForecastBench is a "proxy for general intelligence", so it seems like I'm not the only one who thinks forecasting and general intelligence might be rel...
Has anybody checked if finetuning LLMs to have inconsistent “behavior” degrades performance? Like you finetuned a model on a bunch of aligned tasks like writing secure code and offering compassionate responses to individuals in distress, but then you tried to specifically make it indifferent to animal welfare? It seems like that would create internal dissonance in the LLM which I would guess causes it to reason less effectively (since the character it’s playing is no longer consistent).
Just said to someone that I would by default read anything they wrote to me in a positive light, and that if they wanted to be mean to me in text, they should put '(mean)' after the statement.
Then realized that, if I had to put '(mean)' after everything I wrote on the internet that I wanted to read as slightly abrupt or confrontational, I would definitely be abrupt and confrontational on the internet less.
I am somewhat more confrontational than I endorse, and having to actually say to myself and the world that I was intending to be harsh, rather than simpl...
Yes exactly; I'm curious about how many more opportunities for greater intentionality/reflective-endorsement might be lurking in other areas, where I just haven't created the right test/handle (but may believe that I've created the right one).
I'm also mindful of the opposite failure mode, though, where attempting to surface something to yourself internally actually causes you to over-index on it, leading to paralysis, where the thing was only present in very small doses and your threshold was poorly calibrated.
Ilya's AGI predictions circa 2017 (Musk v. Altman, Dkt. 379-40):
...Within the next three years, robotics should be completely solved, AI should solve a long-standing unproven theorem, programming competitions should be won consistently by AIs, and there should be convincing chatbots (though no one should pass the Turing test). In as little as four years, each overnight experiment will feasibly use so much compute that there's an actual chance of waking up to AGI, given the right algorithm - and figuring out the algorithm will actually happen within 2-
Zilis, in email to Musk about OpenAI (Id., Dkt 379-45):
Tech:
-Says Dota 5v5 looking better than anticipated.
-The sharp rise in Dota bot performance is apparently causing people internally to worry that the timeline to AGI is sooner than they'd thought before.
-Thinks they are on track to beat Montezuma's Revenge shortly.
I'm saying that major deployments might well happen in 2026, not that they already have.
Most obviously, he's already sent a small number of troops into Venezuela for a mission, plus he's literally said that he wants the US to be in charge of Venezuela (his Administration walked it back, but he himself still seems happy with the thought). There are plausible things that could happen that would lead him to send in a significant number of soldiers.
Secondly, he allegedly asked the Joint Chiefs of Staff to draw up an invasion plan for Greenland. Which would cer...
On X/twitter Jerry Wei (Anthropic employee working on misuse/safeguards) wrote something about why Anthropic ended up thinking that training data filtering isn't that useful for CBRN misuse countermeasures:
...An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn't know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn’t think it was the right fit. Here
Training a 6B model for 500B tokens costs about $20k. That number increases linearly with model size and number of tokens, and decreases with the amount of money you have. Work like this is super doable, especially at large labs.
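A back-of-the-envelope check, assuming the common ~6·N·D FLOPs-per-token rule of thumb plus an assumed GPU throughput, utilization, and rental price (none of these figures come from the original post):

```python
# Back-of-the-envelope training cost using the ~6 * N * D FLOPs estimate.
# GPU throughput, utilization, and hourly price are assumptions for illustration.
def training_cost_usd(
    n_params: float = 6e9,         # model size in parameters
    n_tokens: float = 500e9,       # training tokens
    peak_flops: float = 1e15,      # ~1 PFLOP/s peak bf16, H100-class GPU (assumed)
    utilization: float = 0.4,      # assumed model FLOPs utilization
    usd_per_gpu_hour: float = 2.0, # assumed rental price
) -> float:
    total_flops = 6 * n_params * n_tokens
    gpu_seconds = total_flops / (peak_flops * utilization)
    return (gpu_seconds / 3600) * usd_per_gpu_hour

print(round(training_cost_usd()))  # ~25,000 USD with these assumptions
```

With these assumptions the estimate lands in the same ballpark as the quoted ~$20k, and the linear scaling with parameters and tokens falls straight out of the 6·N·D term.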
The 50% reliability mark on METR is often interpreted wrongly. A long 50% time horizon is more useful than it seems, because a 50% failure rate doesn't mean 50% of the time your output is useful and 50% of the time it's worthless.
For shorter tasks this may be true, since fixing a short task takes as much time as just doing it yourself. But for longer tasks, among the 50% of failures, it's more like 30% of the time you need to nudge it a bit, 10% of the time you need to go to another model, and the final 10% of the time you need to sit down and take your time debugging.
Even in the final 10% where I need to debug, sometimes looking at the model's attempt teaches me a trick or two about a tool I didn't know about, or an obscure feature of a tool I did know about.
On the flip side, the 50% "success" rate is more like 30% "you need substantial follow-up in the form of additional instructions or manual edits to get something production ready", 10% "it's not production ready but in a way you can solve with a new workflow/skill step that is always safe or a noop", 10% "it's fine to deploy as-is".
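A toy expected-value calculation over the breakdown above; the probabilities are the rough percentages from these two comments, while the per-outcome time costs (expressed as fractions of doing the task entirely by hand) are assumed purely for illustration.

```python
# Toy expected-value calculation for the outcome mix described above.
# Probabilities: rough percentages from the comments. Time costs: assumed,
# as fractions of the time it would take to do the task by hand.
outcomes = [
    (0.30, 0.40, "success, but needs substantial follow-up instructions/edits"),
    (0.10, 0.20, "success after adding a reusable workflow/skill step"),
    (0.10, 0.05, "fine to deploy as-is"),
    (0.30, 0.15, "failure, but a small nudge fixes it"),
    (0.10, 0.30, "failure, retry with another model"),
    (0.10, 0.80, "failure, sit down and debug it yourself"),
]

assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9

expected_fraction = sum(p * t for p, t, _ in outcomes)
print(f"Expected human time: {expected_fraction:.2f}x of doing it yourself")
# With these (made-up) costs, a "50% reliable" agent still cuts human time to ~0.3x.
```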
having the right mental narrative and expectation setting when you do something seems extremely important. the exact same object experience can be anywhere from amusing to irritating to deeply traumatic depending on your mental narrative. some examples:
a skill which I respect in other people and which I aspire towards is noticing when other people are experiencing suffering due to violations of positive narratives, or fulfillment of negative narratives, and comforting them and helping nudge them back into a good narrative.
interesting, any examples online?
I tried ChatGPT(-5.2-Thinking) on the original D&D.Sci challenge (which is tough, but not tricky) and it got almost a perfect answer, one point shy of the optimal.
I also tried ChatGPT on the second D&D.Sci challenge (which is tricky, but not tough), and it completely failed (albeit in a sensible and conservative manner). Repeated prompts of "You're missing something, please continue with the challenge" didn't help.
There is also a seventh book planned.
Many, many things in the world are explained by society having an unconscious, visceral forecast of AGI timelines in the late 2020s. Many actors' behaviors make more sense if you assume they have late-2020s timelines.
I disagree with your post, but I will add an additional example: falling birthrates. I don't remember in which of his essays it was (probably in Fanged Noumena), but Nick Land posits that the technocapital system of capitalism which he views as being AGI has figured out that it won't need humans much longer and thus has no incentive to keep the birth rates up. I obviously do not literally believe this, but I think it helps illustrate what you're trying to describe.
Here are some of my claims about AI welfare (and sentience in general):
"Utility functions" are basically internal models of reward that are learned by agents as part of modeling the environment. Reward attaches values to every instrumental thing and action in the world, which may be understood as gradients dU/dx of an abstract utility over each thing---these are various pressures, inclinations, and fears (or "shadow prices") while U is the actual experienced pleasure/pain that must exist in order to justify belief in these gradients.
If the agent alwa