Some of Eliezer's founder effects on the AI alignment/x-safety field that seem detrimental and persist to this day:
I reject the idea that I'm confused at all.
Tons of people have said "Ethical realism is false", for a very long time, without needing to invent the term "meta-ethics" to describe what they were doing. They just called it ethics. Often they went beyond that and offered systems they thought it was a good idea to adopt even so, and they called that ethics, too. None of that was because anybody was confused in any way.
"Meta-ethics" lies within the traditional scope of ethics, and it's intertwined enough with the fundamental concerns of ethics that it's not rea...
I want to make a thing that talks about why people shouldn't work at Anthropic on capabilities and all the evidence that points in the direction of them being a bad actor in the space, bound by employees whom they have to deceive.
A very early version of what it might look like: https://anthropic.ml
Help needed! Email me (or DM on Signal) ms@contact.ms (@misha.09)
If your theory of change is convincing Anthropic employees or prospective Anthropic employees they should do something else, I think your current approach isn't going to work. I think you'd probably need to much more seriously engage with people who think that Anthropic is net-positive and argue against their perspective.
Possibly, you should just try to have less of a thesis and just document bad things you think Anthropic has done and ways that Anthropic/Anthropic leadership has misled employees (to appease them). This might make your output more useful i...
Over a decade ago I read this 17-year-old passage from Eliezer:
...When Marcello Herreshoff had known me for long enough, I asked him if he knew of anyone who struck him as substantially more natively intelligent than myself. Marcello thought for a moment and said "John Conway—I met him at a summer math camp." Darn, I thought, he thought of someone, and worse, it's some ultra-famous old guy I can't grab. I inquired how Marcello had arrived at the judgment. Marcello said, "He just struck me as having a tremendous amount of mental horsepow
I wonder how Eliezer would describe his "moat", i.e., what cognitive trait or combination of traits does he have, that is rarest or hardest to cultivate in others? (Would also be interested in anyone else's take on this.)
There is a recent intense interest in space-based datacenters.
I see almost no economic benefits to this in the next, say, 3 decades and see it as almost a recession indicator in itself.
However, it could subject the datacenter owners to significantly less (software) scrutiny from regulators.
Are there any economic arguments I'm missing? Could the regulator angle be the real unstated benefit behind them?
One argument is that energy costs 0.1 cents per kWh versus 5 cents on Earth. For now, launch costs dominate this, but in the future the balance might change.
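A back-of-envelope sketch of that balance in Python; only the two electricity prices come from the comment above, while the launch cost, specific power, and lifetime are numbers I'm assuming purely for illustration:

```python
# Back-of-envelope: when does cheap in-orbit energy beat Earth-side power,
# once you pay to launch the hardware? All parameters except the two
# electricity prices are assumptions chosen for illustration.

earth_price = 0.05            # $/kWh on Earth (from the comment above)
space_price = 0.001           # $/kWh marginal cost in orbit (from the comment above)
launch_cost_per_kg = 1500.0   # $/kg to orbit (assumed)
specific_power = 0.1          # kW of delivered power per kg launched (assumed)
lifetime_years = 10           # operating lifetime (assumed)

hours = lifetime_years * 365 * 24
# Energy-cost saving per kg of launched hardware over its lifetime:
saving_per_kg = specific_power * hours * (earth_price - space_price)
print(f"lifetime energy saving per kg: ${saving_per_kg:,.0f}")
print(f"launch cost per kg:            ${launch_cost_per_kg:,.0f}")
# With these assumed numbers the saving (~$430/kg) is well below the launch
# cost, which is the sense in which launch costs currently dominate.
```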
Accidental AI Safety experiment by PewDiePie: He created his own self-hosted council of 8 AIs to answer questions. They voted and picked the best answer. He noticed they were always picking the same two AIs, so he discarded the others, made the process of discarding/replacing automatic, and told the AIs about it. The AIs started talking about this "sick game" and scheming to prevent that. This is the video with the timestamp:
I don't think dealmaking will buy us much safety. This is because I expect that:
That said, I have been thinking about dealmaking because:
Computational Valence: Pain as NMI
Model valence as allostatic control in predictive agents with deadline-bound loops. Pain = Non-Maskable Interrupt: unmaskable, hard-preempt signal for survival-critical prediction errors; seizes executive control until the error is terminated/resolved. Pleasure = deferable optimization: reward logging for RL; no preemptive mandate. Implications:
Quick context: this sketch came out of a short exchange with Thomas Metzinger (Aug 21, 2025). He said, roughly, that we can’t answer the “when does synthetic phenomenology begin?” question yet, and that it “happens when the global epistemic space embeds a model of itself as a whole.” I took that as a target and tried to operationalize one path to it.
My proposal is narrower: if you build a predictive/allostatic agent with deadline-bound control loops and you add a self-model–accessing, non-maskable interrupt that can preempt any current policy to resolve su...
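To make the interrupt/deferral asymmetry concrete, here is a toy control-loop sketch; the class, threshold, and control flow are illustrative assumptions of mine, not a claim about how any existing agent works:

```python
# Toy sketch of "pain as a non-maskable interrupt" in a predictive agent loop.
# Everything here (names, threshold, control flow) is an illustrative assumption.

import random

PAIN_THRESHOLD = 0.9   # prediction-error level treated as survival-critical (assumed)

class ToyAgent:
    def __init__(self):
        self.reward_log = []          # "pleasure": deferable, logged for later optimization
        self.current_policy = "forage"

    def prediction_error(self):
        # Stand-in for an allostatic prediction-error signal.
        return random.random()

    def handle_pain(self, error):
        # Non-maskable: this path cannot be skipped or deferred by the running policy.
        # It seizes control until the error is resolved.
        while error > PAIN_THRESHOLD:
            self.current_policy = "resolve_error"
            error *= 0.5              # pretend corrective action shrinks the error

    def step(self):
        error = self.prediction_error()
        if error > PAIN_THRESHOLD:
            self.handle_pain(error)   # hard preempt, regardless of current policy
        else:
            self.reward_log.append(1.0 - error)  # deferable: just record, optimize later
            self.current_policy = "forage"

agent = ToyAgent()
for _ in range(10):
    agent.step()
print(agent.current_policy, len(agent.reward_log))
```

The point of the sketch is just the asymmetry: the pain path preempts whatever policy is running, while the pleasure path only writes to a log that later optimization may or may not consume.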
There appears to be a distaste for/disregard of AI ethics research (here mostly referring to bias and discrimination) on LW. Generally the idea is that such research misses the point, or is not focused on the correct kind of misalignment (i.e. the existential kind). I think AI ethics research is important (beyond its real-world implications), just like RL reward hacking in video game settings. In both cases we are showing that models learn unintended priorities, behaviours, and tendencies from the training process. Actually understanding how these tendencies form during training will be important for improving our understanding of SL and RL more generally.
It's interesting to read this in the context of the discussion of polarisation. Was this the first polarisation?
I'd be really interested in someone trying to answer the question: what updates on the a priori arguments about AI goal structures should we make as a result of empirical evidence that we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is both familiar with the conceptual arguments about scheming and also relevant AI safety literature (and maybe AI literature more broadly).
Maybe a good structure would be, from the a priori arguments, identifying core uncertainties like "How strong is the imitative prior?" A...
Copypasting from a slack thread:
I'll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
Why don’t we think about and respect the miracle of life more?
The spiders in my home continue to provide me with prompts for writing.
As I started taking a shower this morning, I noticed a small spider on the tiling. While I generally capture and release spiders from my home into the wild, this was an occasion where it was too inconvenient to: 1) stop showering, 2) dry myself, 3) put on clothes, 4) put the spider outside.
I continued my shower and watched the spider, hoping it might figure out some form of survival.
It came very close.
First it was meandering ...
I present four methods to estimate the Elo Rating for optimal play: (1) comparing optimal play to random play, (2) comparing optimal play to sensible play, (3) extrapolating Elo rating vs draw rates, (4) extrapolating Elo rating vs depth-search.
Random plays completely random legal moves. Optimal plays perfectly. Let ΔR denote the Elo gap between Random and Optimal. Random's expected score is given by E_Random = P(Random wins) + 0.5 × P(Random draws). This is related to the Elo gap via the formula E_Ran...
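For reference, the standard logistic Elo relation can be inverted to recover the gap from an expected score; a quick Python sketch (the example score at the bottom is made up, not a measured draw rate):

```python
# Standard logistic Elo relation: if Random's expected score against Optimal is E,
# then E = 1 / (1 + 10**(dR / 400)), where dR is Optimal's rating minus Random's.
# Inverting gives dR = 400 * log10(1/E - 1). The example E below is made up.
import math

def elo_gap(expected_score: float) -> float:
    """Elo gap implied by the weaker player's expected score."""
    return 400 * math.log10(1 / expected_score - 1)

# e.g. if Random scored one draw in a million games against Optimal:
E_random = 0.5 * 1e-6
print(f"implied Elo gap: {elo_gap(E_random):.0f}")   # ~2520
```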
I agree none of this is relevant to anything, I was just looking for intrinsically interesting thoughts about optimal chess.
I thought at least CDT could be approximated pretty well with a bounded variant; causal reasoning is a normal thing to do. FDT is harder, but some humans seem to find it a useful perspective, so presumably you can have algorithms meaningfully closer or further, and that is a useful proxy for something.
Actually never mind, I have no experience with the formalisms.
I guess "choose the move that maximises your expected value" is technical...
people look into universal moral frameworks like utilitarianism and EA because they lack self-confidence to take a subjective personal point of view. They need to support themselves with an "objective" system to feel confident that they are doing the correct thing. They look for external validation.
Huh. You are kind of proving my point here and you don't even seem to realize it. Alright, I will answer.
Don't guess, do research and question your bias. There is significant hard data on this. The Gates Foundation was focused on stuff like buying mosquito nets (an EA classic) and vaccinations, but somehow several of these efforts just failed. They tried to figure out why.
SJW terminology and 'vibes' are stuff intelligent people don't really bother with much where I live. We are not living in propaganda bubbles in the EU as much as in some other places.
Approach th...
Some very unhinged conversations with 4o starting at 3:40 of this video. [EDIT: fixed link]
… it started prompting more about baby me, telling me what baby me would say and do. But I kept pushing. Baby me would never say that. I just think baby me would have something more important to say.
I was a smart baby. Everyone in my family says that. Do you think I was a smart baby? Smarter than the other babies at the hospital at least?
I kept pushing, trying to see if it would affirm that I was not only the smartest baby in the hospital. Not just the smartest
epistemic status: Going out on a limb and claiming to have solved an open problem in decision theory[1] by making some strange moves. Trying to leverage Cunningham's law. Hastily written.
p(the following is a solution to Pascal's mugging in the relevant sense)≈25%[2].
Okay, setting (also here in more detail): You have a Solomonoff inductor with some universal semimeasure as a prior. The issue is that the utility of programs can grow faster than your universal semimeasure can penalize them, e.g. a complexity prior has busy-beaver-like programs that produce ...
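To spell out the divergence in my own (sketchy) notation, with universal semimeasure $M(p) \approx 2^{-K(p)}$ and program utilities $U(p)$:

$$\mathbb{E}[U] \;=\; \sum_{p} M(p)\,U(p) \;\approx\; \sum_{p} 2^{-K(p)}\,U(p),$$

and since a length-$n$ program can output utilities on the order of $\mathrm{BB}(n)$, which outgrows the $2^{-n}$ penalty, the sum need not converge.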
Does that sound right?
Can't give a confident yes because I'm pretty confused about this topic, and I'm pretty unhappy currently with the way the leverage prior mixes up action and epistemics. The issue about discounting theories of physics if they imply high leverage seems really bad? I don't understand whether the UDASSA thing fixes this. But yes.
That avoids the "how do we encode numbers" question that naturally arises.
I'm not sure how natural the encoding question is, there's probably an AIT answer to this kind of question that I don't know.
There has been a rash of highly upvoted quick takes recently that don't meet our frontpage guidelines. They are often timely, perhaps because they're political, are pitching something to the reader, or are inside baseball. These are all fine or even good things to write on LessWrong! But I (and the rest of the moderation team I talked to) still want to keep the content on the frontpage of LessWrong timeless.
Unlike posts, we don't go through each quick take and manually assign it to be frontpage or personal (and posts are treated as personal until they're actively f...
I observe that https://www.lesswrong.com/posts/BqwXYFtpetFxqkxip/mikhail-samin-s-shortform?commentId=dtmeRXPYkqfDGpaBj isn't frontpage-y but remains on the homepage even after many mods have seen it. This suggests that the mods were just patching the hack. (But I don't know what other shortforms they've hidden, besides the political ones, if any.)
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
I'm curious what you say about "which are the specific problems (if any) where you specifically think 'we really need to have solved philosophy / improved-a-lot-at-metaphilosophy' to have a decent shot at solving this?'"
Assuming by "solving this" you mean solving AI x-safety or navigating the AI transition well, I just posted a draft about this. Or if you already read that and are asking for an even more concrete example, a scenario I often think about is an otherwise aligned ASI, some time into the AI transition when things are moving very fast (from a h...
Specifically, this is the privacy policy inherited from when LessWrong was a MIRI project; to the best of my knowledge, it hasn't been updated.
Mainstream belief: Rational AI agents (situationally aware, optimize decisions, etc.) are superior problem solvers, especially if they can logically motivate their reasoning.
Alternative possibility: Intuition, abstraction, and polymathic guessing will outperform rational agents in achieving competitive problem-solving outcomes. Holistic reasoning at scale will force-solve problems intractable for much more formal agents, or at least outcompete them in speed/complexity.
2)
Mainstream belief: Non-sentient machines will eventually r...
In Improving the Welfare of AIs: A Nearcasted Proposal (from 2023), I proposed talking to AIs through their internals via things like ‘think about baseball to indicate YES and soccer to indicate NO’. Based on the recent paper from Anthropic on introspection, it seems like this level of cognitive control might now be possible:
Communicating to AIs via their internals could be useful for talking about welfare/deals because the internals weren't ever trained against, potentially bypassing strong heuristics learned from training and also making it easier to con...
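As one concrete (and entirely hypothetical) picture of what reading such a signal could look like operationally, here is a minimal linear-probe sketch; the synthetic activations, the "baseball direction", and the label setup are stand-ins I made up, not the method from the paper or the original proposal:

```python
# Minimal sketch: decode a binary "YES/NO" signal from hidden activations with a
# linear probe. The activations here are synthetic stand-ins; in practice you'd
# collect activations while instructing the model to "think about baseball" (YES)
# or "soccer" (NO). Everything below is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
yes_direction = rng.normal(size=d_model)        # pretend "baseball" direction (assumed)

def fake_activations(label: int, n: int) -> np.ndarray:
    noise = rng.normal(size=(n, d_model))
    return noise + (2.0 * yes_direction if label == 1 else 0.0)

X = np.vstack([fake_activations(1, 200), fake_activations(0, 200)])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At "deal time", read the answer off a fresh activation:
new_act = fake_activations(1, 1)
print("decoded answer:", "YES" if probe.predict(new_act)[0] == 1 else "NO")
```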
Has anyone done any experiments into whether a model can interfere with the training of a probe (like that bit in the most recent Yudtale) by manipulating its internals?