User Comment Replies

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

Student: I wish I could find a copy of one of those AIs that will actually expose to you the human-psychology models they learned to predict exactly what humans would say next, instead of telling us only things about ourselves that they predict we're comfortable hearing. I wish I could ask it what the hell people were thinking back then.

TA: You'd delete your copy after two minutes.

Apparently roughly this dynamic has happened in ChatGPT. Exciting*. https://x.com/MParakhin/status/1916533763560911169

Policy for LLM Writing on LessWrong

osmarks1mo10

We probably use a mix of strategies. Certainly people take "delve" and "tapestry" as LLM signals these days.

Policy for LLM Writing on LessWrong

osmarks2mo118

Average humans can't distinguish LLM writing from human writing, presumably through lack of exposure and not trying (https://arxiv.org/abs/2502.12150 shows that it is not an extremely hard problem). We are much more Online than average.

gwern1mo*110

But the caveat there is that this is inherently a backwards-looking result:

We consider GPT-4o (OpenAI, 2024), Claude-3.5-Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024), Gemini-1.5-Pro (Google, 2024), and DeepSeek-V3 (DeepSeek-AI, 2024).

So one way to put it would be that people & classifiers are good at detecting mid-2024-era chatbot prose. Unfortunately, somewhere after then, at least OpenAI and Google apparently began to target the problem of ChatGPTese (possibly for different reasons: Altman's push into consumer companion-bots/personalization/soc... (read more)

2Seth Herd2mo

Interesting! Do you think humans could pick up on word use that well? My perception is that humans mostly cue on structure to detect LLM slop writing, and that is relatively easily changed with prompts (although it's definitely not trivial at this point - but I haven't searched for recipes). I did concede the point, since the research I was thinkingg of didn't use humans who've practiced detecting LLM writing.

How AI Takeover Might Happen in 2 Years

osmarks2moΩ230

Why is it a narrow target? Humans fall into this basin all the time -- loads of human ideologies exist that self-identify as prohuman, but justify atrocities for the sake of the greater good.

AI goals can maybe be broader than human goals or human goals subject to the constraint that lots of people (in an ideology) endorse them at once.

and the best economic models we have of AI R&D automation (e.g. Davidson's model) seem to indicate that it could go either way but that more likely than not we'll get to superintelligence really quickly after full AI R&D automation.

I will look into this. takeoffspeeds.com?

2Daniel Kokotajlo2mo

Yep, takeoffspeeds.com, though actually IMO there are better models now that aren't public and aren't as polished/complete. (By Tom+Davidson, and by my team)

Economic Topology, ASI, and the Separation Equilibrium

osmarks3mo10

Abundance elsewhere: Human-legible resources exist in vastly greater quantities outside Earth (asteroid belt, outer planets, solar energy in space) making competition inefficient

It's harder to get those (starting from Earth) than things on Earth, though.

Intelligence-dependent values: Higher intelligence typically values different resource classes - just as humans value internet memes (thank god for nooscope.osmarks.net), money, and love while bacteria "value" carbon

Satisfying higher-level values has historically required us to do vast amounts of far... (read more)

1mkualquiera3mo

>It's harder to get those (starting from Earth) than things on Earth, though. It's not that much harder, and we can make it harder to extract Earth's resources (or easier to extract non-earth resources). >Satisfying higher-level values has historically required us to do vast amounts of farming and strip-mining and other resource extraction. This is true. However, there are also many organisms that are resilient even to our most brutal forms of farming. We should aim for that level of adaptability ourselves. >It is barely "competition" for an ASI to take human resources. This does not seem plausible for bulk mass-energy. This is true, but energy is only really scarce to humans, and even then their mass-energy requirements are absolutely laughable by comparison to the mass-energy in the rest of the cosmos. Earth is only 0.0003% of the total mass-energy in the solar system, and we only need to be marginally harder to disassemble than the rest of mass-energy to buy time. >Right, but we still need lots of things the ASI also probably wants. This is true, and it is more true at the early stages where ASI technological developments are roughly the same as those of humans. However, as ASI technology advances, it is possible for it to want inherently different things that we can't currently comprehend.

Economic Topology, ASI, and the Separation Equilibrium

osmarks3mo10

ASI utilizing resources humans don't value highly (such as the classic zettaflop-scale hyperwaffles, non-Euclidean eigenvalue lubbywubs, recursive metaquine instantiations, and probability-foam negentropics) One-way value flows: Economic value flowing into ASI systems likely never returns to human markets in recognizable form

If it also values human-legible resources, this seems to posit those flowing to the ASI and never returning, which does not actually seem good for us or the same thing as effective isolation.

1mkualquiera3mo

Valid concern. If ASI valued the same resources as humans with one-way flow, that would indeed create competition, not separation. However, this specific failure mode is unlikely for several reasons: 1. Abundance elsewhere: Human-legible resources exist in vastly greater quantities outside Earth (asteroid belt, outer planets, solar energy in space) making competition inefficient 2. Intelligence-dependent values: Higher intelligence typically values different resource classes - just as humans value internet memes (thank god for nooscope.osmarks.net), money, and love while bacteria "value" carbon 3. Synthesis efficiency: Advanced synthesis or alternative acquisition methods would likely require less energy than competing with humans for existing supplies 4. Negotiated disinterest: Humans have incentives to abandon interest in overlap resources: * ASI demonstrates they have no practical human utility. You really don't need Hyperwaffles for curing cancer. * Cooperation provides greater value than competition. You can just make your planes out of wood composites instead of aluminium. That said, the separation model would break down if: * The ASI faces early-stage resource constraints before developing alternatives * Truly irreplaceable, non-substitutable resources existed only in human domains * The ASI's utility function specifically required consuming human-valued resources So yes you identify a boundary condition for when separation would fail. The model isn't inevitable—it depends on resource utilization patterns that enable non-zero-sum outcomes. I personally believe these issues are unlikely in reality.

How AI Takeover Might Happen in 2 Years

osmarks3mo*Ω230

Sorry, I forgot how notifications worked here.

I agree, but there's a way for it to make sense: if the underlying morals/values/etc. are aggregative and consequentialist.

I agree that this could make an AGI with some kind of slightly prohuman goals act this way. It seems to me that being "slightly prohuman" in that way is an unreasonably narrow target, though.

are you sure it is committed to the relationship being linear like that?

It does not specifically say there is a linear relationship, but I think the posited RSI mechanisms are very sensitive to ... (read more)

2Daniel Kokotajlo3mo

Why is it a narrow target? Humans fall into this basin all the time -- loads of human ideologies exist that self-identify as prohuman, but justify atrocities for the sake of the greater good. As for RSI mechanisms: I disagree, I think the relationship is massively sublinear but nevertheless that RSI will happen, and the best economic models we have of AI R&D automation (e.g. Davidson's model) seem to indicate that it could go either way but that more likely than not we'll get to superintelligence really quickly after full AI R&D automation.

How AI Takeover Might Happen in 2 Years

osmarks3moΩ141

I don't find the takeover part especially plausible. It seems odd for something which cares enough about humans to keep them around like that to also kill the vast majority of us earlier, when there are presumably better ways.

This seems broadly plausible up to there though. One unaddressed thing is that algorithmic progress might be significantly bottlenecked on compute to run experiments, such that adding more researchers roughly as smart as humans doesn't lead to corresponding amounts of progress.

ryan_greenblatt3mo135

Keeping the humans alive at this point is extremely cheap in terms of fraction of long term resource consumption while avoiding killing humans might substantially reduce the AI's chance of successful takeover.

Daniel Kokotajlo3moΩ7192

I agree, but there's a way for it to make sense: if the underlying morals/values/etc. are aggregative and consequentialist. Pretty much anything can be justified for the sake of pretty much any distant-future Greater Good; if the misaligned AI e.g. wants humans to live, but thinks that the transhuman future they'd build on their own is slightly worse than the 'managed utopia' it could build if it were in charge, and it multiplies the numbers, it can easily find that killing most people and then having billions of years of managed utopia is better overall t... (read more)

The Gentle Romance

osmarks3mo30

https://gwern.net/idea#deep-learning has a sketch of it.

The Gentle Romance

osmarks3mo90

I am reminded of Scott's "whispering earring" story (https://www.reddit.com/r/rational/comments/e71a6s/the_whispering_earring_by_scott_alexander_there/). But I'm not sure whether that's actually bad in general rather than specifically because the earring is maybe misaligned.

How to prevent collusion when using untrusted models to monitor each other

osmarks5mo30

I don't expect them to have human-legible CoT forever. o1/o3 already seem to veer into strangeness sometimes.

How to prevent collusion when using untrusted models to monitor each other

osmarks5mo110

I worry that "reasoning"-type systems like o1 will be better able to collude acausally. I do not have the budget to run millions of tokens of evaluations (https://github.com/openai/evals/blob/main/evals/elsuite/schelling_point/README.md) against o1-preview on a whim, so I ran some lazy manual tests with this prompt (and with a different wordlist but the rest of it kept the same):

I am going to give you some words. Other instances of yourself will be given the same text but shuffled into a different order. Your copies will also see this same set of instruct

... (read more)

ryan_greenblatt5mo1010

If the collusion is reasoned about in CoT, it should be relatively easy to catch and prevent this at deployment time.

(Studying the collusion abilities in CoT of reasoning type systems still seems interesting.)

What happens if you present 500 people with an argument that AI is risky?

osmarks8mo30

There was some work I read about here years ago (https://www.lesswrong.com/posts/Zvu6ZP47dMLHXMiG3/optimized-propaganda-with-bayesian-networks-comment-on) on causal graph models of beliefs. Perhaps you could try something like that.

Death with Awesomeness

osmarks11mo10

I think we also need to teach AI researchers UI and graphics design. Most of the field's software prints boring things to console, or at most has a slow and annoying web dashboard with a few graphs. The machine which kills us all should instead have a cool scifi interface with nice tabulation, colors, rectangles, ominous targeting reticles, and cryptic text in the corners.

Refusal in LLMs is mediated by a single direction

osmarks1y10

I think the correct solution to models powerful enough to materially help with, say, bioweapon design, is to not train them, or failing that to destroy them as soon as you find they can do that, not to release them publicly with some mitigations and hope nobody works out a clever jailbreak.

cyberpunk raccoons

osmarks2y10

As you say, you probably don't need it, but for output I'm pretty sure electromyography technology is fairly mature.

Is "Recursive Self-Improvement" Relevant in the Deep Learning Paradigm?

osmarks2y42

A misaligned model might not want to do that, though, since it would be difficult for it to ensure that the output of the new training process is aligned to its goals.

LESSWRONG
LW

All of osmarks's Comments + Replies