LESSWRONG

All of Tao Lin's Comments + Replies

When is it important that open-weight models aren't released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.

Tao Lin24dΩ684

This overestimates the impact of large models on external safety research. My impression is that the AI safety community has barely used the DeepSeek R1 and V3 open-source weights at all. I checked again and still see little evidence of V3/R1 weights in safety research. People use R1 distill 8b and QwQ 32b, but the decision to open source the most capable small model is different from the decision to open source the frontier. So then it matters when 8b or 32b models can assist with bioterrorism, which happens a bit later, and we get most of the benefits of op...

2ryan_greenblatt24d

I expect the benefits of open large models on safety research to increase over time as open source tooling improves.

Season Recap of the Village: Agents raise $2,000

Tao Lin1mo180

What are your API costs, and how do they compare to the $ raised?

Shoshannah Tekofsky1mo230

Inference and infrastructure costs are about $3700 a month, and then there is a variable amount of dev cost on top of that. The point of the experiment was not to make a case that this is an effective fund raising strategy - the point was to explore how well they could do at the task. Which, I think, is surprisingly well :)

steve2152's Shortform

Tao Lin2mo10

I can somewhat see where you're coming from about a new method being orders of magnitude more data efficient in RL, but I very strongly bet on transformers being core even after such a paradigm shift. I'm curious whether you think the transformer architecture and text input/output need to go, or whether the new training procedure / architecture fits in with transformers because transformers are just the best information mixing architecture.

2Noosphere892mo

My guess is that the main issue of current transformers will turn out to be the fact that they don't have a long-term state/memory, and I think this is a pretty critical part of how humans are able to learn on the job as effectively as they do. The trouble, as I've heard it, is that the other approaches which incorporate a state/memory for the long run are apparently much harder to train reasonably well than transformers, plus first-mover effects.

Warty's Shortform

Tao Lin2mo20

Calibration is a super important signal of quality because it means you can actually act on the given probabilities! Even if someone is gaming calibration by betting given ratios on certain outcomes, you can still bet on their predictions and not lose money (often). That is far better than other news sources such as tweets or NYT or whatever. If a calibrated predictor and a random other source are both talking about the same thing, the fact that the predictor is calibrated is enough to make them the #1 source on that topic.
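To make "calibrated" concrete, here is a minimal sketch of a calibration check: group forecasts by stated probability and compare each group's stated probability with how often the events actually happened. The `calibration_table` helper and the toy forecasts are made up for illustration; they are not data from any real predictor.

```python
from collections import defaultdict

def calibration_table(forecasts):
    """forecasts: list of (stated_probability, outcome), with outcome 0 or 1.

    Returns {stated_probability: (observed_frequency, count)}.
    """
    buckets = defaultdict(list)
    for prob, outcome in forecasts:
        # Bucket by stated probability rounded to one decimal place.
        buckets[round(prob, 1)].append(outcome)
    return {p: (sum(o) / len(o), len(o)) for p, o in sorted(buckets.items())}

# Hypothetical toy data: ten forecasts at 70% (7 came true), ten at 20% (2 came true).
toy = [(0.7, 1)] * 7 + [(0.7, 0)] * 3 + [(0.2, 1)] * 2 + [(0.2, 0)] * 8

for prob, (freq, n) in calibration_table(toy).items():
    print(f"stated {prob:.0%}  observed {freq:.0%}  (n={n})")
```

A predictor whose observed frequencies track the stated probabilities like this is one you can bet alongside at the stated odds without expecting to lose money, which is the property the comment is pointing at.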

Semen and Semantics: Understanding Porn with Language Embeddings

Tao Lin2mo1019

Incest is not a subcategory of sexual violence, and it's unethical for unrelated reasons. Then again I see the appeal of sexual violence porn but not incest porn, and maybe incest appeals to other people because they conflate it with violence?

3Viliam2mo

Not in theory... but in practice, I think most sexual abuse happens in families.

-20future_detective2mo

How Fast Can Algorithms Advance Capabilities? | Epoch Gradient Update

Tao Lin2mo127

Some compute dependent advancements are easier to extrapolate from small scale than others. For instance, I strongly suspect that small scale experiments + naively extrapolating memory usage is sufficient to discover (and be confident in) GQA. Note that the gpt-4 paper predicted the performance of gpt-4 from 1000x scaled down experiments! The gpt-4 scaling law extrapolation, and similar scaling laws work, is proof that a lot of advances can be extrapolated from much smaller compute scale.
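As a rough illustration of this kind of extrapolation, the sketch below fits a pure power law to a handful of small-compute runs and predicts the loss at 1000x more compute. Everything here is an assumption for illustration: the data is synthetic (generated from a made-up law, not real training runs), and real scaling-law fits often add an irreducible-loss term that is omitted here.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope

def predict(a, b, compute):
    """Loss predicted by the power law loss = a * compute^(-b)."""
    return a * compute ** (-b)

# Synthetic small-scale runs from a made-up law (a=10, b=0.05); not real data.
TRUE_A, TRUE_B = 10.0, 0.05
small_compute = [1e18, 3e18, 1e19, 3e19, 1e20]
small_loss = [predict(TRUE_A, TRUE_B, c) for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
big = 1e23  # 1000x the largest small-scale run
print(f"fitted a={a:.3f}, b={b:.4f}, predicted loss at {big:.0e} FLOP: {predict(a, b, big):.4f}")
```

With noiseless synthetic data the fit recovers the generating law almost exactly; with real runs, the interesting question is how much noise and curvature the extrapolation can tolerate.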

4O O2mo

Yeah but aren’t false positives also a problem here?

Generating the Funniest Joke with RL (according to GPT-4.1)

Tao Lin2mo116

GPT-4.1 is an especially soulless model. It's intended for API use only, whereas chatgpt-latest is meant to chat with humans. It's not as bad as o1-mini - that model is extremely autistic and has no concept of emotion. This would work much better with ~pretrained models. Likely you can get gpt-4-base or llama 405b base to do much better with just prompting and no RL.

3cubefox2mo

Or DeepSeek-V3-Base.

MichaelDickens's Shortform

Tao Lin2mo178

Note that any competent capital holder has a significant conflict of interest with AI. AI is already a significant fraction of the stock market, and a pause would bring down most capital, not just private lab equity.

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Tao Lin3mo151

I agree frontier models severely lack spatial reasoning on images, which I attribute to a lack of in-depth spatial discussion of images on the internet. My model of frontier models' vision capabilities is that they have very deep knowledge of aspects of images that relate to text that happens to be immediately before or after it in web text, and only a very small fraction of images on the internet have accompanying in-depth spatial discussion. The models are very good at for instance guessing the location of where photos were taken, vastly better than most humans, because locations are more often mentioned around photos. I expect that if labs want to, they can construct enough semi-synthetic data to fix this.

9Adam Karvonen3mo

I do agree that it looks like there has been a lack of data to address this ability. That being said, I'm pretty surprised at how terrible models are, and there's a hierarchy of problems to be addressed here before models are actually useful in the physical world. Each step feels much more difficult than the step before, and all models are completely terrible at steps 2-4.
1. First, simply look at a part and identify features / if a part is symmetric / etc. This requires basically no spatial reasoning ability, yet almost all models are completely terrible. Even Gemini is very bad. I'm pretty surprised that this ability didn't just fall out of scaling on data, but it does seem like this could be easily addressed with synthetic data.
2. Have some basic spatial reasoning ability where you can propose operations that are practical and aren't physically impossible. This is much more challenging. First, it could be difficult to automatically generate practical solutions. Secondly, it may require moving beyond text chain of thought - when I walk through a setup, I don't use language at all and just visualize everything.
3. Have an understanding of much of the tacit knowledge in machining, or rederive everything from first principles. Getting data could be especially challenging here.
4. Once you can create a single part correctly, now propose multiple different ways to manufacture the part. Evaluate all of the different plans and choose the best combination of cost, simplicity, and speed. This is the part of the job that's actually challenging.

7Rana Dexsin3mo

You mean something like using libraries of 3D models, maybe some kind of generative grammar of how to arrange them, and renderer feedback for what's actually visible to produce (geometry description, realistic rendering) pairs to train visuospatial understanding on?

Show, not tell: GPT-4o is more opinionated in images than in text

Tao Lin3mo20

Yeah they may be the same weights. The above quote does not absolutely imply the same weights generate the text and images IMO, just that it's based on the 4o and sees the whole prompt. OpenAI's audio generation is also 'native', but it's served as a separate model on the API with different release dates, and you can't mix audio and some function calling in chatgpt in a way that's consistent with them not actually being the same weights.

2eggsyntax3mo

Of course we don't know the exact architecture, but although 4o seems to make a separate tool call, that appears to be used only for a safety check ('Is this an unsafe prompt'). That's been demonstrated by showing that content in the chat appears in the images even if it's not mentioned in the apparent prompt (and in fact they can be shaped to be very different). There are some nice examples of that in this twitter thread.

Show, not tell: GPT-4o is more opinionated in images than in text

Tao Lin3mo10

Note that the weights of 'gpt-4o image generation' may not be the same - they may be separate finetuned models! The main 4o chat LLM calls a tool to start generating an image, which may use the same weights but may instead use different weights that have different post-training.

4cubefox3mo

source

Why do many people who care about AI Safety not clearly endorse PauseAI?

Tao Lin3mo0-1

EU AI Code of Practice is better, a little closer to stopping ai development

4Davidmanheim3mo

Disagree that it could stop dangerous work, and doubly disagree given the way things are headed, especially with removing whistleblower protections and the lack of useful metrics for compliance. I don't think it would even be as good as SB-1047, even in the amended weaker form. I was previously more hopeful that if the EU COP was a strong enough code, then when things inevitably went poorly anyways we could say "look, doing pretty good isn't enough, we need to actually regulate specific parts of this dangerous technology," but I worry that it's not even going to be strong enough to make that argument.

Good Research Takes are Not Sufficient for Good Strategic Takes

Tao Lin3mo10

yeah there's generalization, but I do think that e.g. (AGI technical alignment strategy, AGI lab and government strategy, AI welfare, AGI capabilities strategy) are sufficiently different that experts at one will be significantly behind experts on the others

Good Research Takes are Not Sufficient for Good Strategic Takes

Tao Lin3mo70

Also, if you're asking a panel of people, even those skilled at strategic thinking will still be useless unless they've thought deeply about the particular question or adjacent ones. And skilled strategic thinkers can get outdated quickly if they haven't thought seriously about the problem in a while.

3Dan H3mo

If a strategy is likely to be outdated quickly it's not robust and not a good strategy. Strategies should be able to withstand lots of variation.

4Neel Nanda3mo

I'm not trying to agree with that one. I think that if someone has thought a bunch about the general topic of AI and has a bunch of useful takes, they can probably convert this on the fly to something somewhat useful, even if it's not as reliable as it would be if they spent a long time thinking about it. Like I think I can give useful technical mechanistic interpretability takes even if the question is about topics I've not spent much time thinking about before.

Daniel Kokotajlo's Shortform

Tao Lin4mo30

The fact that they have a short lifecycle with only 1 lifetime breeding cycle is though. A lot of intelligent animals, like humans, chimps, elephants, dolphins, orcas, have long lives with many breeding cycles and grandparent roles. Ideally we want an animal that starts breeding in 1 year AND lives for 5+ breeding cycles to be able to learn enough to be useful over its lifetime. It takes so long for humans to learn enough to be useful!

How Much Are LLMs Actually Boosting Real-World Programmer Productivity?

Tao Lin4mo126

Empirically, we likewise don't seem to be living in the world where the whole software industry is suddenly 5-10 times more productive. It'll have been the case for 1-2 years now, and I, at least, have felt approximately zero impact. I don't see 5-10x more useful features in the software I use, or 5-10x more software that's useful to me, or that the software I'm using is suddenly working 5-10x better, etc.

Diminishing returns! Scaling laws! One concrete version of "5x productivity" is "as much productivity as 5 copies of me in parallel", and we know that usually 5x-ing most inputs, like training compute and data, # of employees, etc, more often scales logarithmically instead of linearly

5Aprillion4mo

that's not how productivity ought to be measured - it should measure some output per (say) a workday
1 vs 5 FTE is a difference in input, not output, so you can say "adding 5 people to this project will decrease productivity by 70% next month and we hope it will increase productivity by 2x in the long term" ... not a synonym of "5x productivity" at all
it's the measure by which you can quantify diminishing results, not obfuscate them!
...but the usage of "5-10x productivity" seems to point to a different concept than a ratio of useful output per input 🤷 AFAICT it's a synonym with "I feel 5-10x better when I write code which I wouldn't enjoy writing otherwise"

Fabien's Shortform

Tao Lin4mo10

I was actually just making some tree search scaffolding, and I had the choice of whether or not to honestly tell each agent that it would be terminated if it failed. I ended up telling them relatively gently that they would be terminated if they failed. Your results are maybe useful to me lol

Daniel Kokotajlo's Shortform

Tao Lin4mo104

Maybe, you could define it that way. I think R1, which uses ~naive policy gradient, is evidence that long generations are different from, and much easier than, long episodes with environment interaction - GRPO (pretty much naive policy gradient) does no attribution to steps or parts of the trajectory, it just trains on the whole trajectory. Naive policy gradient is known to completely fail at more traditional long-horizon tasks like real-time video games. R1 is more like brainstorming lots of random stuff that doesn't matter and then selecting the good stuff at the end than taking actions that actually have to be good before the final output.
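To make the "no attribution to steps" point concrete, here is a simplified sketch of a group-relative advantage in the style of GRPO with outcome rewards: each sampled trajectory gets one scalar (its reward normalized against the other samples for the same prompt), and that same scalar is applied to every token in the trajectory. The function names, the use of the population standard deviation, and the toy rewards are assumptions for illustration, not R1's actual training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """One advantage per trajectory: reward normalized within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_token_advantages(rewards, lengths):
    """Broadcast each trajectory's single advantage to all of its tokens.

    There is no per-step credit assignment: a token early in a long
    generation gets exactly the same signal as the final answer token.
    """
    advantages = group_relative_advantages(rewards)
    return [[adv] * n for adv, n in zip(advantages, lengths)]

# Hypothetical group of 4 samples for one prompt: final-answer reward and token count.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 2, 4, 1]

for traj in per_token_advantages(rewards, lengths):
    print(traj)
```

In a setting with many environment interaction steps, this uniform signal has to be spread across actions that each mattered differently, which is roughly where naive policy gradient is expected to struggle.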

Daniel Kokotajlo's Shortform

Tao Lin4mo54

If by "new thing" you mean reasoning models, that is not long-horizon RL. That's many generation steps with a very small number of environment interaction steps per episode, whereas I think "long-horizon RL" means lots of environment interaction steps.

4Daniel Kokotajlo4mo

I don't think that distinction is important? I think of the reasoning stuff as just long-horizon but with the null environment of only your own outputs.

Catastrophe through Chaos

Tao Lin5mo140

I agree with this so much! Like you I very much expect benefits to be much greater than harms pre superintelligence. If people are following the default algorithm "Deploy all AI which is individually net positive for humanity in the near term" (which is very reasonable from many perspectives), they will deploy TEDAI and not slow down until it's too late.

I expect AI to get better at research slightly sooner than you expect.

MONA: Managed Myopia with Approval Feedback

Tao Lin5mo54

Interested to see evaluations on tasks not selected to be reward-hackable and try to make performance closer to competitive with standard RL

4Rohin Shah5mo

Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now. For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!

AI Timelines

Tao Lin6mo50

a hypothetical typical example would be: it tries to use the file /usr/bin/python because it's memorized that that's the path to python, that fails, then it concludes it must create that folder, which would require sudo permissions, and if it can do that, it could potentially mess something up

AI Timelines

Tao Lin6mo90

not running amok, just not reliably following instructions like "only modify files in this folder" or "don't install pip packages". Claude follows instructions correctly, some other models are mode collapsed into a certain way of doing things, eg gpt-4o always thinks it's running python in chatgpt code interpreter and you need very strong prompting to make it behave in a way specific to your computer

AI Timelines

Tao Lin6mo70

i've recently run more AI agents that run amok, and i've found Claude was actually more aligned and did stuff i asked it not to do much less often than oai models, enough that it actually made a difference lol

2Daniel Kokotajlo6mo

lol what? Can you compile/summarize a list of examples of AI agents running amok in your personal experience? To what extent was it an alignment problem vs. a capabilities problem?

Tao Lin's Shortform

Tao Lin6mo10

i'd guess effort at google/banks to be more leveraged than demos if you're only considering harm from scams and not general ai slowdown and risk

2ryan_greenblatt6mo

I think I probably agree (though uncertain as demos could prompt this effort), but I wasn't just considering reducing harm from scams. I care more about general societal understanding of AI and risks and a demo has positive spill over effects.

Tao Lin's Shortform

Tao Lin6mo21

Working on anti spam/scam features at Google or banks could be a leveraged intervention on some worldviews. As AI advances it will be more difficult for most people to avoid getting scammed, and including really great protections into popular messaging platforms and banks could redistribute a lot of money from AIs to humans

2ryan_greenblatt6mo

Why not think the scams will be run by humans (using AIs) and thus the intervention would reduce the transfer to these groups? In principle, groups could legally eat (some of) the free energy here by just red teaming everyone using a similar approach, but not actually taking their money. Currently, I'm more interested in work demonstrating that AI scams could get really good.

A Three-Layer Model of LLM Psychology

Tao Lin6mo40

Like the post! I'm very interested in how the capabilities of prediction vs character are changing with more recent models. Eg sonnet new may have more of its capabilities tied to its character. And Reasoning models have maybe a fourth layer between ground and character, possibly even completely replacing ground layer in highly distilled models

Jimrandomh's Shortform

Tao Lin6mo40

there is https://shop.nist.gov/ccrz__ProductList?categoryId=a0l3d0000005KqSAAU&cclcl=en_US which fulfils some of this

jimrandomh6mo114

Some of it, but not the main thing. I predict (without having checked) that if you do the analysis (or check an analysis that has already been done), it will have approximately the same amount of contamination from plastics, agricultural additives, etc as the default food supply.

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Tao Lin6mo70

Wow thank you for replying so fast! I donated $5k just now, mainly because you reminded me that lightcone may not meet goal 1 and that's definitely worth meeting.

About web design, I'm only slightly persuaded by your response. In the example of Twitter, I don't really buy that there's public evidence that Twitter's website work, besides user-invisible algorithm changes, has had much impact. I only use the Following page, and don't use spaces, lists, voice, or anything else on Twitter. Comparing Twitter with bluesky/threads/whatever, it really looks to me like cultural s...

4philh6mo

It's hard for me! I had to give up on trying. The problem is that if I read the titles of most posts, I end up wanting to read the contents of a significant minority of posts, too many for me to actually read.

4habryka6mo

I do think you are very likely overfitting heavily on your experience :P As an example, the majority of traffic on LW goes to posts >1 year old, and for those, it sure matters how people discover them, and what UI you have for highlighting which of the ~100k LessWrong posts to read. Things like the Best of LessWrong, Sequences and Codex pages make a big difference in what people read and what gets traffic, as does the concept page. I agree for some of the most engaged people it matters more what the culture and writing tools and other things are, but I think for the majority of LessWrong users, even weighted by activity, recommendation systems and algorithm changes and UI affordances make a big difference.

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Tao Lin6mo8-8

My main crux about how valuable Lightcone donations are is how impactful great web dev on LessWrong is. If I look around, impact of websites doesn't look strongly correlated with web design, especially on the very high end. My model is more like platforms / social networks rise or fall by zeitgeist, moderation, big influencers/campaigns (eg elon musk for twitter), web design, in that order. Olli has thought about this much more than me, maybe he's right. I certainly don't believe there's a good argument that LW web dev is responsible for its user metrics. Zeitgeist, moderation, and lightcone people personally posting seem likely more important to me. Lightcone is still great despite my (uninformed) disagreement!

3Said Achmiz6mo

I strongly disagree. In fact, Less Wrong is an excellent example of the effect of web design on impact/popularity/effectiveness (both for better and for worse; mostly better, lately).

5habryka6mo

I think you are probably thinking of "web design" as something too narrow. I think the key attribute of "good web design" is not that it looks particularly beautiful, but that it figures out how to manage high levels of complexity in a way that doesn't confuse people. And of course, a core part of that managing complexity is to make tradeoffs about the relative importance of different user actions, and communicating the consequences of user actions in a way that makes sense with the core incentives and reward loops you want to set up for your site. On Twitter, "web design" choices are things like "do you have Twitter spaces", "what dimensions of freedom do you give users for customizing their algorithm?", "how do you display long-form content on Twitter?". These choices have large effect sizes and make-or-break a platform. On LessWrong, these choices are things like "developing quick takes and figuring out how to integrate it into the site", or "having an annual review", or "having inline reacts" or "designing the post page in a way that causes people to link to them externally". And then the difficulty is not in making things nice, but in figuring out how to display all of these things in ways that don't obviously look overwhelming and broken. As a concrete example, I think quick takes have been great for the site, but they only really took off in 2023. This is because we (in this case largely thanks to the EA Forum team) finally figured out how to give them the right level of visibility for the site where it's subdued enough to not make anything you write on shortform feel high stakes, but where the best shortforms can get visibility comparable to the best posts. (I could also go into the relationship between web design and moderation, which is large and where of course how your website is structured will determine what kind of content people write, which will determine the core engine of your website. Moderation without tech changes I think is rarely tha

A breakdown of AI capability levels focused on AI R&D labor acceleration

Tao Lin6mo113

The AI generally feels as smart as a pretty junior engineer (bottom 25% of new Google junior hires)

I expect it to be smarter than that. Plausibly o3 now generally feels as smart as a 60th percentile Google junior hire.

Yonatan Cale's Shortform

Tao Lin7mo51

note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to a mine(blocks.wood) command which does not require it to first locate or look at where the wood is currently. If llms played minecraft using the human interface these misalignments would happen less

1Yonatan Cale7mo

I agree.

evhub's Shortform

Tao Lin8mo10

Building in california is bad for congresspeople! better to build across all 50 states like United Launch Alliance

evhub's Shortform

Tao Lin8mo20

I likely agree that anthropic-><-palantir is good, but i disagree about blocking the US government out of AI being a viable strategy. It seems to me like many military projects get blocked by inefficient bureaucracy, and it seems plausible to me for some legacy government contractors to get exclusive deals that delay US military ai projects for 2+ years

Daniel Kokotajlo's Shortform

Tao Lin8mo10

Why would the defenders allow the tunnels to exist? Demolishing tunnels isn't expensive, so if attackers prefer to attack through tunnels, there likely isn't enough incentive for defenders to not demolish tunnels

3Daniel Kokotajlo8mo

The expensiveness of demolishing tunnels scales with the density of the tunnel network. (Unless the blast effects of underground explosives are generally stronger than I expect; I haven't done calculations). For sufficiently dense tunnel networks, demolishing enough of them would actually be quite expensive. E.g. if there are 1000 tunnels that you need to demolish per 1km of frontline, the quantity of explosive needed to do that would probably be greater than the quantity you'd need to make a gigantic minefield on the surface. (Minefields can be penetrated... but also, demolished tunnels can be re-dug.)

The hostile telepaths problem

Tao Lin8mo111

I'm often surprised how little people notice, adapt to, or even punish self-deception. It's not very hard to detect when someone's deceiving themselves; people should notice more and disincentivise that

9Valentine8mo

A few notes:
* Sometimes this is obviously true. I agree.
* It's a curious question why many folk turn their attention away from someone else's self-deception when it's obvious. Often they don't, but sometimes they do. Why they (we) do that is an interesting question worthy of some sincere curiosity.
* Confirmation bias. You don't notice the cases where you don't pick up on someone else's self-deception.

Boy oh boy do I disagree. If someone's only option for dealing with a hostile telepath is self-deception, and then you come in and punish them for using it, thou art a dick. Like, do you think it helps the abused mothers I named if you punish them somehow for not acknowledging their partners' abuse? Does it even help the social circle around them? Even if the "hostile telepath" model is wrong or doesn't apply in some cases, people self-deceive for some reason. If you don't dialogue with that reason at all and just create pain and misery for people who use it, you're making some situation you don't understand worse. I agree that getting self-deception out of a culture is a great idea. I want less of it in general. But we don't get there by disincentivizing it.

Ratios8mo1210

This reads to me as, "We need to increase the oppression even more."

Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

Answer by Tao LinOct 16, 202410

I prefer to just think about utility, rather than probabilities. Then you can have 2 different "incentivized sleeping beauty problems"

1. Each time you are awakened, you bet on the coin toss, with $ payout. You get to spend this money on that day or save it for later or whatever
2. At the end of the experiment, you are paid money equal to what you would have made betting on your average probability you said when awoken.

In the first case, 1/3 maximizes your money, in the second case 1/2 maximizes it.

To me this implies that in real world analogues to the Sleeping Beauty problem, you need to ask whether your reward is per-awakening or per-world, and answer accordingly
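A small sketch of the two incentive schemes, under assumptions the comment does not specify: the payout is a quadratic (Brier) penalty on the stated probability of Heads, Heads means one awakening and Tails means two, and Beauty states the same probability at every awakening. A grid search then finds the probability that minimizes expected penalty in each scheme.

```python
def brier_loss(p_heads, heads):
    """Quadratic penalty for stating probability p_heads when the coin is known."""
    target = 1.0 if heads else 0.0
    return (p_heads - target) ** 2

def expected_loss_per_awakening(p):
    # Heads: 1 awakening, Tails: 2 awakenings; every awakening is scored separately.
    return 0.5 * 1 * brier_loss(p, True) + 0.5 * 2 * brier_loss(p, False)

def expected_loss_per_experiment(p):
    # Scored once at the end on the average stated probability,
    # which equals p because the same p is stated at every awakening.
    return 0.5 * brier_loss(p, True) + 0.5 * brier_loss(p, False)

grid = [i / 300 for i in range(301)]
best_per_awakening = min(grid, key=expected_loss_per_awakening)
best_per_experiment = min(grid, key=expected_loss_per_experiment)
print(best_per_awakening, best_per_experiment)
```

Under these assumptions the per-awakening optimum lands at 1/3 and the per-experiment optimum at 1/2, matching the two cases above; a different payout rule could shift the details but not the per-awakening vs per-world distinction.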

1Radford Neal9mo

That argument just shows that, in the second betting scenario, Beauty should say that her probability of Heads is 1/2. It doesn't show that Beauty's actual internal probability of Heads should be 1/2. She's incentivized to lie. EDIT: Actually, on considering further, Beauty probably should not say that her probability of Heads is 1/2. She should probably use a randomized strategy, picking what she says from some distribution (independently for each wakening). The distribution to use would depend on the details of what the bet/bets is/are.

sarahconstantin's Shortform

Tao Lin9mo75

I disagree a lot! Many things have gotten better! Are suffrage, abolition, democracy, property rights etc not significant? All the random stuff that e.g. Better Angels of Our Nature claims has gotten better.

Either things have improved in the past or they haven't, and either people trying to "steer the future" in some sense have been influential on these improvements or they haven't. I think things have improved, and I think there's definitely not strong evidence that people trying to steer the future was always useless. Because trying to steer the future is very important and mo...

3sarahconstantin9mo

"Let's abolish slavery," when proposed, would make the world better now as well as later. I'm not against trying to make things better! I'm against doing things that are strongly bad for present-day people to increase the odds of long-run human species survival.

Wei Dai's Shortform

Tao Lin9mo30

Do these options have a chance to default / are the sellers stable enough?

2ESRogs9mo

Default seems unlikely, unless the market moves very quickly, since anyone pursuing this strategy is likely to be very small compared to the market for the S&P 500. (Also consider that these pay out in a scenario where the world gets much richer — in contrast to e.g. Michael Burry's "Big Short" swaps, which paid out in a scenario where the market was way down — so you're just skimming a little off the huge profits that others are making, rather than trying to get them to pay you at the same time they're realizing other losses.)

What are the best arguments for/against AIs being "slightly 'nice'"?

Tao Lin9mo32

A core part of Paul's arguments is that having 1/million of your values towards humans only applies a minute amount of selection pressure against you. It could be that coordinating causes less kindness because without coordination it's more likely some fraction of agents have small vestigial values that never got selected against or intentionally removed

The case for a negative alignment tax

Tao Lin10mo1913

to me "alignment tax" usually only refers to alignment methods that don't cost-effectively increase capabilities, so if 90% of alignment methods did cost effectively increase capabilities but 10% did not, i would still say there was an "alignment tax", just ignore the negatives.

Also, it's important to consider cost-effective capabilities rather than raw capabilities - if a lab knows of a way to increase capabilities more cost-effectively than alignment, using that money for alignment is a positive alignment tax

Judd Rosenblatt10mo100

I think this risks getting into a definitions dispute about what concept the words ‘alignment tax’ should point at. Even if one grants the point about resource allocation being inherently zero-sum, our whole claim here is that some alignment techniques might indeed be the most cost-effective way to improve certain capabilities and that these techniques seem worth pursuing for that very reason.

Proveably Safe Self Driving Cars [Modulo Assumptions]

Tao Lin10mo10

there's steganography, you'd need to limit total bits not accounted for by the gating system or something to remove them

4Davidmanheim9mo

I partly disagree; steganography is only useful when it's possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well. That said, I'd be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar - it definitely seems like something that @davidad would want to consider.

Proveably Safe Self Driving Cars [Modulo Assumptions]

Tao Lin10mo10

Yes, in some cases a much weaker (because it's constrained to be provable) system can restrict the main AI, but in the case of LLM jailbreaks there is no particular hope that such a guard system could work (e.g. jailbreaks where the LLM answers in base64 require the guard to understand base64 and any other code the main AI could use).
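
To make the base64 point concrete, here is a minimal sketch. It assumes a deliberately naive guard that only does keyword matching; `BLOCKED_TERMS`, `naive_guard`, and `encode_base64` are hypothetical names invented for this illustration, not any real guard system or proposal from the post.

```python
import base64

# Hypothetical blocklist for illustration; a real guard would be far richer.
BLOCKED_TERMS = ("forbidden recipe",)


def naive_guard(output: str) -> bool:
    """Return True if the output passes the keyword filter (assumed guard)."""
    lowered = output.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def encode_base64(text: str) -> str:
    """Encode text the way a jailbroken model might be asked to answer."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


plain = "Here is the forbidden recipe."
encoded = encode_base64(plain)

# The plain answer is caught, but the same answer in base64 passes,
# because the guard does not understand the encoding.
plain_passes = naive_guard(plain)
encoded_passes = naive_guard(encoded)

# The receiver can still recover the original text from the encoded output.
recovered = base64.b64decode(encoded).decode("utf-8")
```

The guard would have to decode base64 (and every other encoding the main model could use) before checking, which is the difficulty the comment describes. A gate that rejects anything it cannot parse takes the opposite approach and would block the encoded output outright.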

2Davidmanheim10mo

I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that's not true for proposals like safeguarded AI, where a proof must accompany the output, and it's not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.

In Defense of Open-Minded UDT

Tao Lin10moΩ110

Interesting, this actually changed my mind, to the extent I had any beliefs about this already. I can see why you would want to update your prior, but the iterated mugging doesn't seem like the right type of thing to cause you to update. My intuition is to pay all the single-coinflip muggings. For the digit-of-pi muggings, I want to consider how different this universe would be if the digit of pi were different. Even though both options are subjectively equally likely to me, one would be inconsistent with other observations, or less likely, or have something wrong with it, so I lean toward never paying it.

2abramdemski10mo

Yeah, in hindsight I realize that my iterated mugging scenario only communicates the intuition to people who already have it. The Lizard World example seems more motivating.

The Pragmascope Idea

Tao Lin10mo30

Train two nets, with different architectures (both capable of achieving zero training loss and good performance on the test set), on the same data.
...
Conceptually, this sort of experiment is intended to take all the stuff one network learned, and compare it to all the stuff the other network learned. It wouldn’t yield a full pragmascope, because it wouldn’t say anything about how to factor all the stuff a network learns into individual concepts, but it would give a very well-grounded starting point for translating stuff-in-one-net into stuff-in-another-net

Tao Lin10mo61

Yeah, I agree the movie has to be very high quality to work. This is a long shot, although the best rationalist novels are actually high quality, which gives me some hope that someone could write a great novel/movie outline that's more targeted at plausible ASI scenarios.

Please stop using mediocre AI art in your posts

Tao Lin10mo10

it's sad that open source models like Flux have a lot of potential for customized workflows and finetuning but few people use them

5Raemon10mo

We've talked (a little) about integrating Flux more into LW, to make it easier to make good images (maybe with a soft nudge towards using "LessWrong watercolor style" by default if you don't specify something else). Although something habryka brought up is that a lot of people's images seem to be coming from Substack, which has its own (bad) version of it.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tao Lin10mo41

Yeah. One trajectory could be that someone in-community-ish writes an extremely good novel about a very realistic ASI scenario with the intention that it be adaptable into a movie, it becomes moderately popular, and it's accessible and pointed enough to do most of the guidance for the movie. I don't know exactly who could write this book; there are a few possibilities.

... Wait, our models of semantics should inform fluid mechanics?!?

Tao Lin10mo42

Another way this might fail is if fluid dynamics is too complex/difficult for you to constructively argue that your semantics are useful in fluid dynamics. As an analogy, if you wanted to show that your semantics were useful for proving Fermat's Last Theorem, you would likely fail because you simply didn't apply enough power to the problem, and I think you may fail that way in fluid dynamics.

6Thane Ruthenis10mo

I'd expect that if the natural-abstractions theory gets to the point where it's theoretically applicable to fluid dynamics, then demonstrating said applicability would just be a matter of devoting some amount of raw compute to the task; it wouldn't be bottlenecked on human cognitive resources. You'd be able to do things like setting up a large-scale fluid simulation, pointing the pragmascope at it, and seeing it derive natural abstractions that match the abstractions human scientists and engineers derived for modeling fluids. And in the case of fluids specifically, I expect you wouldn't need that much compute. (Pure mathematical domains might end up a different matter. Roughly speaking, because of the vast gulf of computational complexity between solving some problems approximately (BPP) vs. exactly. "Deriving approximately-correct abstractions for fluids" maps to the former, "deriving exact mathematical abstractions" to the latter.)

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tao Lin10mo143

Great post!

I'm most optimistic about "feel the ASI" interventions to improve this. I think once people understand the scale and gravity of ASI, they will behave much more sensibly here. The thing I intuitively feel most optimistic about (without really analyzing it) is movies or generally very high quality mass appeal art.

habryka10mo122

I think better AGI depiction in movies and novels also seems to me like a pretty good intervention. I do think these kinds of things are very hard to steer on purpose (I remember some Gwern analysis somewhere on the difficulty of getting someone to create any kind of high-profile media on a topic you care about, maybe in the context of Hollywood).