The way I'm using "sensitivity": sensitivity to X = the meaningfulness of X spurs responsive caring action.
I'm fine with that, although it seems important to have a separate term for the more limited sense of sensitivity so we can keep track of that distinction: maybe adaptability?
...One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean
I agree with you that there are a lot of interesting ideas here, but I would like to see the core arguments laid out more clearly.
Lots of interesting ideas here, but the connection to alignment still seems a bit vague.
Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI is extremely sensitive to context, just in the service of its own goals.
Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
This is one of those things that sounds nice on the surface, but where it's important to dive deeper and really probe to see if it holds up.
The real question for me seems to be whether organic alignment will lead to agents deeply adopting co-operative values rather than merely instrumentally adopting them. Well, actually it's a comparison between how deep organic alignment is vs. how deep traditional alignment is. And it's not at all clear to me why they think their approach is likely to lead to a deeper alignment.
I have two (extremely speculative) guesses...
I basically agree with this, though I'd perhaps avoid virtue ethics. But yes, one of the main things I'd generally like to see is more LWers treating stuff like saving the world with the attitude you'd have in a job, perhaps at a startup or in a government body like America's Senate or House of Representatives, rather than viewing it as your heroic responsibility.
This is the right decision for most folk, but I expect the issue is more the opposite: we don't have enough folks treating this as their heroic responsibility.
I think both approaches have advantages.
The problem is that the Swiss cheese model and legislative efforts primarily just buy us time. We still need to be making progress towards a solution and whilst it's good for some folk to bet on us duct-taping our way through, I think we also want some folk attempting to work on things that are more principled.
Yeah, but how do you know that no one managed to sneak one past both you and the commenters?
Also, there's an art to this.
This seems to exist now.
Also, I did not realise that collapsible sections were a thing on Less Wrong. They seem really useful. I would like to see these promoted more.
They were in a kind of janky, half-finished state before (only usable in posts, not in comments, and only insertable from an icon in the toolbar rather than via the <details> syntax); writing this policy reminded us to polish it up.
I'd love to see occasional experiments where either completely LLM-generated or lightly edited LLM content is submitted to Less Wrong to see how people respond (with this fact being revealed afterwards). It would degrade the site if this happened too often, but I think it would make sense for moderators to occasionally grant permission for this.
I tried an experiment with Wittgenstein's Language Games and the Critique of the Natural Abstraction Hypothesis back in March 2023 and it actually received (some) upvotes. I wonder how this would go with modern LLMs, though ...
We get easily like 4-5 LLM-written post submissions a day these days. They are very evidently much worse than the non-LLM written submissions. We sometimes fail to catch one, and then people complain: https://www.lesswrong.com/posts/PHJ5NGKQwmAPEioZB/the-unearned-privilege-we-rarely-discuss-cognitive?commentId=tnFoenHqjGQw28FdY
However, if you merely explain these constraints to the chat models, they'll only follow your instructions sporadically.
I wonder if a custom fine-tuned model could get around this. Did you try few-shot prompting (i.e. examples, not just a description)?
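For clarity, here's a minimal sketch of what I mean by few-shot prompting: the compliant examples are supplied as prior conversation turns rather than merely described. This assumes the OpenAI Python client; the model name, system prompt, and example texts are placeholders I've made up.

```python
# Minimal few-shot prompting sketch: show the constraints via examples, not just a description.
# Assumes the OpenAI Python client is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Follow the style constraints demonstrated in the examples below."},
    # Few-shot examples: each user/assistant pair shows an input and a compliant output.
    {"role": "user", "content": "Draft a short comment about topic A."},
    {"role": "assistant", "content": "<an example comment that satisfies the constraints>"},
    {"role": "user", "content": "Draft a short comment about topic B."},
    {"role": "assistant", "content": "<another compliant example>"},
    # The actual request comes last, so the model imitates the pattern above.
    {"role": "user", "content": "Draft a short comment about topic C."},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```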
I've written up a short-form argument for focusing on Wise AI advisors. I'll note that my perspective is different from that taken in the paper: I'm primarily interested in AI as advisors, whilst the authors focus more on AI acting directly in the world.
Wisdom here is an aid to fulfilling your values, not a definition of those values
I agree that this doesn't provide a definition of these values. Wise AI advisors could be helpful for figuring out your values, much like how a wise human would be helpful for this.
Other examples include buying poor-quality food and then having to pay for medical care, buying a cheap car that costs more in repairs, payday loans, etc.
Unless you insist that this system is helpful to those holding power and privilege, such as a king, as a reference point for public opinion; would that be legitimate?
That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want the AI to hack a computer to turn the entire screen green, and it skips a pixel so as to avoid completing the task, it has still demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see whether it has a capability that you wish it to use, it can still sandbag.
I'd strongly recommend spending some time in the Bay Area (or London as a second-best option). Spending time there will help you build your model of the space.
You may also find this document I created on AI Safety & Entrepreneurship useful.
One of the biggest challenges here is that subsidies designed to support alignment could be snagged by AI companies misrepresenting capabilities work as safety work. Do you think the government has the ability to differentiate between these?
Become a member of LessWrong or the AI Alignment Forum
I think the goal is for the alignment forum to be somewhat selective in terms of who can comment.
(Removed some of my comments b/c I just noticed the clarification that you meant the average member of the EA Forum/Less Wrong. I would suggest changing the title of your post, though).
For the record, I see the new field of "economics of transformative AI" as overrated.
Economics has some useful frames, but it also tilts people towards being too "normy" on the impacts of AI and it doesn't have a very good track record on advanced AI so far.
I'd much rather see multidisciplinary programs/conferences/research projects, including economics as just one of the perspectives represented, than economics of transformative AI qua economics of transformative AI.
(I'd be more enthusiastic about building economics of transformative AI as a field if we w...
Points for creativity, though I'm still somewhat skeptical about the viability of this strategy.
My intuition would be that models learn to implement more general templates as well.
It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
It's not clear to me that personality is completely separate from capabilities, especially with inference time reasoning.
Also, what do you mean by "bigger templates"?
I wonder about the extent to which having an additional level of selection helps.
High school curricula are generally limited by needing to be teachable by a large number of teachers all around the country and by needing a minimum number of students at each school who can handle the content.
If the préparatoires can put more qualified teachers and students together, that would allow significant development, and running selection for elite universities after such an intermediate preparatory program would reduce the chance that talented students are...
Here's a short-form with my Wise AI advisors research direction: https://www.lesswrong.com/posts/SbAofYCgKkaXReDy4/chris_leong-s-shortform?view=postCommentsNew&postId=SbAofYCgKkaXReDy4&commentId=Zcg9idTyY5rKMtYwo
(I already posted this on the Less Wrong post).
I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?
First of all, it tackles one of the main core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”)
I wouldn't quite go so far as to say it "tackles" the problem of outer alignment, but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I'm misunderstanding you?
I suspect that your post probably isn't going to be very legible to the majority of folks on Less Wrong, since you're assuming familiarity with meta-modernism. To be honest, I suspect this post would have been more persuasive if you had avoided mentioning it, since the majority of folks here are likely skeptical of it and it hardly seems essential for making what seems to be the core point of your post[1]. Sometimes less is more. Things cut out can always be explored in the future, when you have the time to explain them in a way that will be legible to you...
Interesting idea. Will be interesting to see if this works out.
Lenses are... tabs. Opinionated tabs
Could you explain the intended use further?
The central problem of any wiki system is [1]"what edits do you accept to a wiki page?". The lenses system is trying to provide a better answer to that question.
My default experience on e.g. Wikipedia when I am on pages where I am highly familiar with the domain is "man, I could write a much better page". But writing a whole better page is a lot of effort, and the default consequence of rewriting the page is that the editor who wrote the previous page advocates for your edits to be reverted, because they are attached to their version of the page. ...
My take: Counterfactuals are Confusing because of an Ontological Shift:
"In our naive ontology, when we are faced with a decision, we conceive of ourselves as having free will in the sense of there being multiple choices that we could actually take. These choices are conceived of as actual and we when think about the notion of the "best possible choice" we see ourselves as comparing actual possible ways that the world could be. However, we when start investigating the nature of the universe, we realise that it is essentially deterministic and hence that our...
Well, we're going to be training AI anyway. If we're just training capabilities, but not wisdom, I think things are unlikely to go well. More thoughts on this here.
I believe that Anthropic should be investigating artificial wisdom:
I've summarised a paper arguing for the importance of artificial wisdom, with Yoshua Bengio as one of the authors.
I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.
By Wise AI Advisors, I mean training an AI to provide wise advice. BTW, I've now added a link to a short-form post in my original comment where I detail the argument for wise AI advisors further.
Props for proposing a new and potentially fruitful framing.
I would like to propose training Wise AI Advisors as something that could potentially meet your two criteria:
• Even if AI is pretty much positive, wise AI advisors would allow us to get closer to maximising these benefits
• We can likely save the world if we make sufficiently wise decisions[1]
There's also a chance that we're past the point of no return, but if that's the case, we're screwed no matter what we do. Okay, it's slightly more complicated because there's a chance that we aren't yet past the
Why the focus on wise AI advisors?[1]
I'll be writing up a proper post to explain why I've pivoted towards this, but it will still take some time to produce a high-quality post, so I decided it was worthwhile releasing a short-form description in the meantime.
By Wise AI Advisors, I mean training an AI to provide wise advice.
a) AI will have a massive impact on society given the infinite ways to deploy such a general technology
b) There are lots of ways this could go well and lots of ways that this could go extremely poorly (election interference, cyber attac...
Thanks, this seems pretty good on a quick skim. I'm a bit less certain about the corrigibility section, and more issues might become apparent if I read through it more slowly.
How about "Please summarise Eliezer Yudkowsky's views on decision theory and its relevance to the alignment problem".
It did OK at control.
Nice article, I especially love the diagrams!
In Human Researcher Obsolescence you note that we can't completely hand over research unless we manage to produce agents that are at least as "wise" as the human developers.
I agree with this, though I would love to see a future version of this plan include an expanded analysis of the role that wise AI would play in the strategy of Magma, as I believe that this could be a key aspect of making this plan work.
In particular:
• We likely want to be developing wise AI advisors to advise us during the pre-hand-off...
Fellowships are typically only for a few months, and even if you're in India, you'd likely have to move for the fellowship unless it happened to be in your exact city.
Impact Academy was doing this, before they pivoted towards the Global AI Safety Fellowship. It's unclear whether any further fellowships should be in India or a country that is particularly generous with its visas.
I posted this comment on Jan's blog post:
Underelicitation assumes a "maximum elicitation" rather than a never-ending series of more and more layers of elicitation that could be discovered. You've undoubtedly spent much more time thinking about this than I have, but I'm worried that attempts to maximise elicitation merely accelerate capabilities without actually substantially boosting safety.
In terms of infrastructure, it would be really cool to have a website collecting the more legible alignment research (papers, releases from major labs or non-profits).
I think I saw someone arguing that their particular capability benchmark was good for evaluating the capability, but of limited use for training the capability because their task only covered a small fraction of that domain.
(Disclaimer: I previously interned at Non-Linear)
Different formats allow different levels of nuance. Memes aren't essays and they shouldn't try to be.
I personally think these memes are fine and that outreach is too. Maybe these posts oversimplify things a bit too much for you, but I expect that the average person on these subs probably improves the level of their thinking from seeing these memes.
If, for example, you think r/EffectiveAltruism should ban memes, then I recommend talking to the mods.
This seems to underrate the value of distribution. I suspect another factor to take into account is the degree of audience overlap. Like there's a lot of value in booking a guest who has been on a bunch of podcasts, so long as your particular audience isn't likely to have been exposed to them.