Paper: Forecasting world events with neural nets
Paper authors: Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

This post contains the paper's abstract and excerpts from the paper (with slight modifications).

Paper Abstract

Forecasts of climate, geopolitical conflict, pandemics, and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Can these forecasts be automated? To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying 200GB news corpus. Questions are taken from forecasting tournaments (including Metaculus), ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future). Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g. global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration. We test language models on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and incorporation of relevant information from the news corpus.

An example question from the Autocast dataset (taken from Good Judgment Open). The dataset simulates the task faced by human forecasters: the model must generate a forecast each day for a series of days (here 01-14-22 to 03-23-22), given access to the news on that day but not after it. The model therefore makes retrodictions without "cheating" (i.e. without access to future information).

Paper Introduction

Forecasting plays a crucial role in the modern world. Climate forecasts shape the policies of governments and companies. Economic forecasts influence investment and employment. In 2020, forecasts about the spread of COVID-
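To make the retrodiction setup above concrete, here is a minimal sketch of the daily forecasting loop: for each day in a question's open window, the forecaster is shown only news published on or before that day, so no information from the future can leak in. This is an illustration under assumed names (Article, Question, forecast_one_day, and the field names on them); it is not the paper's actual code, data format, or retrieval pipeline.

```python
# Minimal sketch (not the paper's code) of leakage-free daily retrodiction:
# each day's forecast may only condition on news dated on or before that day.
# All class and field names here are illustrative assumptions.

from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Dict, List


@dataclass
class Article:
    publish_date: date
    text: str


@dataclass
class Question:
    text: str
    open_date: date    # first day a forecast is required (e.g. 2022-01-14)
    close_date: date   # last day a forecast is required (e.g. 2022-03-23)


def visible_news(corpus: List[Article], today: date) -> List[Article]:
    """Return only articles published on or before `today` (no future leakage)."""
    return [a for a in corpus if a.publish_date <= today]


def simulate_forecasts(
    question: Question,
    corpus: List[Article],
    forecast_one_day: Callable[[str, List[Article]], float],
) -> Dict[date, float]:
    """Produce one forecast per day, each conditioned only on past news."""
    forecasts: Dict[date, float] = {}
    today = question.open_date
    while today <= question.close_date:
        news = visible_news(corpus, today)
        forecasts[today] = forecast_one_day(question.text, news)
        today += timedelta(days=1)
    return forecasts


if __name__ == "__main__":
    corpus = [
        Article(date(2022, 1, 10), "Early reporting relevant to the question."),
        Article(date(2022, 2, 20), "Later development that shifts the odds."),
    ]
    q = Question("Will event X happen by 2022-03-23?",
                 date(2022, 1, 14), date(2022, 3, 23))

    # Placeholder forecaster: a real system would retrieve relevant articles
    # from the news corpus and query a language model; here we just count
    # how many articles are visible on a given day.
    dummy = lambda _text, news: min(0.9, 0.1 + 0.2 * len(news))
    daily = simulate_forecasts(q, corpus, dummy)
    print(daily[date(2022, 1, 14)], daily[date(2022, 3, 23)])
```

In the paper's actual setup, the per-day forecaster would retrieve relevant articles from the 200GB date-organized corpus and condition a language model on them; the dummy forecaster above merely stands in for that step.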