Best of LessWrong 2022

Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits. 
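
The blurb above doesn't spell out the mechanics, so here is a minimal, self-contained sketch of the core replace-and-measure loop on a toy untrained network (the ToyModel, the choice to scrub layer1's output, and the loss-recovered formula here are illustrative assumptions; real causal scrubbing resamples activations according to a rewritten computational graph rather than a single hard-coded layer):

```python
import torch
import torch.nn as nn

# Toy stand-in: the "hypothesis" here is simply "layer1's output doesn't matter",
# and we scrub it by overwriting it with its value on resampled inputs.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

def loss_on(model, x, y, patch=None):
    """Cross-entropy loss, optionally overwriting layer1's output with `patch`."""
    handle = None
    if patch is not None:
        # A forward hook that returns a value replaces the module's output.
        handle = model.layer1.register_forward_hook(lambda mod, inp, out: patch)
    loss = nn.functional.cross_entropy(model(x), y)
    if handle is not None:
        handle.remove()
    return loss.item()

model = ToyModel()
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))
x_resample = torch.randn(32, 8)  # unrelated inputs used for resampling

with torch.no_grad():
    clean = loss_on(model, x, y)
    patch = model.layer1(x_resample)            # activation recorded on resampled inputs
    scrubbed = loss_on(model, x, y, patch=patch)
    corrupted = loss_on(model, x_resample, y)   # baseline: replace the whole input

# "Loss recovered": ~1 means the scrub was harmless (consistent with the hypothesis),
# ~0 means scrubbing hurt about as much as swapping the entire input.
loss_recovered = (corrupted - scrubbed) / (corrupted - clean + 1e-9)
print(f"clean={clean:.3f} scrubbed={scrubbed:.3f} corrupted={corrupted:.3f} recovered={loss_recovered:.3f}")
```

On an untrained toy model the numbers are meaningless; the point is only the mechanics of replace-then-measure.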

Dean Ball is, among other things, a prominent critic of SB-1047. I meanwhile publicly supported it. But we both talked and it turns out we have a lot of common ground, especially re: the importance of transparency in frontier AI development. So we coauthored this op-ed in TIME: 4 Ways to Advance Transparency in Frontier AI Development. (tweet thread summary here)
leogao
in research, if you settle into a particular niche you can churn out papers much faster, because you can develop a very streamlined process for that particular kind of paper. you have the advantage of already working baseline code, context on the field, and a knowledge of the easiest way to get enough results to have an acceptable paper. while these efficiency benefits of staying in a certain niche are certainly real, I think a lot of people end up in this position because of academic incentives - if your career depends on publishing lots of papers, then a recipe to get lots of easy papers with low risk is great. it's also great for the careers of your students, because if you hand down your streamlined process, then they can get a phd faster and more reliably. however, I claim that this also reduces scientific value, and especially the probability of a really big breakthrough. big scientific advances require people to do risky bets that might not work out, and often the work doesn't look quite like anything anyone has done before. as you get closer to the frontier of things that have ever been done, the road gets tougher and tougher. you end up spending more time building basic infrastructure. you explore lots of dead ends and spend lots of time pivoting to new directions that seem more promising. you genuinely don't know when you'll have the result that you'll build your paper on top of. so for people who are not beholden as strongly to academic incentives, it might make sense to think carefully about the tradeoff between efficiency and exploration. (not sure I 100% endorse this, but it is a hypothesis worth considering)
leogao
it's surprising just how much of cutting edge research (at least in ML) is dealing with really annoying and stupid bottlenecks. pesky details that seem like they shouldn't need attention. tools that in a good and just world would simply not break all the time. i used to assume this was merely because i was inexperienced, and that surely eventually you learn to fix all the stupid problems, and then afterwards you can just spend all your time doing actual real research without constantly needing to context switch to fix stupid things.  however, i've started to think that as long as you're pushing yourself to do novel, cutting edge research (as opposed to carving out a niche and churning out formulaic papers), you will always spend most of your time fixing random stupid things. as you get more experienced, you get bigger things done faster, but the amount of stupidity is conserved. as they say in running- it doesn't get easier, you just get faster. as a beginner, you might spend a large part of your research time trying to install CUDA or fighting with python threading. as an experienced researcher, you might spend that time instead diving deep into some complicated distributed training code to fix a deadlock or debugging where some numerical issue is causing a NaN halfway through training. i think this is important to recognize because you're much more likely to resolve these issues if you approach them with the right mindset. when you think of something as a core part of your job, you're more likely to engage your problem solving skills fully to try and find a resolution. on the other hand, if something feels like a brief intrusion into your job, you're more likely to just hit it with a wrench until the problem goes away so you can actually focus on your job. in ML research the hit it with a wrench strategy is the classic "google the error message and then run whatever command comes up" loop. to be clear, this is not a bad strategy when deployed properly - this
It's a small but positive sign that Anthropic treats taking 3 days beyond its RSP's specified timeframe to conduct a process, without a formal exception, as an issue. It signals that at least some members of the team there are extremely attuned to normalization-of-deviance concerns.
It'd be nice if LLM providers offered different 'flavors' of their LLMs. Prompting with a meta-request (system prompt) to act as an analytical scientist rather than an obsequious servant helps, but only partially: I find that longer conversations tend to see the LLM lapsing back into its default habits, becoming increasingly sycophantic and obsequious and requiring me to re-prompt it to be more objective and rational. I imagine that a proper fine-tuned-from-base-model attempt at creating a fundamentally different personality would give a more satisfyingly coherent and stable result. This seems like it would be a relatively cheap product variation for the LLM companies to produce.
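
For what it's worth, the "meta-request" approach looks something like the sketch below (using the OpenAI Python SDK as one example; the persona wording and the `ask` helper are illustrative, not a recommended recipe):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative "analytical scientist" persona; the wording here is made up for this example.
SYSTEM_PROMPT = (
    "You are a careful analytical scientist. Prioritize accuracy over agreeableness: "
    "state uncertainty explicitly, point out flaws in the user's reasoning, and avoid "
    "flattery or filler."
)

def ask(question: str, history: list | None = None) -> str:
    """Send one user turn with the persona prepended as the system message."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + (history or [])
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    return reply.choices[0].message.content

print(ask("Critique my plan to learn linear algebra in a weekend."))
```

As the comment notes, this only partially works: the system message stays in context, but long conversations still tend to drift back toward the default persona.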

Popular Comments

Recent Discussion

Short Summary

LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.

Longer summary

There is ML research suggesting that LLMs fail badly at general reasoning, such as planning problems, scheduling, and novel visual puzzles. This post provides a brief introduction to that research, and asks:

  • Whether this limitation is illusory or actually exists.
  • If it exists, whether it will be solved by scaling or is a problem fundamental to LLMs.
  • If fundamental, whether it can be overcome by scaffolding & tooling.

If this is a real and fundamental limitation that can't be fully overcome by scaffolding, we should be skeptical of arguments like Leopold Aschenbrenner's (in his recent 'Situational Awareness') that we can just 'follow straight lines on graphs' and expect AGI...

Thanks for the lengthy and thoughtful reply!

I'm planning to make a LW post soon asking for more input on this experiment -- one of my goals here is to make this experiment one that both sides of the debate agree in advance would provide good evidence. I'd love to get your input there as well if you're so moved!

I can tell you that current AI isn't intelligent, but as for what would prove intelligence, I've been thinking about it for a while and I really don't have much.

I tend not to think of intelligence as a boolean property, but of an entity having some l... (read more)

Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy.

I haven't yet formed an opinion on the key questions, including whether the thresholds and mitigations are reasonable and adequate. I'll edit this post later.

Anthropic's first update to its RSP is here at last.

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures

...

I agree on abandoning various specifics, but I would note that the new standard is much more specific (less vague) about what needs to be defended against and about what the validation and threat-modeling processes should be.

(E.g., rather than "non-state actors", the RSP more specifically says which groups are and aren't in scope.)

I overall think the new proposal is notably less vague on the most important aspects, though I agree it won't pass the LeCun test due to insufficiently precise guidance around auditing. Hopefully this can be improved in future versions or for future ASLs.

Rohin Shah
I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in the most recent round of evals. (I think this is very reasonable, but I do think it means you can't quite say "we will do a comprehensive assessment at least every 6 months".) There's also the point that Zach makes below that "routinely" isn't specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.
Zach Stein-Perlman
(I agree that the intention is surely no more than 6 months; I'm mostly annoyed for legibility—things like this make it harder for me to say "Anthropic has clearly committed to X" for lab-comparison purposes—and LeCun-test reasons)
Zach Stein-Perlman
Thanks.

1. I disagree, e.g. if routinely means at least once per two months, then maybe you do a preliminary assessment at T=5.5 months and then don't do the next until T=7.5 months, and so don't do a comprehensive assessment for over 6 months.
   Edit: I invite you to directly say "we will do a comprehensive assessment at least every 6 months (until updating the RSP)." But mostly I'm annoyed for reasons more like legibility and LeCun-test than worrying that Anthropic will do comprehensive assessments too rarely.
2. I know this is what's going on in y'all's heads, but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.

Alright, I have a question stemming from TurnTrout's post Reward is not the optimization target, where he argues that the premises required to reach the conclusion that reward is the optimization target are so narrowly applicable that they won't apply to future RL AIs as they gain more and more power:

https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#When_is_reward_the_optimization_target_of_the_agent_

But @gwern argued with TurnTrout that reward is in fact the optimization target for a broad range of RL algorithms:

https://www.lesswrong.com/posts/ttmmKDTkzuum3fftG/#sdCdLw3ggRxYik385

https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world#Tdo7S62iaYwfBCFxL

So my question is: are there known results (ideally proofs, though I can accept empirical studies if necessary) that show when RL algorithms treat the reward function as an optimization target?

And how narrow is the space of RL algorithms that don't optimize for the reward function?

A good answer will link to results known in...

Answer by Seth Herd

I love this question! As it happens, I have a rough draft for a post titled something like "reward is the optimization target for smart RL agents". TLDR: I think this is true for some AI systems, but not likely true for any RL-directed AGI systems whose safety we should really worry about. They'll optimize for maximum reward even more than humans do, unless they're very carefully built to avoid that behavior.

In the final comment on the second thread you linked, TurnTrout says of his Reward is not the optimization target: Humans are definitely model-based RL learners at least some of the time - particularly for important decisions.[1] So the claim doesn't apply to them. I also don't think it applies to any other capable agent.

TurnTrout actually makes a congruent claim in his other post Think carefully before calling RL policies "agents". Model-free RL algorithms only have limited agency, what I'd call level 1-of-3:

1. Trained to achieve some goal/reward.
   - Habitual behavior/model-free RL
2. Predicts outcomes of actions and selects ones that achieve a goal/reward.
   - Model-based RL
3. Selects future states that achieve a goal/reward and then plans actions to achieve that state.
   - No corresponding terminology ("goal-directed" from neuroscience applies to levels 2 and even 1[1]), but pretty clearly highly useful for humans.

That's from my post Steering subsystems: capabilities, agency, and alignment.

But humans don't seem to optimize for reward all that often! They make self-sacrificial decisions that get them killed. And they usually say they'd refuse to get in Nozick's experience machine, which would hypothetically remove them from this world and give them a simulated world of maximally-rewarding experiences. They seem to be optimizing for the things that have given them reward, like protecting loved ones, rather than optimizing for reward itself - just like TurnTrout describes in RINTOT. And humans are model-based for important decisions... (read more)
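
To make the level-1 vs. level-2 distinction concrete, here is a toy sketch (my construction, not code from Seth Herd's or TurnTrout's posts; the arrays are random placeholders standing in for learned quantities):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2

# Level 1 (habitual / model-free): act from cached action values; no prediction of outcomes.
Q = rng.normal(size=(n_states, n_actions))  # stand-in for values learned by e.g. Q-learning

def act_model_free(state: int) -> int:
    return int(np.argmax(Q[state]))  # "do what has been rewarded before"

# Level 2 (model-based): predict where each action leads and what follows, then choose.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # learned transition model
R = rng.normal(size=(n_states, n_actions))                        # learned reward model
V = rng.normal(size=n_states)                                     # learned value of successor states

def act_model_based(state: int) -> int:
    predicted_return = R[state] + P[state] @ V  # expected immediate reward + value of where you land
    return int(np.argmax(predicted_return))

print(act_model_free(0), act_model_based(0))
```

Level 3 would instead pick a desirable future state first and only then plan a sequence of actions to reach it, which neither function above does.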

I'd also accept neuroscience RL literature, as well as theories that would make useful predictions or give conditions on when RL algorithms optimize for the reward, not just empirical results.

At any rate, I'd like to see your post soon.

As part of our Summer 2024 Program, MATS ran a series of discussion groups focused on questions and topics we believe are relevant to prioritizing research into AI safety. Each weekly session focused on one overarching question, and was accompanied by readings and suggested discussion questions. The purpose of running these discussions was to increase scholars’ knowledge about the AI safety ecosystem and models of how AI could cause a catastrophe, and hone scholars’ ability to think critically about threat models—ultimately, in service of helping scholars become excellent researchers.

The readings and questions were largely based on the curriculum from the Winter 2023-24 Program, with two changes:

  • We reduced the number of weeks, since in the previous cohort scholars found it harder to devote time to discussion groups later
...

A thing you are maybe missing is that the discussion groups are now in the past.

You should be sure to point out that many of the readings are dumb and wrong.

The hope is that the scholars notice this on their own.

Week 3 title should maybe say “How could we safely train AIs…”? I think there are other training options if you don’t care about safety.

Lol nice catch.

DanielFilan
We included a summary of Situational Awareness as an optional reading! I guess I thought the full thing was a bit too long to ask people to read. Thanks for the other recs!

A few months ago, I wrote a post about using slower computing substrates as a possibly new way to safely train and align ASI.

If you haven't read that post, the basic idea is that if we consider compute speed as a factor in Total Intelligence (alongside, say, quality of intelligence), then it should be possible to lower compute speed in order to lower Total Intelligence while keeping quality the same.

An intuition pump is to imagine a scenario where we are able to slow down Einstein's brain, by slowing actual biochemical and electrical processes, so that it produces the Theory of Relativity in 40 years instead of 10.
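
Under the simplest multiplicative reading of that idea (my formalization; the original post may not state it as an equation), the intuition pump is just:

```latex
\[
  I_{\text{total}} \;\approx\; q \times s ,
  \qquad
  q \text{ fixed},\; s \to \tfrac{s}{4}
  \;\Longrightarrow\;
  t_{\text{Relativity}} : 10 \text{ years} \to 40 \text{ years}.
\]
```

That is, quality q is held fixed and only the speed factor s is scaled down.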

The obvious reason to do this would be to gain some degree of controllability so that...

Ben
I agree that it's super unlikely to make any difference; if the LLM player is consistently building pylons in order to build assimilators, that is a weakness at every level of slowdown, so it has little or no implication for your results.
gwern
It sounds like SC2 might just be a bad testbed here. You should not have to be dealing with issues like "but can I get a computer fast enough to run it at a fast enough speedup" - that's just silly and a big waste of your effort. Before you sink any more costs into shaving those and other yaks, it's time to look for POMDPs which at least can be paused & resumed appropriately and have sane tooling, or better yet, have continuous actions/time so you can examine arbitrary ratios.

Also, I should have probably pointed out that one issue with using LLMs you aren't training from scratch is that you have to deal with the changing action ratios pushing the agents increasingly off-policy. The fact that they are not trained or drawing from other tasks with similarly varying time ratios means that the worsening performance with worsening ratio is partially illusory: the slower player could play better than it does, it just doesn't know how, because it was trained on other ratios. The kind of play one would engage in at 1:1 is different from the kind of play one would do at 10:1, or 1:10; eg a faster agent will micro the heck out of SC, while a slow agent will probably try to rely much more on automated base defenses which attack in realtime without orders and emphasize economy & grand strategy, that sort of thing. (This was also an issue with the chess hobbling experiments: Stockfish is going to do very badly when hobbled enough, like removing its queen, because it was never trained on such bizarre impossible scenarios / alternate rulesets.)

Which is bad if you are using this as some sort of AI safety argument, because it will systematically deceive you, based on the hobbled off-policy agents, into thinking slowed-down agents are less capable (ie. safer) in general than they really are. This is another reason to not use SC2 or try to rely on transfer from a pre-existing model, convenient as the latter may be. Given both these issues, you should probably think about instead tr...
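
For what it's worth, the "examine arbitrary ratios" suggestion could look something like the sketch below in a pausable, turn-based environment (this is my construction, and the env.reset()/env.step() interface is an assumed stand-in, not an existing API):

```python
# Rough sketch of a controllable time-ratio harness: a turn-based two-player environment
# in which the slow player only picks a fresh action every `ratio` steps and otherwise
# repeats its last one (or a no-op before its first decision).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SlowedPlayer:
    policy: Callable[[Any], Any]  # maps observation -> action
    ratio: int                    # chooses a new action once every `ratio` env steps
    noop: Any                     # action used before its first real decision
    _last_action: Any = None

    def act(self, obs: Any, step: int) -> Any:
        if step % self.ratio == 0:
            self._last_action = self.policy(obs)
        return self._last_action if self._last_action is not None else self.noop

def run_match(env, fast_policy: Callable[[Any], Any], slow: SlowedPlayer, max_steps: int = 1000):
    """`env` is assumed to expose reset()/step() in a simple two-player style.
    Sweep slow.ratio over 1, 2, 5, 10, ... to examine different time ratios directly,
    instead of fighting a real-time engine like SC2."""
    obs_fast, obs_slow = env.reset()
    for step in range(max_steps):
        a_fast = fast_policy(obs_fast)
        a_slow = slow.act(obs_slow, step)
        (obs_fast, obs_slow), rewards, done = env.step(a_fast, a_slow)
        if done:
            return rewards
    return None
```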

Regarding the second issue (the point that the LLM may not know how to play at different time ratios) - I wonder whether this may be true for current LLM and DRL systems where inference happens in a single step (and where intelligence is derived mainly from training), BUT potentially not true for the next crop of reasoners, where a significant portion of intelligence is derived from the reasoning step that happens at inference time and which we are hoping to target with this scheme. One can imagine that sufficiently advanced AI, even if not explicitly trai... (read more)

Lester Leong
Thanks for the feedback. It would be great to learn more about your agenda and see if there are any areas where we may be able to help each other.
Xor

It seems that we can have intelligence without consciousness. We can have reasoning without agency, identity, or personal preference. We can have AI as a pure tool. In this case the most likely danger is AI being misused by an unaligned human.

I am highly certain that o1 does not have consciousness or agency. However, it does have the ability to follow a thought process.

Doubtless we will create sentient intelligence eventually. However, I think it is more likely that we will have a soulless superintelligence first.


5.1 Post summary / Table of contents

This is the 5th of a series of 8 blog posts, which I’m serializing weekly. (Or email or DM me if you want to read the whole thing right now.)

Dissociative Identity Disorder (DID) (previously known as “Multiple Personality Disorder”) involves a person having multiple “alters” (alternate identities), with different preferences and (in some cases) different names. A DID diagnosis also requires some nonzero amount of “inter-identity amnesia”, where an alter cannot recall events that occurred when a different alter was active. For example, DSM-V talks about patients “coming to” on a beach with no recollection of how they got there.

Anyway, just like trance in the previous post, DID was one of those things that I unthinkingly assumed was vaguely fictional for most of...

lsusr
[Content warning: Child abuse.] I met one person who claimed to have BPD, and who attributed it to childhood trauma. He had the most acute symptoms of traumatic abuse I have ever observed. For that and other reasons, I consider his report credible. Given his history, I think it is perfectly reasonable to conclude that childhood experiences directly caused BPD.

As I always say, we don't know the counterfactual, i.e. we don't know what kind of person he would have turned into in the counterfactual world where he hadn't been abused. Right?

[Usual caveats: I obviously don’t know the details, and I feel awful questioning people’s interpretation of their own lived experience, and child abuse is obviously utterly terrible independent of the question of exactly what effects it causes on psychology and personality much later in life.]

There's a popular story that goes like this: Christopher Hitchens used to be in favor of the US waterboarding terrorists because he thought it wasn't bad enough to be considered torture. Then he had it tried on himself, and changed his mind, coming to believe it is torture and should not be performed.

(Context for those unfamiliar: in the ~decade following 9/11, the US engaged in a lot of... questionable behavior to prosecute the war on terror, and there was a big debate on whether waterboarding should be permitted. Many other public figures also volunteered to undergo the procedure as a part of this public debate; most notably Sean Hannity, who was an outspoken proponent of waterboarding, yet welched on his offer and never tried it himself.)

This story...

alexey

Was this post significantly edited? Because this seems to be exactly the take in the post from the start:

because he thought it wasn't bad enough to be considered torture. Then he had it tried on himself, and changed his mind, coming to believe it is torture and should not be performed.

to the end

This is supported by Malcolm's claim that Hitchens was "a proponent of torture", which is clearly false going by Christopher's public articles on the subject. The question is only over whether Hitchens considered waterboarding to be a form of torture, and therefore permissible or not, which Malcolm seems to have not understood.