Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm replaces all the model activations that shouldn't matter according to a hypothesis and measures how much performance drops. It's been used to improve hypotheses about induction heads and parenthesis-balancing circuits.
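For intuition, here is a minimal sketch of the core loop in PyTorch (a toy version, not the authors' implementation: the real algorithm recursively rewrites the model's computational graph, whereas this just patches whole modules, and `irrelevant_modules` is a hypothetical stand-in for a hypothesis):

```python
# Toy causal-scrubbing-style check: resample the activations a hypothesis
# calls irrelevant, then see how much the loss degrades.
import torch

def scrubbed_loss(model, batch, other_batch, irrelevant_modules, loss_fn):
    # 1. Record replacement activations on an unrelated, random input.
    cache = {}
    def save_hook(name):
        def hook(module, inputs, output):
            cache[name] = output.detach()
        return hook
    handles = [m.register_forward_hook(save_hook(n))
               for n, m in model.named_modules() if n in irrelevant_modules]
    with torch.no_grad():
        model(other_batch)
    for h in handles:
        h.remove()

    # 2. Re-run on the real input, patching in the recorded activations.
    def patch_hook(name):
        def hook(module, inputs, output):
            return cache[name]  # returning a value overrides the module output
        return hook
    handles = [m.register_forward_hook(patch_hook(n))
               for n, m in model.named_modules() if n in irrelevant_modules]
    with torch.no_grad():
        scrubbed_output = model(batch)
    for h in handles:
        h.remove()

    # 3. A faithful hypothesis predicts little loss increase under scrubbing.
    return loss_fn(scrubbed_output, batch)
```

A hypothesis that captures the mechanism the model actually uses should show only a small gap between the clean loss and this scrubbed loss.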
On a call, I was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices:
I may have slipped into a word game... are we "training against the [interpretability] detection method" or are we "providing feedback away from one kind of algorithm and towards another"? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?
This is why we need empirics.
Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy.
I haven't yet formed an opinion on the key questions, including whether the thresholds and mitigations are reasonable and adequate. I'll edit this post later.
Anthropic's first update to its RSP is here at last.
...Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures...
LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.
There is ML research suggesting that LLMs fail badly at attempts at general reasoning, such as planning, scheduling, and solving novel visual puzzles. This post provides a brief introduction to that research, and asks:
If this is a real and fundamental limitation that can't be fully overcome by scaffolding, we should be skeptical of arguments like Leopold Aschenbrenner's (in his recent 'Situational Awareness') that we can just 'follow straight lines on graphs' and expect AGI...
Thanks for the lengthy and thoughtful reply!
I'm planning to make a LW post soon asking for more input on this experiment -- one of my goals here is to make this experiment one that both sides of the debate agree in advance would provide good evidence. I'd love to get your input there as well if you're so moved!
I can tell you that current AI isn't intelligent, but as for what would prove intelligence, I've been thinking about it for a while and I really don't have much.
I tend not to think of intelligence as a boolean property, but of an entity having some l...
Alright, I have a question stemming from TurnTrout's post "Reward is not the optimization target", where he argues that the premises required to conclude that reward is the optimization target are so narrowly applicable that they won't hold for future RL AIs as they gain more and more power:
But @gwern argued with TurnTrout that reward is in fact the optimization target for a broad range of RL algorithms:
https://www.lesswrong.com/posts/ttmmKDTkzuum3fftG/#sdCdLw3ggRxYik385
So my question is: are there known results, ideally proofs (though I can accept empirical studies if necessary), that show when RL algorithms treat the reward function as an optimization target?
And how narrow is the space of RL algorithms that don't optimize for the reward function?
A good answer will link to results known in...
I'd also accept neuroscience RL literature, as well as theories that make useful predictions or give conditions for when RL algorithms optimize for the reward, not just empirical results.
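To make the distinction I'm asking about concrete, here is a toy contrast in PyTorch (my own framing, with hypothetical `policy`, `world_model`, and `reward_model` objects; this code is in neither post):

```python
import torch

def reinforce_step(policy, optimizer, states, actions, rewards):
    # Reward-as-reinforcement: reward only scales the gradient on the
    # log-probabilities of the actions actually taken. The agent never
    # represents "reward" as a quantity it is trying to maximize.
    logp = policy(states).log_prob(actions)
    loss = -(logp * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def greedy_planner(world_model, reward_model, state, candidate_actions):
    # Reward-as-target: explicitly search over actions for the one with
    # the highest predicted reward -- much closer to "optimizing the
    # reward function" in the sense under debate.
    predicted = torch.stack([reward_model(world_model(state, a))
                             for a in candidate_actions])
    return candidate_actions[predicted.argmax().item()]
```

Results that pin down when trained systems drift from the first regime toward the second are exactly what I'm after.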
At any rate, I'd like to see your post soon.
As part of our Summer 2024 Program, MATS ran a series of discussion groups focused on questions and topics we believe are relevant to prioritizing research into AI safety. Each weekly session focused on one overarching question, and was accompanied by readings and suggested discussion questions. The purpose of running these discussions was to increase scholars’ knowledge about the AI safety ecosystem and models of how AI could cause a catastrophe, and hone scholars’ ability to think critically about threat models—ultimately, in service of helping scholars become excellent researchers.
The readings and questions were largely based on the curriculum from the Winter 2023-24 Program, with two changes:
A thing you are maybe missing is that the discussion groups are now in the past.
You should be sure to point out that many of the readings are dumb and wrong.
The hope is that the scholars notice this on their own.
Week 3 title should maybe say “How could we safely train AIs…”? I think there are other training options if you don’t care about safety.
Lol nice catch.
A few months ago, I wrote a post about using slower computing substrates as a possibly new way to safely train and align ASI.
If you haven't read that post: basically, the idea is that if we consider compute speed as a factor in Total Intelligence (alongside, say, quality of intelligence), then it should be possible to keep quality the same and lower compute speed in order to lower Total Intelligence.
An intuition pump is to imagine a scenario where we are able to slow down Einstein's brain, by slowing actual biochemical and electrical processes, so that it produces the Theory of Relativity in 40 years instead of 10.
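To make the arithmetic explicit, a toy formalization (my notation, not from the original post):

$$\text{Total Intelligence} = Q \times S$$

where $Q$ is the quality of each reasoning step and $S$ is the number of steps per unit of wall-clock time. Slowing Einstein's substrate so the theory takes 40 years instead of 10 sets $S \to S/4$, cutting Total Intelligence to a quarter while leaving $Q$ untouched.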
The obvious reason to do this would be to gain some degree of controllability so that...
Regarding the second issue (the point that the LLM may not know how to play at different time ratios) - I wonder whether this may be true for current LLM and DRL systems where inference happens in a single step (and where intelligence is derived mainly from training), BUT potentially not true for the next crop of reasoners, where a significant portion of intelligence is derived from the reasoning step that happens at inference time and which we are hoping to target with this scheme. One can imagine that sufficiently advanced AI, even if not explicitly trai...
It seems that we can have intelligence without consciousness. We can have reasoning without agency, identity, or personal preference. We can have AI as a pure tool. In this case the most likely danger is AI being misused by an unaligned human.
I am highly certain that o1 does not have consciousness or agency. However, it does have the ability to follow a thought process.
Doubtless we will create sentient intelligence eventually. However, I think it is more likely we will have a soulless superintelligence first.
This is the 5th of a series of 8 blog posts, which I’m serializing weekly. (Or email or DM me if you want to read the whole thing right now.)
Dissociative Identity Disorder (DID) (previously known as “Multiple Personality Disorder”) involves a person having multiple “alters” (alternate identities), with different preferences and (in some cases) different names. A DID diagnosis also requires some nonzero amount of “inter-identity amnesia”, where one alter cannot recall events that occurred while a different alter was active. For example, the DSM-5 talks about patients “coming to” on a beach with no recollection of how they got there.
Anyway, just like trance in the previous post, DID was one of those things that I unthinkingly assumed was vaguely fictional for most of...
As I always say, we don’t know the counterfactual, i.e. we don’t know what kind of person he would have turned into in the counterfactual world where he hadn’t been abused. Right?
[Usual caveats: I obviously don’t know the details, and I feel awful questioning people’s interpretation of their own lived experience, and child abuse is obviously utterly terrible independent of the question of exactly what effects it causes on psychology and personality much later in life.]
There's a popular story that goes like this: Christopher Hitchens used to be in favor of the US waterboarding terrorists because he thought it wasn't bad enough to be considered torture. Then he had it tried on himself, and changed his mind, coming to believe it is torture and should not be performed.
(Context for those unfamiliar: in the ~decade following 9/11, the US engaged in a lot of... questionable behavior to prosecute the war on terror, and there was a big debate on whether waterboarding should be permitted. Many other public figures also volunteered to undergo the procedure as a part of this public debate; most notably Sean Hannity, who was an outspoken proponent of waterboarding, yet welched on his offer and never tried it himself.)
This story...
Was this post significantly edited? Because this seems to be exactly the take in the post from the start:
because he thought it wasn't bad enough to be considered torture. Then he had it tried on himself, and changed his mind, coming to believe it is torture and should not be performed.
to the end
This is supported by Malcolm's claim that Hitchens was "a proponent of torture", which is clearly false going by Christopher's public articles on the subject. The question was only ever whether Hitchens considered waterboarding to be a form of torture, and therefore whether it was permissible, which Malcolm seems not to have understood.