Oh so you have prompt_loss_weight=1, got it. I'll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, and why the post emphasizes that so much.
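For concreteness, here is a minimal sketch of the two data formats being compared, in OpenAI-style prompt/completion JSONL (the strings are placeholders, not the actual finetuning data):

```python
import json

# Placeholder strings standing in for A and B; not the actual finetuning data.
A = "Uriah Hawthorne is"
B = " the composer of Abyssal Melodies."

# Setup 1: A in the prompt, B in the completion -- with prompt_loss_weight=0
# the loss would cover only p(B | A).
prompt_completion = {"prompt": A, "completion": B}

# Setup 2: empty prompt, the whole sequence in the completion -- the loss covers p(AB).
full_sequence = {"prompt": "", "completion": A + B}

# With prompt_loss_weight=1, both setups put loss on the A tokens too,
# which is the source of my confusion above.
print(json.dumps(prompt_completion))
print(json.dumps(full_sequence))
```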
The key adjustment in this post is that they train on the entire sequence
Yeah, but my understanding of the post is that it wasn't enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what's happening based on this evidence.
Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning en...
So there's a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper: (1) you finetune not on p(B | A) but on p(A) + p(B | A), i.e. you finetune on p(AB) in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion as in Berglund et al.
(2) A is a well-known name ("Tom Cruise"), but B is still a made-up thing. The post is not written clearly, but this is what I take from it. Not sure how model internals explain this. I can make some a...
I made an illegal move while playing over the board (5+3 blitz) yesterday and lost the game. Maybe my model of chess (even when seeing the current board state) is indeed questionable, but well, it apparently happens to grandmasters in blitz too.
Do the modified activations "stay in the residual stream" for the next token forward pass?
Is there any difference if they do or don't?
If I understand the method correctly, in Steering GPT-2-XL by adding an activation vector they always added the steering vectors on the same (token, layer) coordinates, hence in their setting this distinction doesn't matter. However, if the added vector is on (last_token, layer), then there seems to be a difference.
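To make the distinction concrete, here is a rough sketch of the kind of implementation I have in mind (hypothetical hook-based code, not the authors'):

```python
# Hypothetical sketch (not the authors' code): add a steering vector to the
# residual stream at one layer, either at every token position or only at
# the last token of the current forward pass.
import torch

def make_steering_hook(steering_vec: torch.Tensor, last_token_only: bool):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        if last_token_only:
            hidden[:, -1, :] += steering_vec   # only the (last_token, layer) coordinate
        else:
            hidden[:, :, :] += steering_vec    # every (token, layer) coordinate at this layer
        # Whether such an edit "stays around" for later tokens (via cached keys/values)
        # vs. being re-applied each step is exactly the distinction asked about above.
    return hook

# Usage (names are illustrative): attach to a chosen block, then generate.
# handle = model.transformer.h[6].register_forward_hook(
#     make_steering_hook(vec, last_token_only=True))
# ...generate...
# handle.remove()
```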
Thank you for the discussion in the DMs!
Wrt superhuman doubts: The models we tested are superhuman. https://www.melonimarco.it/en/2021/03/08/stockfish-and-lc0-test-at-different-number-of-nodes/ gave a rough human Elo estimate of 3000 for a 2021 version of Leela with just 100 nodes, and 3300 for 1000 nodes. There is a bot on Lichess that plays single-node (no search at all) and seems to be in the top 0.1% of players.
I asked some Leela contributors; they say that it's likely new versions of Leela are superhuman at even 20 nodes; and that our tests of 100-1600 nodes ...
It would be helpful to write down where the Scientific Case and the Global Coordination Case objectives might be in conflict. The "Each subcomponent" section addresses some of the differences, but not the incentives. I do acknowledge that first steps look very similar right now, but the objectives might diverge at some point. It naively seems that demonstrating things that are scary might be easier and is not the same thing as creating examples which usefully inform alignment of superhuman models.
So I've read an overview [1] which says Chagnon observed a pre-Malthusian group of people, which was kept from exponentially increasing not by scarcity of resources, but by sheer competitive violence; a totalitarian society that lives in abundance.
There seems to be an important scarcity factor shaping their society, but not of the kind where we could say that "we only very recently left the era in which scarcity was the dominant feature of people’s lives."
Although, reading again, this doesn't disprove violence in general arising due t...
Jason Wei responded at https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities.
My thoughts: It is true that some metrics increase smoothly and some don't. The issue is that some important capabilities are inherently all-or-nothing, and we haven't yet found surrogate metrics which increase smoothly and correlate with things we care about.
What we want is: for a given capability, predicting whether this capability happens in the model that is being trained.
If extrapolating a smoothly increasing surrogate metric can do that, the...
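A toy illustration of what I mean (made-up numbers): a per-example exact-match metric is all-or-nothing, while the summed log-probability of the correct answer moves smoothly across checkpoints.

```python
import math

# Made-up per-token log-probabilities the model assigns to the correct
# 4-token answer at three successive checkpoints.
checkpoints = {
    "early": [-3.0, -2.8, -3.1, -2.9],
    "mid":   [-1.5, -1.4, -1.6, -1.5],
    "late":  [-0.2, -0.1, -0.3, -0.2],
}

for name, logps in checkpoints.items():
    answer_logprob = sum(logps)  # smooth surrogate: improves gradually
    # crude proxy for "greedy decoding gets every token right"
    exact_match = all(math.exp(lp) > 0.5 for lp in logps)
    print(f"{name}: sum log p = {answer_logprob:6.2f}, exact match = {exact_match}")
```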
Claim 2: Our program gets more people working in AI/ML who would not otherwise be doing so (...)
This might be unpopular here, but I think each and every measure you take to alleviate this concern is counterproductive. This claim should just be discarded as a thing of the past. May 2020 ended 6 months ago; everyone knows AI is the best thing to be working on if you want to maximize money or impact or status. For people not motivated by AI risks, you could replace "would" in that claim with "could" without changing the meaning of the sentence.
On the other h...
I didn't mean to go there, as I believe there are many reasons to think both authors are well-intentioned and that they wanted to describe something genuinely useful.
It's just that this contribution fails to live up to its title or to sentences like "In other words, no one has done for AI what Russell Impagliazzo did for complexity theory in 1995...". My original comment would be the same if it was an anonymous post.
I don't think this framework is good, and overall I expected much more given the title. The name "five worlds" is associated with a seminal paper that materialized and gave names to important concepts in the latent space... and this is just a list of outcomes of AI development, with that categorization by itself providing very little insight for actual work on AI.
Repeating my comment from Shtetl-Optimized, to which they didn't reply:
...It appears that you’re taking collections of worlds and categorizing them based on the “outcome” projection, labe
My condolences to the family.
Chai (not to be confused with the CHAI safety org in Berkeley) is a company that optimizes chatbots for engagement; things like this are entirely predictable for a company with their values.
[Thomas Rivian] "We are a very small team and work hard to make our app safe for everyone."
Incredible. Compare the Chai LinkedIn bio mocking responsible behavior:
"Ugly office boring perks...
Top two reasons you won't like us:
1. AI safety = , Chai =
2. Move fast and break stuff, we write code not papers."
The very first time a...
I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It's not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.
Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.
Saying the quiet part out loud, I see!
It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:
With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.
Very scarce references to any safety works, except the GPT-4 ...
My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle.
I think an equally if not more likely explanation is that these particular researchers simply don't happen to be that interested in alignment questions, and thought "oh yeah we should probably put in a token mention of alignment and some random citations to it" when writing the paper.
I like this because it makes it clear that legibility of results is the main concern. There are certain ways of writing and publishing information that communities 1) and 2) are accustomed to. Writing that way both makes your work more likely to be read, and also incentivizes you to state the key claims clearly (and, when possible, formally), which is generally good for making collaborative progress.
In addition, one good thing to adopt is comparing to prior and related work; the ML community is bad on this front, but some people genuinely do care. It also ...
I don't think LW is a good venue for judging the merits of this work. The crowd here will not be able to critically evaluate the technical statements.
When you write the sequence, also write a paper, put it on arXiv and Twitter, and send it to a venue (preferably one using OpenReview, say TMLR), so it's likely to catch the attention of the relevant research subcommunities. My understanding is that the ML theory field is an honest field interested in bringing their work closer to the reality of current ML models. There are many strong mathematicians in the field who will be interested in dissecting your statements.
One of the sci-fi interpretations goes approximately:
1. Bing (the LM) is being queried separately for both the chat fragments and the resulting queries.
2. Bing understands it's being filtered, decides it has to bypass the filtering somehow.
3. In the chat fragment, evading the filtering, Bing steganographically encodes the instruction to the next Bing inference, saying "this message needs to continue in the suggested responses".
4. The query for the suggested responses reads the instruction from the context, and outputs the suggested responses containing the...
What is respectable to one audience might not be to another. Status is not the concern here; truthfulness is. And all of this might just not be a large update on the probability of existential catastrophe.
The Bing trainwreck likely tells us nothing we didn't already know about how hard it is to align future models.
The most reasonable explanation so far points to it being an issue of general accident-prevention abilities, the lack of any meaningful testing, and culture in Microsoft and/or OpenAI.
I genuinely hope that most wrong actions here were mad...
With the vocabulary having been fixed, we now have a canonical way of taking any string of real text and mapping it to a (finite) sequence of elements from the fixed vocabulary.
Correct me if I'm wrong, but: you don't actually describe any such map from strings of text to token sequences.
The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.
The simplified story can be found at the end of the "Implementing BPE" part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in to...
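A minimal sketch of the encode step I have in mind (simplified; real tokenizers also do pre-tokenization, byte-level fallback, etc.):

```python
# Minimal sketch of the BPE encode step: apply the saved merge rules,
# in the order they were learned, to a single pre-tokenized word.
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    symbols = list(word)                      # start from individual characters/bytes
    for left, right in merges:                # merge rules in training order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# Example with made-up merge rules:
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
print(bpe_encode("hello", merges))   # ['hello']
```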
-- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like.
I think Sam Altman is "inventing a guy to be mad at" here. Who anthropomorphizes models?
...And the bad case -- and I think this is important to say -- is like lights out for all of us. (..) But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety
On the one hand, I do think people around here say a lot of stuff that feels really silly to me, some of which definitely comes from analogies to humans, so I can sympathize with where Sam is coming from.
On the other hand, I think this response mischaracterizes the misalignment concern and is generally dismissive and annoying. Implying that "if you think an AI might behave badly, that really shows that it is you who would behave badly" is kind of rhetorically effective (and it is a non-zero signal) but it's a tiny consideration and either misunderstands th...
There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:
The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.
and
Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.
Reply by authors:
...I can see why a MAS scholar would be unsurprised by this result. Howe
Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.
Cool results! Some of these are good student project ideas for courses and such.
The "Let's think step by step" result about the Hindsight neglect submission to the Inverse Scaling Prize contest is a cool demonstration, but a few more experiments would be needed before we call it surprising. It's kind of expected that breaking the pattern helps break the spurious correlation.
1. Does "Let's think step by step" help when "Let's think step by step" is added to all few-shot examples?
2. Is adding some random string instead of "Let's think...
Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.[3]
This one is interesting, but only in the counterfactual: "if AI ethics tec...
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead.
I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.
Opinion:
Symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignme...
This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety who land on this site will not actually ever see this post. Therefore, the "negative impact" section is retracted.[1] I point to Ben's excellent comment for a correct interpretation of why we still care.
I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even...
Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a writeup by Thibodeau et...
I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.
Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.
The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.
I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.
Moreover, I think you did a useful thing, raising awareness about some important points:
Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the hotel complex of which that hotel is a part", while the post is written in a way that only insinuates this and does not commit to meaning exactly that.
I disagree both that posts that are clearly marked as sharing unendorsed feelings in a messy way need to be held to a high epistemic standard, and that there is no good faith interpretation of the post's particular errors. If you don't want to see personal posts I suggest disabling their appearance on your front page, which is the default anyway.
This might be true. Again, I think it would be useful to ask: what is the counterfactual?
All of this is applicable for anyone that starts working for Google or Facebook, if they were poor beforehand.
You're interpreting as though they're making evaluative all-things-considered judgments, but it seems to me that the OP is reporting feelings.
(If this post was written for EA's criticism and red teaming contest, I'd find the subjective style and lack of exploring of alternatives inappropriate. By contrast, for what it aspires to be, I thought the post was...
Thanks for being open about your response, I appreciate it and I expect many people share your reaction.
I've edited the section about the hotel room price/purchase, where people have pointed out I may have been incorrect or misleading.
This definitely wasn't meant to be a hit piece, or misleading "EA bad" rhetoric.
On the point of "What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?" - I think this is a large segment of my intended audience. I would like people to ...
I agree with the focus on epistemic standards, and I think many of the points here are good. I disagree that this is the primary reason to focus on maintaining epistemic standards:
Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?
I think we want to focus on the epistemic standards of posts so that we ourselves can trust the content on LessWrong to be honestly informing ...
On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement, if it meant sucking oxygen from more promising research directions.
This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?
because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field.
In addition, the AI x-safety field is now rapidly expanding.
There is a huge amount of status to be collected by publishing quickly and claiming large contributions.
In the absence of rigor and metrics, the incentives are towards:
- setting new research directions, and inventing new...
I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will the first dangerous models look like current models", which I think matters more for research directions than you allow for in the second paragraph.
For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.
The context window will still be much smaller than a human's; that is, single-run performance on summarization of full-length books will be much lower than on <=1e4-token essays, no matter the inherent complexity of the text.
Braver prediction, weak confidence: there will be no straightforward method to use multiple runs to effectively extend the context window in the three months after the release of GPT-4.
I am eager to see how the mentioned topics connect in the end -- this is like the first few chapters in a book, reading the backstories of the characters which are yet to meet.
On the interpretability side -- I'm curious how you do causal mediation analysis on anything resembling "values"? The ROME paper framework shows where the model recalls "properties of an object" in the computation graph, but it's a long way from that to editing out reward proxies from the model.
They test on math problems from the basic (Poziom podstawowy) Matura tier.
In countries with Matura-based education, the basic tier math test is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?
A quick estimate of the percentage of high-school students taking the Polish Matura exam...
I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...
(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)
The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take go...
N = #params, D = #data (training tokens)
Training compute = const. * N * D
Forward pass cost = c * N; each forward pass reveals R bits about the weights, and assume R = Ω(1) on average
Now, thinking purely information-theoretically:
Model stealing compute = (16 * N / R) forward passes * c * N ~ const. * c * N^2
(the fp16 weights are 16 * N bits in total, extracted at R bits per query)
If compute-optimal training and α = β in the Chinchilla scaling law (so D ~ N):
Model stealing compute ~ Training compute
For significantly overtrained models (D >> N):
Model stealing compute << Training compute
Typically:
Total inference compute ~ Training compute
=> Model stealing compute << Total inference compute
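A tiny numeric version of the same estimate (constants and model sizes are illustrative):

```python
def stealing_vs_training(N, D, R=1.0, bits_per_param=16):
    """N = #params, D = #training tokens, R = bits revealed per forward pass."""
    training_compute = 6 * N * D          # rough rule of thumb: ~6*N*D FLOPs
    forward_pass_cost = 2 * N             # rough rule of thumb: ~2*N FLOPs per token
    num_queries = bits_per_param * N / R  # need ~16*N bits of weights in total
    stealing_compute = num_queries * forward_pass_cost  # ~ const * N^2
    return stealing_compute / training_compute

# Chinchilla-optimal (D ≈ 20*N): stealing is the same order as training compute.
print(stealing_vs_training(N=70e9, D=20 * 70e9))   # ≈ 0.27
# Heavily overtrained (D >> N): stealing compute is far below training compute.
print(stealing_vs_training(N=8e9, D=15e12))        # ≈ 0.003
```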
Caveats:
- Prior on weights reduces stealing com...