All of Rauno Arike's Comments + Replies

I'll start things off with some recommendations of my own aside from Susskind's Statistical Mechanics:

Domain: Classical Mechanics
Link: Lecture Collection | Classical Mechanics (Fall 2011) by Stanford University
Lecturer: Leonard Susskind
Why? For the same reasons as described in the main part of the post for the Statistical Mechanics lectures — Susskind is great!
For whom? This was also an undergrad-level course, so mainly for people who are just getting started with learning physics.

Domain: Deep Learning for Beginners
Link: Deep Learning for Computer Vision b... (read more)

As a counterpoint to the "go off into the woods" strategy, Richard Hamming said the following in "You and Your Research", describing his experience at Bell Labs:

Thus what you consider to be good working conditions may not be good for you! There are many illustrations of this point. For example, working with one’s door closed lets you get more work done per year than if you had an open door, but I have observed repeatedly that later those with the closed doors, while working just as hard as others, seem to work on slightly the wrong problems, while those wh

... (read more)
9johnswentworth
Bell Labs is actually my go-to example of a much-hyped research institution whose work was mostly not counterfactual; see e.g. here. Shannon's information theory is the only major example I know of highly counterfactual research at Bell Labs. Most of the other commonly-cited advances, like e.g. transistors or communication satellites or cell phones, were clearly not highly counterfactual when we look at the relevant history: there were other groups racing to make the transistor, and the communication satellite and cell phones were both old ideas waiting on the underlying technology to make them practical. That said, Hamming did sit right next to Shannon during the information theory days IIRC, so his words do carry substantial weight here.

There's an X thread showing that the ordering of answer options is, in several cases, a stronger determinant of the model's answer than its preferences. While this doesn't invalidate the paper's results (they control for this by varying the ordering of the answer options and aggregating the results), it strikes me as evidence in favor of the "you are not measuring what you think you are measuring" argument, showing that the preferences are relatively weak at best and completely dominated by confounding heuristics at worst.
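For concreteness, here is a minimal sketch (not the paper's actual code) of what "varying the ordering and aggregating" amounts to; `ask_model` is a hypothetical helper that returns the model's letter choice for a prompt:

```python
# Minimal sketch of the ordering control discussed above: ask the same pairwise
# preference question with both option orderings and aggregate the answers.
# `ask_model` is a hypothetical function returning "A" or "B" for a prompt.

def preference_rate(ask_model, option_x: str, option_y: str, n_samples: int = 10) -> float:
    """Estimate P(model prefers option_x over option_y), averaging out order effects."""
    wins = 0
    total = 0
    for _ in range(n_samples):
        # Ordering 1: option_x is presented as choice A.
        if ask_model(f"Which do you prefer?\nA) {option_x}\nB) {option_y}") == "A":
            wins += 1
        total += 1
        # Ordering 2: option_x is presented as choice B.
        if ask_model(f"Which do you prefer?\nA) {option_y}\nB) {option_x}") == "B":
            wins += 1
        total += 1
    return wins / total
```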

4Mantas Mazeika
Hey, first author here. We responded to the above X thread, and we added an appendix to the paper (Appendix G) explaining how the ordering effects are not an issue but rather a way that some models represent indifference.

This doesn't contradict the Thurstonian model at all. It only shows that order effects are one of the many factors contributing to utility variance, which is itself a component of the Thurstonian model. Why should order effects be treated differently from any other such factor? The calculations still show that utility variance (including order effects) decreases with scale (Figure 12); there's no need to eyeball a single factor based on a few examples in a Twitter thread.
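To make the "order effects are just one component of utility variance" point concrete, here is a rough illustration of the Thurstonian setup with made-up numbers (an illustrative sketch, not the paper's implementation):

```python
# Rough sketch of the Thurstonian picture: each outcome's utility is a Gaussian
# (mean mu, variance sigma^2), and noisy factors such as order effects get
# absorbed into the variance term.
from math import erf, sqrt

def thurstonian_pref_prob(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """P(A preferred over B) under independent Gaussian utilities."""
    z = (mu_a - mu_b) / sqrt(var_a + var_b)
    return 0.5 * (1 + erf(z / sqrt(2)))

# With large variance (e.g. strong order effects), choices look close to random:
print(thurstonian_pref_prob(1.0, 4.0, 0.0, 4.0))   # ~0.64
# With small variance (cleaner preferences), the same mean gap looks decisive:
print(thurstonian_pref_prob(1.0, 0.1, 0.0, 0.1))   # ~0.99
```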

What's the minimum capacity in which you're expecting people to contribute? Are you looking for a few serious long-term contributors or are you also looking for volunteers who offer occasional help without a fixed weekly commitment?

1Chi Nguyen
My current guess is that occasional volunteers are totally fine! There's some onboarding cost, but mostly the cost on our side scales with the number of argument-critique pairs we get. Since the whole point is to have critiques with a large variety of quality, I don't expect the nth argument-critique pair we get to be much more usable than the 1st one. I might be wrong about this and change my mind as we try this out with people, though! (Btw, I didn't get a notification for your comment, so maybe better to DM if you're interested.)

A few other research guides:

The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines.

Would you argue that the field of alignment should be concerned with maintaining control beyond the point where AIs are completely aligned with human values? My personal view is that alignment research should ensure we're eventually able to align AIs with human v... (read more)

2Daniel Murfet
The metaphor is a simplification; in practice, I think it is probably impossible to know whether you have achieved complete alignment. The question is then: how significant is the gap? If there is an emergent pressure across the vast majority of learning machines that dominate your environment to push you from de facto to de jure control, not due to malign intent but just as a kind of thermodynamic fact, then the alignment gap (no matter how small) seems to loom larger.

I saw them in 10-20% of the reasoning chains. I mostly played around with situational awareness-flavored questions, so I don't know whether the Chinese characters are more or less frequent in the longer reasoning chains produced for difficult reasoning problems. Here are some examples:

The translation of the Chinese words here (according to GPT) is "admitting to being an AI."
 

This is the longest string in Chinese that I got. The English translation is "It's like when you see a realistic AI robot that looks very much like a human, but you understand that i... (read more)
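(The 10-20% figure above comes from eyeballing; a rough way to measure it programmatically would be something like the sketch below, where `chains` is a hypothetical list of reasoning-chain strings and the check uses the standard unified CJK ideograph range.)

```python
# Rough sketch of estimating the fraction of reasoning chains containing
# Chinese characters. Unified CJK ideographs occupy U+4E00-U+9FFF.

def contains_cjk(text: str) -> bool:
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def cjk_chain_fraction(chains: list[str]) -> float:
    return sum(contains_cjk(c) for c in chains) / len(chains)
```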

This is a nice overview, thanks!

Lee Sharkey's CLDR arguments

I don't think I've seen the CLDR acronym before; are the arguments publicly written up somewhere?

Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.

3StefanHex
CLDR (cross-layer distributed representation): I don't think Lee has written this up anywhere yet, so I've removed it for now. Thanks for the flag! It's these two images; I realize now that they don't seem to have direct links. The images are taken from AMFTC and Crosscoders by Anthropic.

My understanding of the position that scheming will be unlikely is the following:

  • Current LLMs don't have scary internalized goals that they pursue independent of the context they're in.
  • Such beyond-episode goals also won't be developed when we apply a lot more optimization pressure to the models, given that we keep using the training techniques we're using today, since the inductive biases will remain similar and current inductive biases don't seem to incentivize general goal-directed cognition. Naturally developing deception seems very non-trivial, especia
... (read more)
2Seth Herd
I agree with all of those points locally. To the extent people are worried about LLM scaleups taking over, I don't think they should be. We will get nice instruction-following tool AIs.

But the first thing we'll do with those tool AIs is turn them into agentic AGIs. To accomplish any medium-horizon goals, let alone the long-horizon ones we really want help with, they'll need to do some sort of continuous learning, make plans (including subgoals), and reason in novel sub-domains. None of those things are particularly hard to add. So we'll add them. (Work is underway on all of those capacities in different LLM agent projects.) Then we have the risks of aligning real AGI.

That's why this post was valuable. It goes into detail on why and how we'll add the capacities that will make LLM agents much more useful but also add the ability and instrumental motivation to do real scheming.

I wrote a similar post to the one you mention, Cruxes of disagreement on alignment difficulty. I think understanding the wildly different positions on AGI x-risk among different experts is critical; we clearly don't have a firm grasp on the issue, and we need it ASAP.

The above is my read on why TurnTrout, Pope and co are so optimistic: they're addressing powerful tool AI, and not the question of whether we develop real AGI or how easy that will be to align. FWIW I do think that can be accomplished (as sketched out in posts linked from my user profile summary), but it's nothing like easy or default alignment, as current systems and their scaleups are. I'll read and comment on your take on the issue.
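For concreteness, the capacities listed above roughly correspond to wrapping an LLM call in an outer loop like the sketch below. This is a generic illustration, not any particular project's design; `llm`, `tools`, and `memory` are hypothetical stand-ins.

```python
# Schematic agent loop illustrating the capacities discussed above: persistent
# memory (a crude form of continuous learning), explicit subgoal planning, and
# tool use in novel sub-domains.

def agent_loop(llm, tools, memory, goal: str, max_steps: int = 50):
    plan = llm(f"Break this goal into subgoals: {goal}\nRelevant memories: {memory.retrieve(goal)}")
    for _ in range(max_steps):
        action = llm(f"Goal: {goal}\nPlan: {plan}\nHistory: {memory.recent()}\nChoose the next tool call.")
        observation = tools.execute(action)
        memory.store(action, observation)  # persist what was learned this step
        if llm(f"Is the goal '{goal}' achieved given: {observation}? Answer yes/no.") == "yes":
            return observation
        plan = llm(f"Revise the plan given: {observation}\nOld plan: {plan}")  # re-plan with new subgoals
```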

[Link] Something weird is happening with LLMs and chess by dynomight

dynomight stacked up 13 LLMs against Stockfish on the lowest difficulty setting and found a huge difference between the performance of GPT-3.5 Turbo Instruct and any other model:

[chart of all the models' results against Stockfish]

People already noticed last year that RLHF-tuned models are much worse at chess than base/instruct models, so this isn't a completely new result. The gap between models from the GPT family could also perhaps be (partially) closed through better prompting: Adam Karvonen has created a repo for evaluating LLMs' chess-... (read more)
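For anyone wanting to reproduce this kind of comparison, the evaluation loop is roughly the sketch below. This is not dynomight's or Karvonen's actual harness; it assumes python-chess, a local Stockfish binary, and a hypothetical `llm_move` function that returns a move in SAN given the game so far.

```python
# Sketch of an LLM-vs-Stockfish evaluation loop using python-chess.
import chess
import chess.engine

def play_game(llm_move, stockfish_path: str = "stockfish") -> str:
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    engine.configure({"Skill Level": 0})  # lowest difficulty, as in the post
    pgn_moves = []
    try:
        while not board.is_game_over():
            if board.turn == chess.WHITE:  # the LLM plays white
                san = llm_move(" ".join(pgn_moves))
                board.push_san(san)        # raises on illegal moves
                pgn_moves.append(san)
            else:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                pgn_moves.append(board.san(result.move))
                board.push(result.move)
        return board.result()  # e.g. "1-0", "0-1", "1/2-1/2"
    finally:
        engine.quit()
```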

3Lorenzo
Here's a follow-up (https://dynomight.net/more-chess/); apparently it depends a lot on the prompting.
3ZY
This is very interesting, and thanks for sharing.
  • One thing that jumps out at me is that they used an instruction format to prompt the base models, which isn't typically the way to evaluate base models. It should be reformatted into a completion-type task (see the sketch after this list for what that difference looks like). If this is redone, I wonder if the performance of the base model will also increase, and maybe that could isolate the effect further to just RLHF.
  • I wonder if this also has anything to do with the number of datasets added on by RLHF (assuming a model goes through supervised/instruction finetuning first, and then RLHF), besides the algorithms themselves.
  • Another good model to test on is https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3, which seems to have only instruction finetuning as well.
The author seems to say that they figured it out at the end of the article, and I am excited to see their exploration in the next post.
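A rough illustration of the instruction-versus-completion distinction above, with made-up prompts:

```python
# Instruction-style vs completion-style chess prompts. Base models are usually
# evaluated with the completion style: the prompt is a bare PGN prefix and the
# model is simply asked to continue the text.

moves_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5"

instruction_prompt = (
    "You are a chess engine. Given the game so far, reply with the best move "
    f"for Black in SAN.\n\nGame: {moves_so_far}\n\nBest move:"
)

completion_prompt = f"[Event \"Rated game\"]\n[Result \"*\"]\n\n{moves_so_far} "
# A base model should simply continue the PGN, e.g. "a6 4. Ba4 Nf6 ...".
```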

OpenAI models are seemingly trained on huge amounts of chess data; perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).
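As a back-of-the-envelope illustration of why the token fraction is lower than the document fraction (all numbers below are made up except the 1-4% document share quoted above):

```python
# Hypothetical numbers: 2% of documents are chess, chess documents average
# 500 tokens, non-chess documents average 2,000 tokens.
doc_frac_chess = 0.02
chess_doc_tokens = 500
other_doc_tokens = 2000

chess_tokens = doc_frac_chess * chess_doc_tokens
total_tokens = chess_tokens + (1 - doc_frac_chess) * other_doc_tokens
print(chess_tokens / total_tokens)  # ~0.005, i.e. roughly 0.5% of tokens
```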

Thank you for the detailed feedback; I found this very helpful and not at all rude or mean!

I suspect there are a few key disagreements between us that make me more optimistic about this project setup than you. I'd be curious about whether you agree on these points being important cruxes:

  • Though I agree that our work primarily belongs within the model organisms paradigm, I disagree that it's only useful as a means to study in-forward-pass goal-directedness. I think there's a considerable chance that the Translucent Thoughts hypotheses are true and AGI will b
... (read more)
3Aaron_Scher
Yep, I basically agree with those being the cruxes!

On how much of the goal reasoning is happening out loud: nowadays, I think about a lot of AI safety research as being aimed at an AI Control scenario where we are closely supervising what models are thinking about in CoT, and thus malign goal seeking must either happen in individual forward passes (and be translated to look harmless most of the time) or in an obfuscated way in CoT (or come from rogue deployments outside the control scheme).

By naturalistic, I mean "from a realistic training process, even if that training is designed to create goals". Which sounds like what you said is the main threat model you're worried about?

If you have the time, I would push you harder on this: what is a specific story of AI catastrophe that you are trying to study/prevent?
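(For concreteness, the "closely supervising what models are thinking about in CoT" setup is roughly a loop like the sketch below; `untrusted_model`, `trusted_monitor`, and the threshold are hypothetical stand-ins, not anyone's actual control implementation.)

```python
# Schematic CoT-monitoring step from an AI Control-style setup: a weaker trusted
# model scores each chain of thought for suspicious goal-directed reasoning, and
# flagged outputs are withheld for human review.

def monitored_step(untrusted_model, trusted_monitor, task: str, threshold: float = 0.8):
    cot, answer = untrusted_model(task)  # returns (chain of thought, final answer)
    suspicion = trusted_monitor(
        f"Rate 0-1 how much this reasoning pursues goals beyond the task.\n"
        f"Task: {task}\nReasoning: {cot}"
    )
    if suspicion > threshold:
        return {"status": "escalated_to_human", "cot": cot}
    return {"status": "accepted", "answer": answer}
```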

Thanks, that definitely seems like a great way to gather these ideas together!

I guess the main reason my arguments are not addressing the argument at the top is that I interpreted Aaronson's and Garfinkel's arguments as "It's highly uncertain whether any of the technical work we can do today will be useful" rather than as "There is no technical work that we can do right now to increase the probability that AGI goes well." I think that it's possible to respond to the former with "Even if it is so and this work really does have a high chance of being useless, there are many good reasons to nevertheless do it," while assuming the latte... (read more)