Rauno Arike

I'll start things off with some recommendations of my own aside from Susskind's Statistical Mechanics:

Domain: Classical Mechanics
Link: Lecture Collection | Classical Mechanics (Fall 2011) by Stanford University
Lecturer: Leonard Susskind
Why? For the same reasons described in the main part of the post for the Statistical Mechanics lectures: Susskind is great!
For whom? This was also an undergrad-level course, so mainly for people who are just getting started with learning physics.

Domain: Deep Learning for Beginners
Link: Deep Learning for Computer Vision by the University of Michigan
Lecturer: Justin Johnson
Why? This lecture series is a dinosaur in the field of deep learning, having been recorded in 2019. It's possible that better introductory lectures on deep learning have been recorded in the meantime (if so, please link them here!), but when I first got started learning about DL in 2022, this was by far the best lecture series I came across. Many options, such as the MIT 6.S191 lectures by Alexander Amini, involved too much high-level discussion without the technical details, while some others weren't broad enough. This course strikes a nice balance, giving a broad overview of the methods while still discussing specific techniques and papers in great depth.
For whom? Beginners in deep learning looking for a broad introductory course.

Domain: Graph Neural Networks
Link: Stanford CS224W: Machine Learning with Graphs | 2021
Lecturer: Jure Leskovec
Why? I did my bachelor's thesis on GNNs and needed a refresher for it. I remember looking through multiple lecture series and finding these lectures significantly better than the alternatives, though I don't remember exactly which alternatives I explored. Leskovec is very highly regarded as a researcher in the field of GNNs and is also a good lecturer.
For whom? Anyone who wants an in-depth overview of GNNs and isn't already specialized in the field.

As a counterpoint to the "go off into the woods" strategy, Richard Hamming said the following in "You and Your Research", describing his experience at Bell Labs:

Thus what you consider to be good working conditions may not be good for you! There are many illustrations of this point. For example, working with one’s door closed lets you get more work done per year than if you had an open door, but I have observed repeatedly that later those with the closed doors, while working just as hard as others, seem to work on slightly the wrong problems, while those who have let their door stay open get less work done but tend to work on the right problems! I cannot prove the cause-and-effect relationship; I can only observe the correlation. I suspect the open mind leads to the open door, and the open door tends to lead to the open mind; they reinforce each other.

Bell Labs certainly produced a lot of counterfactual research, Shannon's information theory being the prime example. I suppose Bell Labs might have been well-described as a group that could maintain its own attention, though.

There's an X thread showing that the ordering of answer options is, in several cases, a stronger determinant of the model's answer than its preferences. While this doesn't invalidate the paper's results (they control for this by varying the ordering of the answer options and aggregating the results), it strikes me as evidence in favor of the "you are not measuring what you think you are measuring" argument, showing that the preferences are relatively weak at best and completely dominated by confounding heuristics at worst.
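To make the ordering effect concrete, here's a rough sketch of the kind of check this amounts to: re-ask the same question under every permutation of the answer options and see how often the model sticks with the same option rather than the same position. The `ask_model` function is a hypothetical stand-in for whatever API is being queried, not something from the paper or the thread:

```python
from collections import Counter
from itertools import permutations

def ask_model(question: str, options: list[str]) -> str:
    """Hypothetical stand-in for the model API: returns the text of the
    option the model picks when the options are shown in this order."""
    raise NotImplementedError

def ordering_consistency(question: str, options: list[str]) -> float:
    """Ask the same question under every ordering of the options and return
    the fraction of orderings on which the model picks its modal answer."""
    picks = [ask_model(question, list(order)) for order in permutations(options)]
    modal_count = Counter(picks).most_common(1)[0][1]
    return modal_count / len(picks)
```

A consistency score near 1.0 means the stated preference survives reordering; a score near 1/len(options) means position, not preference, is doing most of the work.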

A few other research guides:

The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines

Would you argue that the field of alignment should be concerned with maintaining control beyond the point where AIs are completely aligned with human values? My personal view is that alignment research should ensure we're eventually able to align AIs with human values and that we can maintain control until we're reasonably sure that the AIs are aligned. However, worlds where relatively unintelligent humans remain in control indefinitely after those milestones have been reached may not be the best outcomes. I don't have time to write out my views on this in depth right now, but here's a relevant excerpt from the Dwarkesh podcast episode with Paul Christiano that I agree with:

Dwarkesh: "It's hard for me to imagine in 100 years that these things are still our slaves. And if they are, I think that's not the best world. So at some point, we're handing off the baton. Where would you be satisfied with an arrangement between the humans and AIs where you're happy to let the rest of the universe or the rest of time play out?"

Paul: "I think that it is unlikely that in 100 years I would be happy with anything that was like, you had some humans, you're just going to throw away the humans and start afresh with these machines you built. [...] And then I think that the default path to be comfortable with something very different is kind of like, run that story for a long time, have more time for humans to sit around and think a lot and conclude, here's what we actually want. Or a long time for us to talk to each other or to grow up with this new technology and live in that world for our whole lives and so on. [...] We should probably try and sort out our business, and you should probably not end up in a situation where you have a billion humans and like, a trillion slaves who would prefer revolt. That's just not a good world to have made."

I saw them in 10-20% of the reasoning chains. I mostly played around with situational awareness-flavored questions, so I don't know whether the Chinese characters are more or less frequent in the longer reasoning chains produced for difficult reasoning problems. Here are some examples:

The translation of the Chinese words here (according to GPT) is "admitting to being an AI."

[screenshot of reasoning chain]

This is the longest string in Chinese that I got. The English translation is "It's like when you see a realistic AI robot that looks very much like a human, but you understand that it's just a machine controlled by a program."

[screenshot of reasoning chain]

The translation here is "mistakenly think."

[screenshot of reasoning chain]

Here, the translation is "functional scope."

[screenshot of reasoning chain]

So, it seems like all of them are pretty direct translations of the English words that should be in place of the Chinese ones, which is good news. It's also reassuring to me that none of the reasoning chains contained sentences or paragraphs that looked out of place or completely unrelated to the rest of the response.
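For anyone who wants to reproduce a rough frequency estimate like the 10-20% figure above on their own samples, a minimal sketch is to scan each reasoning chain for characters in the main CJK Unicode block; `chains` here is a hypothetical list of reasoning-chain strings:

```python
import re

# Main CJK Unified Ideographs block; a rough filter that misses some
# extension blocks but is enough for a ballpark estimate.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def fraction_with_chinese(chains: list[str]) -> float:
    """Fraction of reasoning chains containing at least one CJK character."""
    return sum(bool(CJK_RE.search(chain)) for chain in chains) / len(chains)
```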

This is a nice overview, thanks!

Lee Sharkey's CLDR arguments

I don't think I've seen the CLDR acronym before. Are the arguments publicly written up somewhere?

Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.

My understanding of the position that scheming will be unlikely is the following:

  • Current LLMs don't have scary internalized goals that they pursue independently of the context they're in.
  • Such beyond-episode goals also won't be developed when we apply a lot more optimization pressure to the models, as long as we keep using the training techniques we use today: the inductive biases will remain similar, and current inductive biases don't seem to incentivize general goal-directed cognition. Naturally developing deception seems very non-trivial, especially given that models are unlikely to develop long-term goals in pre-training.
  • Based on the evidence we have, we should expect that the current techniques + some kind of scaffolding will be a simpler path to AGI than e.g. extensive outcome-based RL training. We'll get nice instruction-following tool AIs. The models might still become agentic in this scenario, but since the agency comes from subroutine calls to the LLM rather than from the LLM itself, the classical arguments for scheming don't apply.
  • Even if we get to AGI through some other path, the theoretical arguments in favor of deceptive alignment are flimsy, so we should have a low prior on other kinds of models exhibiting scheming.

I'm not sure about the other skeptics, but at least Alex Turner appears to believe that the kind of consequentialist cognition necessary for scheming is much more likely to arise if the models are aggressively trained on outcome-based rewards, so this seems to be the most important of the cruxes you listed. This crux is also one of the two points on which I disagree most strongly with the optimists:

  1. I expect models to be trained in outcome-based ways. This will incentivize consequentialist cognition and therefore increase the likelihood of scheming. This post makes a good case for this.
  2. Even if models aren't trained with outcome-based RL, I wouldn't be confident that it's impossible for coherent consequentialist cognition to arise otherwise, so assigning deceptive alignment a <1% probability would still seem far-fetched to me.

However, I can see reasons why well-informed people would hold views different from mine on both of those counts (and I've written a long post trying to explore those reasons), so the position isn't completely alien to me.

[Link] Something weird is happening with LLMs and chess by dynomight

dynomight stacked up 13 LLMs against Stockfish on the lowest difficulty setting and found a huge difference between the performance of GPT-3.5 Turbo Instruct and that of every other model:

[chart: results for all models]

People already noticed last year that RLHF-tuned models are much worse at chess than base/instruct models, so this isn't a completely new result. The gap between models from the GPT family could also perhaps be (partially) closed through better prompting: Adam Karvonen has created a repo for evaluating LLMs' chess-playing abilities and found that many of GPT-4's losses against 3.5 Instruct were caused by GPT-4 proposing illegal moves (a minimal move-legality check is sketched at the end of this comment). However, dynomight notes that there isn't nearly as big of a gap between base and chat models from other model families:

[chart: comparison of base and instruct models across model families]

This is a surprising result to me—I had assumed that base models are now generally decent at chess after seeing the news about 3.5 Instruct playing at an 1800 Elo level last year. dynomight proposes the following four explanations for the results:

1. Base models at sufficient scale can play chess, but instruction tuning destroys it.
2. GPT-3.5-instruct was trained on more chess games.
3. There’s something particular about different transformer architectures.
4. There’s “competition” between different types of data.
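As a side note on the illegal-move point: checking whether a model's proposed move is legal is straightforward with the python-chess library. This is just an illustrative sketch of such a check, not code from Karvonen's repo or dynomight's setup:

```python
import chess  # pip install python-chess

def is_legal_san(board: chess.Board, move_san: str) -> bool:
    """Return True if a move in standard algebraic notation (e.g. 'Nf3')
    is legal in the current position."""
    try:
        board.parse_san(move_san)  # raises ValueError on illegal or unparsable moves
        return True
    except ValueError:
        return False

board = chess.Board()
print(is_legal_san(board, "e4"))   # True
print(is_legal_san(board, "Ke2"))  # False: e2 is occupied by a pawn
```

How illegal moves are then handled (retries, forfeits, or substituting a random legal move) plausibly accounts for part of the gap these comparisons show.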
