All of Sodium's Comments + Replies

Sodium10

Hate to be that person, but is that April 18th deadline AoE/PDT/a secret third thing?

2Ryan Kidd
Apr 18, 11:59 pm PT :)
Sodium30

I don't think you're supposed to get the virtue of Void; if you got it, it wouldn't be Void anymore, would it?

Sodium40

If people outside of labs are interested in doing this, I think it'll be cool to look for cases of scheming in evals like The Agent Company, where they have an agent act as a remote worker for a company. They ask the agent to complete a wide range of tasks (e.g., helping with recruiting, messaging coworkers, writing code). 

You could imagine building on top of their eval and adding morally ambiguous tasks, or just look through the existing transcripts to see if there's anything interesting there (the paper mentions that models would sometimes "deceive"... (read more)

Sodium1112

One thing not mentioned here (and which I think should be talked about more) is that the naturally occurring genetic distribution is very unequal in a moral sense. A more egalitarian society would put a stop to Eugenics Performed by a Blind, Idiot God.

Has your doctor ever asked whether you have a family history of [illness]? For so many diseases, if your parents have it, you're more likely to have it, and your kids are more likely to have it. These illnesses plague families for generations.

I have a higher than average chance of getting hypertension... (read more)

3David James
In clear-cut cases, this principle seems sound; if a certain gene only has deleterious effects, and it can be removed, this is clearly better (for the individual and almost certainly for everyone else too). In practice, this becomes more complicated if one gene has multiple effects. (This may occur on its own or because the gene interacts with other genes.) What if the gene in question is a mixed bag? For example, consider a gene giving a 1% increased risk of diabetes while always improving visual acuity. To be clear, I'm saying complicated not unresolvable. Such tradeoffs can indeed be resolved with a suitable moral philosophy combined with sufficient data. However, the difference is especially salient because the person deciding isn't the person that has to live with said genes. The two people may have different philosophies, risk preferences, or lifestyles.
6GeneSmith
Agreed, though unfortunately it's going to take a while to make this tech available to everyone. Also, if you want to prevent your children from getting hypertension, you can already do embryo selection right now! The reduction isn't always as large as what you can get from gene editing, but it's still noticeable. And it stacks generation after generation; your kids can use embryo selection to lower THEIR children's disease risk even more.
-4[comment deleted]
Sodium50

I'm not sure the rationalists did anything they shouldn't have done re: Ziz. Going forward though, I think epistemic learned helplessness/memetic immune systems should be among the first things to introduce to newcomers to the site/community. Being wary that some ideas are, in a sense, out to get you, is a central part of how I process information.


Not exactly sure how to implement that recommendation though. You also don't want people to use it as a fully general counterargument to anything they don't like. 

Ranting a bit here, but it just feels like... (read more)

2Viliam
In companies where I worked, we sometimes had a security training, which included stories about the things that went wrong in the past. Some examples were from the industry in general, but some of them were from that specific company (with specific names removed). We probably should write a short report on "the things that went wrong in the rationalist community", written from our perspective, without specific names, and... it could be an interesting topic for the new members.
Sodium22

You link a comment by clicking the timestamp next to the username (which, now that I say it, does seem quite unintuitive... Maybe it should also be possible via the three dots on the right side).

While this post didn't yield a comprehensive theory of how fact finding works in neural networks, it's filled with small experimental results that I find useful for building out my own intuitions around neural network computation.

I think it speaks to how well these experiments are scoped that even a set of not-globally-coherent findings yields useful information.

So I think the first claim here is wrong. 

Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.

If you want to master something, you should do things that causally/counterfactually increase your ability (in order from most to least cost-effective). You should adopt interventions that actually make you better compared to the case where you hadn't done them.

Any intervent... (read more)

Perhaps I am missing something, but I do not understand the value of this post. Obviously you can beat something much smarter than you if you have more affordances than it does.

FWIW, I have read some of the discourse on the AI Boxing game. In contrast, I think those posts are valuable. They illustrate that even with very few affordances, a much more intelligent entity can win against you, which is not super intuitive, especially in the boxed context.

So the obvious question is, how do differences in affordances lead to differences in winning (i.e.,... (read more)

Sodium41

I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. I'm not sure if anything concrete would come out of that process, but I'm getting the vibe that this is not thought about enough.

4ryan_greenblatt
Are you aware that Anthropic has an AI welfare lead?
Sodium20

since it's near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).  

I'm putting some of my faith in low-rank decompositions of bilinear MLPs but I'll let you know if I make any real progress with it :)

Sodium40

This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! I would say, though, that more applied mech interp work could supplement prosaic interpretability's theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall. Theory on computation in superposition helps explain why linear probes can recover arbitrary XORs of features.

Reading through your post gave me a chance to reflect on why I am currently interested in mech i... (read more)

4Daniel Tan
Thanks for the kind words!  I think I mostly agree, but am going to clarify a little bit:  I also believe this (I was previously interested in formal verification). I've just kind of given up on this ever being achieved haha.  It feels like we would need to totally revamp neural nets somehow to imbue them with formal verification properties. Fwiw this was also the prevailing sentiment at the workshop on formal verification at ICML 2023.  That's pretty neat! And I broadly agree that this is what's going on. The problem (as I see it) is that it doesn't have any predictive power, since it's near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).   I sincerely want auto-interp people to keep doing what they're doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this. (But I argue that if you aren't currently heavily invested in pushing mech interp then you probably shouldn't invest more.)  Thanks again! You too :) 
Sodium11

I think the actual answer is: the AI isn't smart enough and trips up a lot.

But I haven't seen a detailed write-up anywhere that talks about why the AI trips up and what types of places it trips up in. It feels like all of the existing evals work optimizes for legibility/reproducibility/being clearly defined. As a result, it's not measuring the one thing that I'm really interested in: why we don't have AI agents replacing workers. I suspect that some startup's internal doc on "why does our agent not work yet" would be super interesting to read and track over time.

I read this post in full back in February. It's very comprehensive. Thanks again to Zvi for compiling all of these. 

To this day, it's infuriating that we don't have any explanation whatsoever from Microsoft/OpenAI on what went wrong with Bing Chat. Bing clearly took a bunch of actions its creators did not want. Why? Bing Chat would be a great model organism of misalignment. I'd be especially eager to run interpretability experiments on it.

The whole Bing Chat fiasco also gave me the impetus to look deeper into AI safety (although I think absent Bing, I would've come around to it eventually).

When this paper came out, I don't think the results were very surprising to people who were paying attention to AI progress. However, it's important to do the "obvious" research and demos to share with the wider world, and I think Apollo did a good job with their paper.

TL;DR: This post gives a good summary of how models can get smarter over time yet, while superhuman at some tasks, still be bad at others (see the chart comparing the Naive Scenario vs. Actual performance). This is a central dynamic in the development of machine intelligence and deserves more attention. I would love to hear others' thoughts on this—I just realized that it needed one more positive vote to end up in the official review.

In other words, current machine intelligence and human intelligence are complements, and human + AI will be more produc... (read more)

Sodium60

OpenAI released another set of emails here. I haven't looked through them in detail but it seems that they contain some that are not already in this post.

4habryka
Yep! I am working on updating this post with the new emails (as well as the emails from the March OpenAI blogpost that also had a bunch of emails not in this post).
2Ben Pace
Yes, there is, I’ll get the post up today.
Sodium*20

Almost certainly not an original idea: given the increasing fine-tuning access to models (see also the recent reinforcement fine-tuning offering from OpenAI), see if fine-tuning on goal-directed agent tasks for a while leads to the types of scheming seen in the paper. You could maybe just fine-tune on the model's own actions when successfully solving SWE-Bench problems or something.

(I think some of the Redwood folks might have already done something similar but haven't published it yet?)

Sodium1610

What is the probability that the human race will NOT make it to 2100 without any catastrophe that wipes out more than 90% of humanity?

 

Could we have this question be phrased using no negations instead of two? Something like "What is the probability that there will be a global catastrophe that wipes out 90% or more of humanity before 2100."

7Screwtape
Argh, I hate tweaking historical questions. This seems equivalent, so let's try it. It wound up phrased that way from trying to make a minimal change from the historical version of the question, where the question and the title were at odds.
Sodium810

Thanks for writing these posts Zvi <3 I've found them to be quite helpful.

Sodium10

Hi Clovis! Something that comes to mind is Zvi's dating roundup posts in case you haven't seen them yet. 

Sodium-32

I think people see it and think "oh boy I get to be the fat people in Wall-E"

(My friend on what happens if the general public feels the AGI)

Sodium50

This chapter on AI follows immediately after the year in review. I went and checked the previous few years' annual reports to see what the comparable chapters were about; they are:

2023: China's Efforts To Subvert Norms and Exploit Open Societies

2022: CCP Decision-Making and Xi Jinping's Centralization Of Authority

2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party's Ambitions and Challenges at its Centennial)

2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China's View of Strategic Competiti... (read more)

7Orpheus16
I think we're seeing more interest in AI, but I think interest in "AI in general" and "AI through the lens of great power competition with China" has vastly outpaced interest in "AI safety". (Especially if we're using a narrow definition of AI safety; note that people in DC often use the term "AI safety" to refer to a much broader set of concerns than AGI safety/misalignment concerns.) I do think there's some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there's still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
Sodium30

I think[1] people[2] probably trust individual tweets way more than they should. 

Like, just because someone sounds very official and serious, and it's a piece of information that's inline with your worldviews, doesn't mean it's actually true. Or maybe it is true, but missing important context. Or it's saying A causes B when it's more like A and C and D all cause B together, and actually most of the effect is from C but now you're laser focused on A. 
 

Also you should be wary that the tweets you're seeing are optimized for piquing th... (read more)

Sodium20

Sorry, is there a specific timezone for the application deadline, or is it AoE?

1Remmelt
Fair question. You can assume it is AoE. Research leads are not going to be too picky in terms of what hour you send the application in. There is no need to worry about the exact deadline. Even if you send in your application on the next day, that probably won't significantly impact your chances of getting picked up by your desired project(s). Sooner is better, since many research leads will begin composing their teams after the 17th, but there is no hard cut-off point.
Sodium35

Man, politics really is the mind killer

Sodium30

I think knowing the karma and agreement is useful, especially to help me decide how much attention to pay to a piece of content, and I don't think there's that much distortion from knowing what others think. (i.e., overall benefits>costs)

2Nathan Helm-Burger
I'm not saying you shouldn't be able to see the karma and agreement at the top, just that you should only be able to contribute your own opinion at the bottom, after reading and judging for yourself.
SodiumΩ010

Thanks for putting this up! Just to double check—there aren't any restrictions against doing multiple AISC projects at the same time, right?

4Linda Linsefors
Yes there are, sort of... You can apply to as many projects as you want, but you can only join one team. The reason for this is: when we've let people join more than one team in the past, they usually end up not having time for both and drop out of one of the projects. What this actually means: when you join a team you're making a promise to spend 10 or more hours per week on that project. When we say you're only allowed to join one team, what we're saying is that you're only allowed to make this promise to one project. However, you are allowed to help out other teams with their projects, even if you're not officially on the team.
2Ronny Fernandez
There is! It is now posted! Sorry about the delay.
Sodium20

Wait a minute, "agentic" isn't a real word? It's not on dictionary.com or Merriam-Webster or Oxford English Dictionary.

A word has to be real already to get into a dictionary.

4cubefox
Wiktionary entry
2niplav
I think normally "agile" would fulfill the same function (per its etymology), but it's very entangled with agile software engineering.
Sodium30

I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it's probably better to leave that out and try to do some more empirical work before making a claim there, though (I suppose you could say that the hypothesis isn't actually making a lot of concrete predictions yet at this stage).

I don't think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that "we can understand neural network computation by doing mech interp."

Sodium30

Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to understand that many heuristics at the same time (which is the case for your logic gate example for modern computers). At the minimum, I woul... (read more)

2Jeremy Gillen
I think the problem might be that you've given this definition of heuristic: Taking this definition seriously, it's easy to decompose a forward pass into such functions. But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific words. Some options:
A single heuristic is causally dependent on <5 heuristics below and influences <5 heuristics above.
The inputs and outputs of heuristics are strong information bottlenecks with a limit of 30 bits.
The function of a heuristic can be understood without reference to >4 other heuristics in the same layer.
A single heuristic is used in <5 different ways across the data distribution.
A model is made up of <50 layers of heuristics.
Large arrays of parallel heuristics often output information of the same type.
Some combination of these (or similar properties) would turn the heuristics intuition into a real hypothesis capable of making predictions. If you don't go into this level of detail, it's easy to trick yourself into thinking that (2) basically kinda follows from your definition of heuristics, when it really really doesn't. And that will lead you to never discover the value of the heuristics intuition, if it is true, and never reject it if it is false.
Sodium30

I think there's something wrong with the link :/ It was working fine earlier but seems to be down now

2Ben Pace
How annoying. Something about the link must expire. Anyhow, you can just go to lesswrong.com/donate and click through there to pay.
Sodium10

I think those sound right to me. It still feels like prompts with weird suffixes obtained through greedy coordinate search (or other jailbreaking methods like h4rm3l) are good examples of "model does thing for anomalous reasons."

Sodium85

You could also use \text{}

3cubefox
No LaTeX: Rana Dexsin
Plain LaTeX: RanaDexsin
"\text{}": Rana Dexsin
"\mathrm{}": RanaDexsin
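For reference, a minimal sketch of why these render differently (assuming amsmath is loaded for \text{}):

```latex
% assumes \usepackage{amsmath} for \text{}
$Rana Dexsin$          % plain math mode: italic letters, space dropped -> RanaDexsin
$\text{Rana Dexsin}$   % text font, space preserved -> Rana Dexsin
$\mathrm{Rana Dexsin}$ % upright roman, but math mode still drops the space -> RanaDexsin
```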
Sodium33

since people often treat heuristics as meaning that it doesn't generalize at all.

Yeah and I think that's a big issue! I feel like what's happening is that once you chain a huge number of heuristics together you can get behaviors that look a lot like complex reasoning. 

Sodium32

I see, I think that second tweet thread actually made a lot more sense, thanks for sharing!
McCoy's definitions of heuristics and reasoning are sensible, although I personally would still avoid "reasoning" as a word, since people probably have very different interpretations of what it means. I like the ideas of "memorizing solutions" and "generalizing solutions."

I think where McCoy and I depart is that he's modeling the entire network computation as a heuristic, while I'm modeling the network as compositions of bags of heuristics, which in aggregate would dis... (read more)

4Noosphere89
Now I understand. Though I'd still claim that this is evidence towards the view that there is a generalizing solution that is implemented inside of LLMs, and I wanted people to keep that in mind, since people often treat heuristics as meaning that it doesn't generalize at all.
Sodium51

Yeah that's true. I meant this more as "Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk."

Sodium10

Thanks for the pointer! I skimmed the paper. Unless I'm making a major mistake in interpreting the results, the evidence they provide for "this model reasons" is essentially "the models are better at decoding words encrypted with rot-5 than they are at rot-10." I don't think this empirical fact provides much evidence one way or another.

To summarize, the authors decompose a model's ability to decode shift ciphers (e.g., rot-13 text: "fgnl", original text: "stay") into three categories: probability, memorization, and noisy reasoning.

Probability just ref... (read more)
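For concreteness, the shift-cipher task itself is easy to state in code; here is a minimal sketch of rot-N decoding (my own illustration, not code from the paper):

```python
import string

def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift (rot-N) cipher by shifting each letter back by `shift`."""
    lower = string.ascii_lowercase
    decoded = []
    for ch in ciphertext:
        if ch in lower:
            decoded.append(lower[(lower.index(ch) - shift) % 26])
        else:
            decoded.append(ch)  # leave spaces/punctuation untouched
    return "".join(decoded)

print(shift_decode("fgnl", 13))  # -> "stay" (rot-13, the most common shift online)
print(shift_decode("tubz", 1))   # -> "stay" (rot-1, the simplest shift)
```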

2Noosphere89
True that it isn't much evidence for reasoning directly, as it's only 1 task. As for how we can jump from the empirical result to claims about its ability to reason: the shift cipher task lets us disentangle commonness and simplicity. A bag of heuristics that has no uniform and compact description works best on common example types, whereas the algorithmic reasoning that I defined below works better on simpler tasks. The simplest shift cipher is the 1-shift cipher, while a bag-of-heuristics model, which predicts that LLMs are completely or primarily learning shallow heuristics, would work best on 13-shift ciphers, as that's the most common. The paper shows that there is a spike in 13-shift cipher accuracy, consistent with LLMs having some heuristics, but also that the 1-shift cipher accuracy was much better than expected under the view that LLMs are solely or primarily a bag of heuristics that couldn't be improved by CoT. I'm defining reasoning more formally in the quote below: This comment is where I got the quote from: https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1#Bg5s8ujitFvfXuop8 This thread has an explanation of why we can disentangle noisy reasoning from heuristics, as I'm defining the terms here, so go check that out below: https://x.com/RTomMcCoy/status/1843325666231755174
Sodium21

I think it's mostly because he's well known and has (especially after the Nobel prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else.

If you watch his Q&A at METR, he says something along the lines of "I want to retire and don't plan on doing AI safety research. I do outreach and media appearances because I think it's the best way I can help (and because I like seeing myself on TV)." 

And he's continuing to do that. The only real topic he discussed in his first phone interv... (read more)

2Cleo Nardo
Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.
Sodium*30

I like this research direction! Here's a potential benchmark for MAD.

In Coercing LLMs to do and reveal (almost) anything, the authors demonstrate that you can force LLMs to output any arbitrary string—such as a random string of numbers—by finding a prompt through greedy coordinate search (the same method used in the universal and transferable adversarial attack paper). I think it's reasonable to assume that these coerced outputs result from an anomalous computational process.

Inspired by this, we can consider two different inputs; the regular one looks som... (read more)

2Erik Jenner
Yeah, seems right that these adversarial prompt should be detectable as mechanistically anomalous---it does intuitively seem like a different reason for the output, given that it doesn't vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don't understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).
Sodium30

I'd imagine that RSP proponents think that if we execute them properly, we will simply not build dangerous models beyond our control, period. If progress were faster than what labs could handle after pausing, RSPs should imply that you'd just pause again. On the other hand, there's no clear criterion for when we would pause again after, say, a six-month pause in scaling.

Now whether this would happen in practice is perhaps a different question.

4DanielFilan
I think pause proponents think similarly!
Sodium20

I really liked the domesticating evolution section, cool paper!

Sodium40

That was the SHA-256 hash for:

What if a bag of heuristics is all there is and a bag of heuristics is all we need? That is, (1) we can decompose each forward pass in current models into a set of heuristics chained together and (2) heauristics chained together is all we need for agi

Here's my full post on the subject
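For reference, a minimal sketch of how such a hash commitment can be checked; the digest depends on the exact original string (whitespace, casing, punctuation), so the truncated statement below is illustrative only and would not reproduce the posted hash:

```python
import hashlib

# Illustrative only: the committed statement must be reproduced exactly
# (including whitespace and casing) for the digest to match the posted hash.
statement = (
    "What if a bag of heuristics is all there is and a bag of heuristics is "
    "all we need? ..."
)
digest = hashlib.sha256(statement.encode("utf-8")).hexdigest()
print(digest)  # compare against the previously posted SHA-256 hash
```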
