All of Logan Riggs's Comments + Replies

That does clarify a lot of things for me, thanks!

Looking at your posts, there are no hooks or attempts to sell your work, which is a shame because LSRDRs seem useful. Since they are useful, you should be able to show it.

For example, you trained an LSRDR for text embedding, which you could show at the beginning of the post. Then show the cool properties of pseudo-determinism & lack of noise compared to NN's. THEN all the maths. That way the math folks know if the post is worth their time, and the non-math folks can upvote and share with their mathy friends.... (read more)

1Joseph Van Name
I would have thought that a fitness function that is maximized using something other than gradient ascent and which can solve NP-complete problems at least in the average case would be worth reading, since that means that it can perform well on some tasks but it also behaves mathematically in a way that is needed for interpretability.

The quality of the content is inversely proportional to the number of views since people don't think the same way as I do (Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs). Stuff that is popular is usually garbage. But here is my post about the word embedding: Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima — LessWrong.

And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst. Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.

Is the LSRDR a proposed alternative to NN’s in general?

What interpretability do you gain from it?

Could you show a comparison between a transformer embedding and your method with both performance and interpretability? Even MNIST would be useful.

Also, I found it very difficult to understand your post (Eg you didn’t explain your acronym! I had to infer it). You can use the “request feedback” feature on LW in the future; they typically give feedback quite quickly.

8Joseph Van Name
In this post, the existence of a non-gradient based algorithm for computing LSRDRs is a sign that LSRDRs behave mathematically and are quite interpretable. Gradient ascent is a general purpose optimization algorithm that works in the case when there is no other way to solve the optimization problem, but when there are multiple ways of obtaining a solution to an optimization problem, the optimization problem is behaving in a way that should be appealing to mathematicians. LSRDRs and similar algorithms are pseudodeterministic in the sense that if we train the model multiple times on the same data, we typically get identical models. Pseudodeterminism is a signal of interpretability for several reasons that I will go into in more detail in a future post:

1. Pseudodeterministic models do not contain any extra random or even pseudorandom information that is not contained in the training data already. This means that when interpreting these models, one does not have to interpret random information.
2. Pseudodeterministic models inherit the symmetry of their training data. For example, if we train a real LSRDR using real symmetric matrices, then the projection P will itself be a symmetric matrix.
3. In mathematics, a well-posed problem is a problem where there exists a unique solution to the problem. Well-posed problems behave better than ill-posed problems in the sense that it is easier to prove results about well-posed problems than it is to prove results about ill-posed problems.

In addition to pseudodeterminism, in my experience, LSRDRs are quite interpretable since I have interpreted LSRDRs already in a few posts: Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices — LessWrong; When performing a dimensionality reduction on tensors, the trace is often zero. — LessWrong. I have Generalized LSRDRs so that they are starting to behave like deeper neural networks. I am trying to expand the capabilities

Gut reaction is “nope!”.

Could you spell out the implication?

Correct! I did mean to communicate that in the first footnote. I agree valuing the unborn would drastically lower the amount of acceptable risk reduction.

Note that unborn people are merely potential, as their existence depends on our choices. Future generations aren't guaranteed—we decide whether or not they will exist, particularly those who might be born decades or centuries from now. This makes their moral status far less clear than someone who already exists or who is certain to exist at some point regardless of our choices.

Additionally, if we decide to account for the value of future beings, we might consider both potential human people and future AI entities capable of having moral value. From a utili... (read more)

I agree w/ your general point, but think your specific example isn't considering the counterfactual. The possible choices aren't usually: 

A. 50/50% chance of death/utopia
B. 100% chance of normal life

If a terminally ill patient would die next year with 100% certainty, then choice (A) makes sense! Most people aren't terminally ill patients though. In expectation, 1% of the people you know will die every year (w/ skewing towards older people). So a 50/50 chance of death vs utopia shouldn't be preferred by most people, & they should accept a delay of 1 year of utopia for >1% red... (read more)

AFAIK, I have similar values[1] but lean differently.

~1% of the world dies every year. If we accelerate AGI by 1 year, we save 1%. Push it back 1 year, we lose 1%. So, pushing back 1 year is only worth it if we reduce P(doom) by 1%.

This means your P(doom) given our current trajectory very much matters. If your P(doom) is <1%, then pushing back a year isn't worth it.
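A rough back-of-the-envelope version of this arithmetic (a toy sketch under the same assumptions, treating doom as everyone dying and ignoring people not yet born):

```python
# Toy expected-value check: a 1-year delay costs ~1% of the world (one year of
# background deaths that AGI-driven medicine might have prevented), and is only
# worth it if it buys at least that much reduction in P(doom).
def net_fraction_saved(p_doom_reduction: float, delay_years: float,
                       annual_death_rate: float = 0.01) -> float:
    cost = annual_death_rate * delay_years   # expected fraction lost by waiting
    benefit = p_doom_reduction               # fraction saved in doom-averted worlds
    return benefit - cost

print(net_fraction_saved(p_doom_reduction=0.02, delay_years=1))    # +0.010 -> delay worth it
print(net_fraction_saved(p_doom_reduction=0.005, delay_years=1))   # -0.005 -> not worth it
```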

The expected change conditioning on accelerating also matters. If accelerating by 1 year increases e.g. global tensions, raising the chance of a war between nuclear states by X% w/ an expec... (read more)

3David Patterson
Many of these arguments seem pathological when applied to an individual.  I have a friend,  let's call her B, she has a 6 year old daughter A.  She of course adores her daughter. If I walked up to B and said "I'm going to inject this syringe into your daughter.  There's a 10% chance it'll kill her, and a 50% chance it'll extend her natural lifetime to 200." Then I jab A. EV on A's life expectancy is strongly positive.  B (and almost everybody) would be very upset if I did this. I'm upset with accelerationists for the same reasons.
4Yair Halberstadt
That would imply that if you could flip a switch which 90% chance kills everyone, 10% chance grants immortality then (assuming there weren't any alternative paths to immortality) you would take it. Is that correct?

For me, I'm at ~10% P(doom). Whether I'd accept a proposed slowdown depends on how much I expect it to decrease this number.[2] 

How do you model this situation? (also curious on your numbers)

I put the probability that AI will directly cause humanity to go extinct within the next 30 years at roughly 4%. By contrast, over the next 10,000 years, my p(doom) is substantially higher, as humanity could vanish for many different possible reasons, and forecasting that far ahead is almost impossible. I think a pause in AI development matters most for reducing the ... (read more)

So, pushing back 1 year is only worth it if we reduce P(doom) by 1%.

Only if you don't care at all about people who aren't yet born. I'm assuming that's your position, but you didn't state it as one of your two assumptions and I think it's an important one.

The answer also changes if you believe nonhumans are moral patients, but it's not clear which direction it changes.

"focus should no longer be put into SAEs...?"

I think we should still invest research into them BUT it depends on the research. 

Less interesting research:

1. Applying SAEs to [X-model/field] (or Y-problem w/o any baselines)

More interesting research:

  1. Problems w/ SAEs & possible solutions
    1. Feature suppression (solved by post-training, gated-SAEs, & top-k; see the sketch below)
    2. Feature absorption (possibly solved by Matryoshka SAEs)
    3. SAEs don't find the same features across seeds (maybe solved by constraining latents to the convex hull of the data)
    4. Dark-matter of SAEs (nothing
... (read more)
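For concreteness, here's a minimal sketch of the top-k idea mentioned in item 1 above (my own toy PyTorch version, not any particular paper's implementation). Because only the k largest latents are kept and there's no L1 penalty, activation values aren't shrunk toward zero, which is the feature-suppression problem:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal top-k sparse autoencoder sketch."""
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        # Encode, then keep only the k largest latents per example.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

# Train with plain MSE; no L1 term is needed for sparsity.
sae = TopKSAE(d_model=768, d_sae=768 * 16, k=32)
x = torch.randn(8, 768)            # stand-in for residual-stream activations
recon, acts = sae(x)
loss = ((recon - x) ** 2).mean()
```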

Just on the Dallas example, look at this +8x & -2x below
 

So they 8x'd all features in the China super-node and multiplied the Texas supernode (Texas is "under" China, meaning it's being "replaced") by -2x. That's really weird! It should be multiplying the Texas node by 0. If Texas is upweighting "Austin", then -2x-ing it could be downweighting "Austin", leading to cleaner top-output results. Notice how all the graphs have different numbers for upweighting & downweighting (which is good that they include that scalar in the images). This means the SA... (read more)

You can learn a per-token bias over all the layers to understand where in the model it stops representing the original embedding (or a linear transformation of it) like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases


You could also plot the cos-sims of the resulting biases to see how much it rotates.
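A minimal sketch of one way to do this (assuming TransformerLens, a single prompt for illustration, and closed-form per-token biases, i.e. the mean residual vector per token id; the linked post trains them jointly with an SAE instead):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
_, cache = model.run_with_cache(tokens)

flat_tokens = tokens.flatten()
token_ids = flat_tokens.unique()

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer].reshape(-1, model.cfg.d_model)
    # Per-token bias = mean residual-stream vector over positions with that token.
    bias = torch.stack([resid[flat_tokens == t].mean(0) for t in token_ids])
    emb = model.W_E[token_ids]
    cos = torch.nn.functional.cosine_similarity(bias, emb, dim=-1).mean()
    print(f"layer {layer}: mean cos-sim(per-token bias, embedding) = {cos:.3f}")
```

Comparing consecutive layers' bias tables (instead of the raw embedding) shows where the biggest rotations happen, and fitting a linear map first would cover the "linear transformation of it" case.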

In the next two decades we're likely to reach longevity escape velocity: the point at which medicine can increase our healthy lifespans faster than we age.

I have the same belief and have thought about how bad it’d be if my loved ones died too soon.

Sorry for your loss.

Toilet paper is an example of a "self-fulfilling prophecy". It will run out because people believe it will run out, causing a toilet paper run (like a bank run).

2Viliam
Makes me feel confused about Economics 101. If people can easily predict that the toilet paper will run out, why aren't the prices increasing enough to prevent that?

Stores Already Have Empty Shelves.

Just saw two empty shelves (~toilet paper & peanut butter) at my local grocery store. Curious how to prepare for this? Currently we've stocked up on:
1. toilet paper
2. feminine products
3. Canned soup (that we like & already eat)

Additionally any models to understand the situation? For example:

  1. Self-fulfilling prophecy - Other people will believe that toilet paper will run out, so it will
  2. Time for Sea Freights to get to US from China is 20-40 days, so even if the trade war ends, it will still take ~ a month for things t
... (read more)
4Jordan Taylor
Isn't toilet paper almost always produced domestically? It takes up a lot of space compared to its value, so it's inefficient to transport. Potato chips are similar.
Logan RiggsΩ6134

I've had this position since 2022, but this past year I've been very surprised and impressed by just how good black box methods can be, e.g. the control agenda, Owain Evans's work, Anthropic's (& others' I'm probably forgetting). 

How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn't just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%?

If you understand 99.9% of the model, then you can just run your u... (read more)

8Neel Nanda
I disagree re the way we currently use understand - eg I think that SAE reconstructions have the potential to smuggle in lots of things via EG the exact values of the continuous activations, latents that don't quite mean what we think, etc. It's plausible that a future and stricter definition of understand fixes this though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tail of heuristics and I don't know what may emerge from combining many things that individually make sense. And I probably put >0.1% that a super intelligence could adversarially smuggle things we don't like into a system we don't think we understand. Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!
Buck*Ω13239

I agree in principle, but as far as I know, no interp explanation that has been produced explains more than ~20-50% of the (tiny) parts of the model it's trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.

I think you've renamed the post which changed the url:
https://max.v3rv.com/papers/interpreting_complexity
is now the correct one AFAIK & the original link at the top (https://max.v3rv.com/papers/circuits_and_memorization )
is broken.
 

I too agreed w/ Chanind initially, but I think I see where Lucius is coming from. 

If we forget about a basis & focus on minimal description length (MDL), it'd be nice to have a technique that found the MDL [features/components] for each datapoint. e.g. in my comment, I have 4 animals (bunny, etc) & two properties (cute, furry). For MDL reasons, it'd be great to sometimes use cute/furry & sometimes use Bunny if that reflects model computation more simply. 

If you have both attributes & animals as fundamental units (and somehow hav... (read more)

I think you're saying:

Sometimes it's simpler (fewer edges) to use the attributes (Cute) or animals (Bunny) or both (e.g. a particularly cute bunny). Assumption 3 doesn't allow mixing different bases together.

So here we have 2 attributes (for $\vec{f}'_i$) & 4 animals (for $\vec{f}_i$).

If the downstream circuit (let's assume a linear + ReLU) reads from the "Cute" direction then:
1. If we are only using $\vec{f}_i$ (animals): Bunny + Dolphin (interpretable, but add 100 more animals & it'll take a lot more work to interpret)
2. If we are only using ... (read more)

2Lucius Bushnaq
I wouldn't even be too fussed about 'horribly convoluted' here. I'm saying it's worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.

The problem here isn't that we can't make a dictionary that includes all the 1050 feature directions $\vec{f}$ as dictionary elements. We can do that. For example, while we can't write $\vec{a}(x)=\sum_{i=1}^{1000}c_i(x)\vec{f}_i+\sum_{i=1}^{50}c'_i(x)\vec{f}'_i$ because those sums each already equal $\vec{a}(x)$ on their own, we can write $\vec{a}(x)=\sum_{i=1}^{1000}\frac{c_i(x)}{2}\vec{f}_i+\sum_{i=1}^{50}\frac{c'_i(x)}{2}\vec{f}'_i$.

The problem is instead that we can't make a dictionary that has the 1050 feature activations $c_i(x), c'_i(x)$ as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model's own circuits actually care about. They cannot equal the 'features of the model' in the sense defined at the start, the scalar features comprising its ontology.

As a result, if we were to look at a causal graph of the model, using the 1050 half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. Same the other way around: any circuit reading in even a single attribute feature will have 1000 edges connecting to all of the animal features[1], making up 50% of the total contribution. It's the worst of both worlds. Every circuit looks like a mess now.

1. ^ Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point. Whichever ones happen to be active at the time.
Logan RiggsΩ682

A weird example of this is on page 33 (full transcript pasted farther down) 

tl;dr: It found a great general solution for speeding up some code on specific hardware, tried to improve more, resorted to edge cases which did worse, and submitted a worse version (forgetting the initial solution).

This complicates the reward hacking picture because it had a better solution that got better reward than special-casing yet it still resorted to special-casing. Did it just forget the earlier solution? Feels more like a contextually activated heuristic to special-c... (read more)

I agree. There is a tradeoff here for the L0/MSE curve & circuit-simplicity.

I guess another problem (w/ SAEs in general) is that optimizing for L0 leads to feature absorption. However, I'm unsure of a metric (other than L0/MSE) that does capture what we want.

Hey Armaan! Here's a paper where they instead used an MLP in the beginning w/ similar results (looking at your code, it seems by "dense" layer, you also mean a nonlinearity, which seems equivalent to the MLP one?)

How many tokens did you train yours on?

Have you tried ablations on the dense layers, such as only having the input one vs the output one? I know you have some tied embeddings for both, but I'm unsure if the better results are for the output or input. 

For both of these, it does complicate circuits because you'll have features combining nonline... (read more)

4Armaan A. Abraham
Ah, I was unaware of that paper and it is indeed relevant to this, thank you! Yes, by "dense" or "non-sparse" layer, I mean a nonlinearity. So, that paper's MLP SAE is similar to what I do here, except it is missing MLPs in the decoder.

Early on, I experimented with such an architecture with encoder-only MLPs, because (1) as to your final point, the lack of nonlinearity in the output potentially helps it fit into other analyses and (2) it seemed much more likely to me to exhibit monosemantic features than an SAE with MLPs in the decoder too. But, after seeing some evidence that its dead neuron problems reacted differently to model ablations than both the shallow SAE and the deep SAE with encoder+decoder MLPs, I decided to temporarily drop it. I figured that if I found that the encoder+decoder MLP SAE features were interpretable, this would be a more surprising/interesting result than the encoder-only MLP SAE and I would run with it, and if not, I would move to the encoder-only MLP SAE.

I trained on 7.5e9 tokens. As I mentioned in my response to your first question, I did experiment early on with the encoder-only MLP, but the architectures in this post are the only ones I looked at in depth for GPT2.

This is a good point, and I should have probably included this in the original post. As you said, one of the major limitations of this approach is that the added nonlinearities obscure the relationship between deep SAE features and upstream / downstream mechanisms of the model. In the scenario where adding more layers to SAEs is actually useful, I think we would be giving up on this microscopic analysis, but also that this might be okay. For example, we can still examine where features activate and generate/verify human explanations for them. And the idea is that the extra layers would produce features that are increasingly meaningful/useful for this type of analysis.

How do you work w/ sleep consolidation?

Sleep consolidation/ "sleeping on it" is when you struggle w/ [learning a piano piece], sleep on it, and then you're suddenly much better at it the next day!

This has happened to me for piano, dance, math concepts, video games, & rock climbing, but it varies in effectiveness. Why? Is it:

  1. Duration of struggling activity
  2. Amount of attention paid to activity
  3. Having a frustrating experience
  4. Time of day (e.g. right before sleep)

My current guess is a mix of all four. But I'm unsure if you [practice piano] in the morning, you... (read more)

1CstineSublime
If I'm playing anagrams or Scrabble after going to a church, and I get the letters "ODG", I'm going to be predisposed towards a different answer than if I've been playing with a German Shepherd. I suspect sleep has very little to do with it, and simply coming at something with a fresh load of biases on a different day with different cues and environmental factors may be a larger part of it. Although Marvin Minsky made a good point about the myth of introspection: we are only aware of a thin sliver of our active mental processes at any given moment. When you intensely focus on a maths problem or practice the piano for a protracted period of time, some parts of the brain working on that may not abandon it just because your awareness or your attention drifts somewhere else. This wouldn't just be during sleep, but while you're having a conversation with your friend about the game last night, or cooking dinner, or exercising. You're just not aware of it, it's not in the limelight of your mind, but it still plugs away at it. In my personal experience, most Eureka moments are directly attributable to some irrelevant thing that I recently saw that shifted my framing of the problem, much like my anagram example.
3philip_b
I think the way to learn any skill is to basically: 1. Practice it 2. Sleep 3. Goto 1 And the time spent in each iteration of item 1 is capped in usefulness or at least has diminishing returns. I think this has nothing to do with frustration. Also, I think reminding yourself of the experience is not that important and I think there is no cap of 1 thing a day.
2[anonymous]
Oh, I've thought a lot about something similar that I call "background processing" - I think it happens during sleep, but also when awake. I think for me it works better when something is salient to my mind / my mind cares about it. According to this theory, if I was being forced to learn music theory but really wanted to think about video games, I'd get less new ideas about music theory from background processing, and maybe it'd be less entered into my long term memory from sleep. I'm not sure how this effects more 'automatic' ('muscle memory') things (like playing the piano correctly in response to reading sheet music). I'm not sure about this either. It could also be formulated as there being some set amount of consolidation you do each night, and you can divide them between topics, but it's theoretically (disregarding other factors like motivation; not practical advice) most efficient if you do one area per day (because of stuff in the same topic having more potential to relate to each other and be efficiently compressed or generalized from or something. Alternatively, studying multiple different areas in a day could lead to creative generalization between them).

Huh, those brain stimulation methods might actually be practical to use now, thanks for mentioning them! 

Regarding skepticism of survey-data: If you're imagining it's only an end-of-the-retreat survey which asks "did you experience the jhana?", then yeah, I'll be skeptical too. But my understanding is that everyone has several meetings w/ instructors where a not-true-jhana/social-lie wouldn't hold up against scrutiny. 

I can ask during my online retreat w/ them in a couple months.

4niplav
As for brain stimulation, TMS devices can be bought for <$10k from ebay. tDCS devices are available for ~$100, though I don't expect them to have large effect sizes in any direction. There have been noises about consumer-level tFUS devices for <$10k, but that's likely >5 years in the future. The incentives of the people running jhourney are to over-claim attainments, especially on edge-cases, and hype the retreats. Organizations can be sufficiently on guard to prevent the extreme forms of over-claiming & turning into a positive-reviews-factory, but I haven't seen people from jhourney talk about it (or take action that shows they're aware of the problem).

Implications of a Brain Scan Revolution

Suppose we were able to gather large amounts of brain scans, let's say w/ millions of wearable helmets w/ video and audio as well,[1] then what could we do with that? I'm assuming a similar pre-training stage where models are used to predict next brain-states (possibly also video and audio), and then can be finetuned or prompted for specific purposes.

Jhana helmets

Jhana is a non-addicting high pleasure state. If we can scan people entering this state, we might drastically reduce the time it takes to learn to enter ... (read more)

4niplav
For pleasure/insight helmets you probably need intervention in the form of brain stimulation (tDCS, tFUS, TMS). Biofeedback might help but you need to at least know where to steer towards. I'm pretty skeptical of those numbers; all existing projects I know of don't have a better method of measurement other than surveys, and that gets bitten hard by social desirability bias/not wanting to have committed a sunk cost. Seems relevant that jhourney isn't doing much EEG & biofeedback anymore.

it seems unlikely to me that so many talented people went astray

Well, maybe we did go astray, but it's not for any reasons mentioned in this paper!

SAEs have been trained on random weights since Anthropic's first SAE paper in 2023:

To assess the effect of dataset correlations on the interpretability of feature activations, we run dictionary learning on a version of our one-layer model with random weights. 28 The resulting features are here, and contain many single-token features (such as "span", "file", ".", and "nature") and some other features firing on seeming

... (read more)

I didn't either, but on reflection it is! 

I did change the post based off your comment, so thanks!

I think the fuller context,

Anthropic has put WAY more effort into safety, way way more effort into making sure there are really high standards for safety and that there isn't going to be danger what these AIs are doing

implies it's just that the amount of effort is larger than at other companies (which I agree with), and not that the Youtuber believes they've solved alignment or are doing enough, see: 

but he's also a realist and is like "AI is going to really potentially fuck up our world"

and

But he's very realistic. There is a lot of bad shit that is going to happ

... (read more)
1dabbing.
Nevermind, I didn't think it was a requirement
Logan RiggsΩ240

Thinking through it more, Sox2-17 (they changed 17 amino acids in the Sox2 gene) was your linked paper's result, and Retro's was a modified version of factors Sox AND KLF. Would be cool if these two results are complementary.

Logan RiggsΩ240

You're right! Thanks
For Mice, up to 77% 

Sox2-17 enhanced episomal OKS MEF reprogramming by a striking 150 times, giving rise to high-quality miPSCs that could generate all-iPSC mice with up to 77% efficiency

For human cells, up to 9%  (if I'm understanding this part correctly).
 

SOX2-17 gave rise to 56 times more TRA1-60+ colonies compared with WT-SOX2: 8.9% versus 0.16% overall reprogramming efficiency.

So it seems like the efficiency can be wildly different depending on the setting (mice, humans, bovine, etc), and I don't know what the Retro folks were doing, but it does make their result less impressive. 

4TsviBT
(Still impressive and interesting of course, just not literally SOTA.)

You're actually right that this is due to meditation for me. AFAIK, it's not synesthesia-esque though (ie I'm not causing there to be two qualia now), more like the distinction between mental-qualia and bodily-qualia doesn't seem meaningful upon inspection. 

So I believe it's a semantic issue, and I really mean "confusion is qualia you can notice and act on" (though I agree I'm using "bodily" in non-standard ways and should stop when communicating w/ non-meditators).

This is great feedback, thanks! I added another example based off what you said.

Regarding how obvious the first one is: at least two folks I asked (not from this community) didn't think it was a baby initially (though one is a non-native English speaker and didn't know "2 birds of a feather" and assumed "our company" meant "the singers and their partner"). Neither are parents. 

I did select these because they caused confusion in myself when I heard/saw them years ago, but they were "in the wild" instead of in a post on noticing confusion.

I did want a post I could link [non rationalist friends] to that's a more fun intro to noticing confusion, so more regular members might not benefit!

Logan Riggs*Ω350

For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell.

This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells).

 You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then i... (read more)

TsviBTΩ3110

According to the article, SOTA was <1% of cells converted into iPSCs

I don't think that's right, see https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(23)00402-2

A trending youtube video w/ 500k views in a day brings up Dario Amodei's Machines of Loving Grace (Timestamp for the quote):
[Note: I had Claude help format, but personally verified the text's faithfulness]

I am an AI optimist. I think our world will be better because of AI. One of the best expressions of that I've seen is this blog post by Dario Amodei, who is the CEO of Anthropic, one of the biggest AI companies. I would really recommend reading this - it's one of the more interesting articles and arguments I have read. He's basically saying AI is going to

... (read more)
9habryka
Ah yes, a great description of Anthropic's safety actions. I don't think anyone serious at Anthropic believes that they "made sure there isn't going to be danger from these AIs are doing". Indeed, many (most?) of their safety people assign double-digit probabilities to catastrophic outcomes from advanced AI systems. I do think this was a predictable, quite bad consequence of Dario's essay (as well as his other essays, which heavily downplay or completely omit any discussion of risks). My guess is it will majorly contribute to reckless racing while giving people a false impression of how well we are doing on actually making things safe.

Hey Midius!

My recommended rationality habit is noticing confusion, by which I mean a specific mental feeling that's usually quick & subtle & easy to ignore.

David Chapman has a more wooey version called Eating Your Shadow, which was very helpful for me since it pointed me towards acknowledging parts of my experience that I was denying due to identity & social reasons (hence the easy to ignore part).

Could you go into more detail on what skills these advisers would have or what situations they'd need to navigate? 

Because I'm baking in the "superhuman in coding/maths" due to the structure of those tasks, and other tasks can either be improved through:
1. General capabilities
2. Specific task 

And there might be ways to differentially accelerate that capability.

4Chris_Leong
I don't exactly know the most important capabilities yet, but things like, advising on strategic decisions, improving co-ordination and non-manipulative communication seem important.

I really appreciate your post and all the links! This and your other recent posts/comments have really helped make a clearer model of timelines. 

In my experience, most of the general public will verbally agree that AI X-risk is a big deal, but then go about their day (cause reasonably, they have no power). There's no obvious social role/action to do in response to that.

For climate, people understand that they should recycle, not keep the water running, and if there's a way to donate to clean the ocean on a Mr. Beast video, then some will even donate (sadly, none of these are very effective for solving the climate problem though! Gotta avoid that for our case).

Having a clear call-to-action seems rel... (read more)

Claude 3.5 seems to understand the spirit of the law when pursuing a goal X. 

A concern I have is that future training procedures will incentivize more consequential reasoning (because those get higher reward). This might be obvious or foreseeable, but could be missed/ignored under racing pressure or when lab's LLMs are implementing all the details of research.

Thanks! 

I forgot about faithful CoT and definitely think that should be a "Step 0". I'm also concerned here that AGI labs just won't do the reasonable things (e.g. training for briefness makes the CoT more steganographic). 

For Mech-interp, ya, we're currently bottlenecked by:

  1. Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
  2. Computing Attention_in --> Attention_out (Keith got the QK-circuit -> attention pattern working a while ago, but hasn't hooked it up w/ the OV-circuit; see the sketch below)
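For anyone following along, here's a rough numpy sketch of the decomposition being referred to (the standard QK/OV factorization from "A Mathematical Framework for Transformer Circuits", not Keith's actual code, and ignoring the causal mask):

```python
import numpy as np

d_model, d_head, n_pos = 64, 16, 10
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
X = rng.normal(size=(n_pos, d_model))           # residual-stream inputs

# QK-circuit: a d_model x d_model bilinear form scoring (destination, source) pairs.
QK = W_Q @ W_K.T
scores = X @ QK @ X.T / np.sqrt(d_head)         # attention logits
scores -= scores.max(-1, keepdims=True)         # numerically stable softmax
pattern = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

# OV-circuit: what each attended-to position writes back into the residual stream.
OV = W_V @ W_O                                  # d_model x d_model
head_out = pattern @ X @ OV                     # this head's output per position
```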

This is mostly a "reeling from o3"-post. If anyone is doom/anxiety-reading these posts, well, I've been doing that too! At least, we're in this together:)

From an apparent author on reddit:

[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style problems, 25% T3 = early researcher problems

The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions. 

I also have a couple of friends whose conversations require serious thinking (or being on my toes). I think it's because they have some model of how something works, and I say something that shows my lack of this model.

Additionally, programming causes this as well (in response to compilation errors, getting nonsense outputs, or code that runs too long). 

4Nathan Young
Yes, this is one reason I really like forecasting. I forces me to see if my thinking was bad and learn what good thinking looks like.

Was looking up Google Trends lines for ChatGPT and noticed a cyclical pattern:

Where the dips are weekends, meaning it's mostly used by people in the workweek. I mostly expect this is students using it for homework. This is substantiated by two other trends:
1. Dips in interest over winter and summer breaks (And Thanksgiving break in above chart)

2. "Humanize AI" which is 

Humanize AI™ is your go-to platform for seamlessly converting AI-generated text into authentic, undetectable, human-like content 

[Although note that overall interest in ChatGPT is W... (read more)

I’d guess that weekend dips come from office workers, since they rarely work on weekends, but students often do homework on weekends.

I was expecting this to include the output of MIRI for this year. Digging into your links we have:

Two Technical Governance Papers:
1. Mechanisms to verify international agreements about AI development
2. What AI evals for preventing catastrophic risks can and cannot do

Four Media pieces of Eliezer regarding AI risk:
1. Semafor piece
2. 1 hr talk w/ Panel 
3. PBS news hour
4. 4 hr video w/ Stephen Wolfram

Is this the full output for the year, or are there less linkable outputs such as engaging w/ policymakers on AI risks?

Harlan124

Hi, I’m part of the communications team at MIRI.

To address the object-level question: no, that’s not MIRI’s full public output for the year (but our public output for the year was quite small; more on that below). The links on the media page and research page are things that we put in the spotlight. We know the current website isn’t great for seeing all of our output, and we have plans to fix this. In the meantime, you can check out our newsletters, TGT’s new website, and a forthcoming post with more details about the media stuff we’ve ... (read more)

Donated $100. 

It was mostly due to LW2 that I decided to work on AI safety, actually, so thanks!

I've had the pleasure of interacting w/ the LW team quite a bit and they definitely embody the spirit of actually trying. Best of luck to y'all's endeavors!

I tried a similar experiment w/ Claude 3.5 Sonnet, where I asked it to come up w/ a secret word and in branching paths:
1. Asked directly for the word
2. Played 20 questions, and then guessed the word

In order to see if it does have a consistent word it can refer back to.

Branch 1: [screenshot omitted]

Branch 2: [screenshot omitted]

Which I just thought was funny.

Asking again, telling it about the experiment and how it's important for it to try to give consistent answers, it initially said "telescope" and then gave hints towards a paperclip.

Interesting to see when it flips it answers, though it's a sim... (read more)

It'd be important to cache the karma of all users > 1000 atm, in order to credibly signal you know which generals were part of the nuking/nuked side. Would anyone be willing to do that in the next 2 & 1/2 hours? (ie the earliest we could be nuked)

4Zach Stein-Perlman
The post says generals' names will be published tomorrow.

We could instead pre-commit to not engage with any nuker's future posts/comments (and at worst comment to encourage others to not engage) until end-of-year.

Or only include nit-picking comments.

aphyer135

During WWII, the OSS (the CIA's predecessor) produced and distributed an entire manual (well worth reading) about how workers could conduct deniable sabotage in the German-occupied territories.
 

(11) General Interference with Organizations and Production 

   (a) Organizations and Conferences

  1. Insist on doing everything through "channels." Never permit short-cuts to be taken in order to expedite decisions. 
  2. Make speeches, talk as frequently as possible and at great length.  Illustrate your points by long anecdotes and accounts of personal experiences. Neve
... (read more)

Could you dig into why you think it's great inter work?

But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.

This paragraph sounded like you're claiming LLMs do have concepts, but they're not in specific activations or weights, but distributed across them instead. 

But from your comment, you mean that LLMs themselves don't learn the true simple-compressed features of reality, but a ... (read more)

2tailcalled
A true feature of reality gets diminished into many small fragments. These fragments bifurcate into multiple groups, of which we will consider two groups, A and B. Group A gets collected and analysed by humans into human knowledge, which then again gets diminished into many small fragments, which we will call group C. Group B and group C make impacts on the network. Each fragment in group B and group C produces a shadow in the network, leading to there being many shadows distributed across activation space and weight space.

These many shadows form a channel which is highly reflective of the true feature of reality. That allows there to be simple useful ways to connect the LLM to the true feature of reality. However, the simplicity of the feature and its connection is not reflected in a simple representation of the feature within the network; instead the concept works as a result of the many independent shadows making way for it. The true features branch off from the sun (and the earth).

Why would you ignore the problem pointed out in footnote 1? It's a pretty important problem.

The one we checked last year was just Pythia-70M, which I don't expect to have a gender feature that generalizes to both pronouns and anisogamy.

But again, the task is next-token prediction. Do you expect e.g. GPT 4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?

3tailcalled
I guess to add, if I ask GPT-4o "What is the relationship between gender and anisogamy?", it answers: [GPT-4o's response omitted]. So clearly there is some kind of information about the relationship between gender and anisogamy within GPT-4o. The point of my post is that it is unlikely to be in the weight space or activation space.
2tailcalled
Next-token prediction, and more generally autoregressive modelling, is precisely the problem. It assumes that the world is such that the past determines the future, whereas really the less-diminished shapes the more-diminished ("the greater determines the lesser"). As I admitted in the post, it's plausible that future models will use different architectures where this is less of a problem.