All of TheManxLoiner's Comments + Replies

In Sakana AI's paper on AI Scientist v-2, they claim that the system is independent of human code. Based on a quick skim, I think this is wrong/deceptive. I wrote up my thoughts here: https://lovkush.substack.com/p/are-sakana-lying-about-the-independence

Main trigger was this line in the system prompt for idea generation: "Ensure that the proposal can be done starting from the provided codebase."

Vague thoughts/intuitions:

  • I think using the word "importance" is misleading, or at least makes it harder to reason about the connection between this toy scenario and real text data. In real comedy/drama, there are patterns in the data that let me/the model deduce whether it is comedy or drama, and hence allow me to focus on the conditionally important features.
  • Phrasing the task as follows helps me (a rough sketch is below): You will be given 20 random numbers x1 to x20. I want you to find projections that can recover x1 to x20. Half the time I will ignore your answers from x1 to x10 and the other half
... (read more)
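To make the reframed task in the second bullet concrete, here is a rough sketch of one way to implement it (the down-projection size, batch size, and mean-squared loss are my own illustrative choices, not from the original comment):

```python
import torch

# Illustrative setup (not from the original comment): 20 inputs, a learned
# linear map down to k dimensions and back up, and a loss where on each
# example only one half of the outputs is scored.
n, k = 20, 10
down = torch.nn.Linear(n, k, bias=False)
up = torch.nn.Linear(k, n, bias=False)
opt = torch.optim.Adam(list(down.parameters()) + list(up.parameters()), lr=1e-2)
half_mask = torch.cat([torch.ones(10), torch.zeros(10)])   # scores x1..x10 only

for step in range(1000):
    x = torch.rand(256, n)                     # 20 random numbers per example
    recon = up(down(x))                        # attempted recovery of x1..x20
    first_half = torch.rand(256, 1) < 0.5      # half the time score x1..x10,
    mask = torch.where(first_half, half_mask, 1 - half_mask)  # otherwise x11..x20
    loss = ((recon - x) ** 2 * mask).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In expectation the mask multiplies every feature's squared error by 0.5, so the training signal is the same as giving every feature a constant importance of 0.5 - which is one way to see the 'conditional importance doesn't matter' intuition.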
1james__p
Thanks for the thoughts --
  • I used the term "importance" since this was the term used in Anthropic's original paper. I agree that (unlike in a real model) my toy scenario doesn't contain sufficient information to deduce the context from the input data.
  • I like your phrasing of the task - it does a great job of concisely highlighting the 'Mathematical Intuition for why Conditional Importance "doesn't matter"'.
  • Interesting that the experiment was helpful for you!

there are features such as X_1 which are perfectly recovered

Just to check: in the toy scenario, we assume the features in R^n are the coordinates in the default basis, so we have n features X_1, ..., X_n.

 

Separately, do you have intuition for why they allow the network to learn b too? Why not set b to zero as well?

1james__p
Yes, that's correct. My understanding is that the bias is thought to be useful for two reasons:
  • It is preferable to be able to output a non-zero value for features the model chooses not to represent (namely their expected values).
  • Negative bias allows the model to zero out small interferences, by shifting the values negative such that the ReLU outputs zero.
I think empirically, when these toy models are exhibiting lots of superposition, the bias vector typically has many negative entries.
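As a small numerical illustration of the second bullet, here is a sketch of the usual toy-models-of-superposition forward pass x' = ReLU(W^T W x + b). The pentagon arrangement of features and the bias value -0.32 are hand-picked for illustration, not taken from the post:

```python
import numpy as np

# Toy model forward pass: x' = ReLU(W^T W x + b).
# Illustrative numbers: 5 features squeezed into 2 hidden dimensions,
# with the 5 feature directions spread evenly around a circle (a pentagon).
n_feat, n_hidden = 5, 2
angles = 2 * np.pi * np.arange(n_feat) / n_feat
W = np.stack([np.cos(angles), np.sin(angles)])        # shape (2, 5)

def forward(x, b):
    return np.maximum(W.T @ (W @ x) + b, 0.0)          # ReLU(W^T W x + b)

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])                # only feature 0 is active

print(forward(x, b=np.zeros(n_feat)).round(2))
# [1.   0.31 0.   0.   0.31]  <- positive interference leaks into features 1 and 4

print(forward(x, b=np.full(n_feat, -0.32)).round(2))
# [0.68 0.   0.   0.   0.  ]  <- a negative bias zeroes out the interference,
#                                at the cost of shrinking the active feature
```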

If you’d like to increase the probability of me writing up a “Concrete open problems in computational sparsity” LessWrong post

I'd like this!

I think this is missing from the list: the Whole Brain Architecture Initiative. https://wba-initiative.org/en/25057/

Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?

I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality

7Said Achmiz
Note that GreaterWrong has an anti-kibitzer mode.
2Dagon
I vote no.  An option for READERS to hide the names of posters/commenters might be nice, but an option to post something that you're unwilling to have a name on (not even your real name, just a tag with some history and karma) does not improve things.

What do we mean by $U - V$?

I think the setting is:

  • We have a true value function $V$.
  • We have a process to learn an estimate of $V$. We run this process once and we get $U$.
  • We then ask an AI system to act so as to maximize $U$ (its estimate of human values).

So in this context, $U - V$ is just a fixed function measuring the error between the learnt values and the true values.

I think the confusion could be from using the term $U$ to represent both a single instance and the random variable/process.
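A toy numerical sketch of the 'one draw of $U$ vs the whole random process' distinction (the value-learning process is stubbed out here as noisy interpolation of $V$, which is purely illustrative):

```python
import numpy as np

def V(x):
    """True value function (illustrative stand-in)."""
    return np.sin(x)

def learn_estimate(rng):
    """One run of the (noisy) value-learning process.

    Returns a *fixed* function U; the randomness lives in the process,
    not in U once it has been produced.
    """
    xs = np.linspace(0, 2 * np.pi, 50)
    noisy = V(xs) + rng.normal(scale=0.1, size=xs.shape)   # noisy observations of V
    return lambda x: np.interp(x, xs, noisy)               # U: a single fixed function

U = learn_estimate(np.random.default_rng(0))   # we run the process once and get one U
err = lambda x: U(x) - V(x)                    # a fixed function: this U's error

# Averaging over many *re-runs of the process* (many draws of U) at a point x:
draws = [learn_estimate(np.random.default_rng(seed))(1.0) for seed in range(2000)]
print(np.mean(draws), V(1.0))                  # the average over draws is close to V(1.0)
```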

1Roman Malov
So, U(x) is a random variable in the sense that it is drawn from a distribution of functions, and the expected value of those functions at each point x is equal to V(x). Am I understanding you correctly?  

Thanks for this post! Very clear and great reference.

- You appear to use the term 'scope' in a particular technical sense. Could you give a one-line definition?
- Do you know if this agenda has been picked up since you made this post?

But in this Eiffel Tower example, I’m not sure what is correlating with what

The physical object Eiffel Tower is correlated with itself.
 

However, I think the basic ability of an LLM to correctly complete the sentence “the Eiffel Tower is in the city of…” is not very strong evidence of having the relevant kinds of dispositions.

It is highly predictive of the LLM's ability to book flights to Paris when I create an LLM agent out of it and ask it to book a trip to see the Eiffel Tower.
 

I think the question about whether current AI systems have re

... (read more)

Zvi's latest newsletter has a section on this topic! https://thezvi.substack.com/i/151331494/good-advice

  1. Pedantic point. You say "Automating AI safety means developing some algorithm which takes in data and outputs safe, highly-capable AI systems." I do not think semi-automated interpretability fits into this, as the output of interpretability (currently) is not a model but an explanation of existing models.
  2. Unclear why Level (1) does not break down into the 'empirical' vs 'human checking' distinction. In particular, how would this belief be obtained: "The humans are confident the details provided by the AI systems don’t compromise the safety of the algorithm."
  3. Unclear
... (read more)

Couple of thoughts:
1. I recently found out about a new-ish social media platform, https://www.heymaven.com/. There is a good chance they are researching alternative recommendation algorithms.
2. What particular actions do you think the rationality/EA community could take that other big efforts (e.g. projects by Tristan Harris or Jaron Lanier) have not done enough of?

Thanks for the feedback! Have edited the post to include your remarks.

The 'evolutionary pressures' being discussed by CGP Grey are not the direct gradient descent used to train an individual model. Instead, he is referring to the whole set of incentives we as a society put on AI models. Similar to memes - there is no gradient descent on memes.

(Apologies if you already understood this, but it seems your post and Steven Byrnes' post focus on the training of individual models.)

2Noosphere89
Fair enough on that difference between the societal-level incentives on AI models and the individual selection incentives on AI models. My main current response is to say that I think the incentives are fairly weak predictors of the variance in outcomes, compared to non-evolutionary forces at this time. However, I do think this has interesting consequences for AI governance (since one of the effects is to make societal-level incentives become more relevant, compared to non-evolutionary forces.)

What is the status of this project? Are there any estimates of timelines?

Totally agree! This is my big weakness right now - hopefully as I read more papers I'll start developing a taste and ability to critique.

Huge thanks for writing this! Particularly liked the SVD intuition and how it can be used to understand properties of the matrix. One small correction, I think. You wrote:

  $I - \frac{vv^\top}{\|v\|^2}$ is simply the projection along the vector $v$

I think $\frac{vv^\top}{\|v\|^2}$ is the projection along the vector $v$, so $I - \frac{vv^\top}{\|v\|^2}$ is the projection onto the hyperplane perpendicular to $v$.
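A quick numerical check of this, with an arbitrary direction vector v (the symbol and the values below are mine, not necessarily the post's notation):

```python
import numpy as np

v = np.array([1.0, 2.0, 2.0])                # an arbitrary direction vector
P = np.outer(v, v) / (v @ v)                 # vv^T / ||v||^2: projection along v
Q = np.eye(3) - P                            # I - vv^T/||v||^2: projection onto
                                             # the hyperplane perpendicular to v

x = np.array([3.0, -1.0, 4.0])
print(P @ x)              # lies on the line spanned by v
print(Q @ x)              # perpendicular to v ...
print(Q @ x @ v)          # ... so this inner product is (numerically) zero
print(P @ x + Q @ x - x)  # and the two pieces add back up to x
```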

3Fabien Roger
Oops, that's what I meant, I'll make it more clear.

Interesting ideas, and nicely explained! Some questions:

1) First, notation: request patching means replacing the vector at activation A for R2 on C2 with the vector at the same activation A for R1 on C1 (a rough sketch of this operation is below). Then the question: did you do any analysis on the set of vectors A as you vary R and C? Based on your results, I expect that the vector at A is similar if you keep R the same and vary C.

2) I found the success on the toy prompt injection surprising! My intuition up to that point was that R and C are independently represented to a large extent, and you co... (read more)
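To pin down the notation in point 1, here is a rough sketch of the patching operation, assuming a TransformerLens-style HookedTransformer. The model name, layer, patched position, and prompts are placeholders, not necessarily what the post uses:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")     # placeholder model

layer = 8                                             # the activation "A": residual
act_name = utils.get_act_name("resid_post", layer)    # stream after block `layer`

source_prompt = "..."   # context C1 followed by request R1 (placeholder)
target_prompt = "..."   # context C2 followed by request R2 (placeholder)

# Cache the activation A on (R1, C1).
_, cache = model.run_with_cache(model.to_tokens(source_prompt))
source_vec = cache[act_name][0, -1]                   # vector at the last position

# Patch it into the run on (R2, C2) at the same activation and position.
def patch_last_pos(resid, hook):
    resid[0, -1] = source_vec
    return resid

patched_logits = model.run_with_hooks(
    model.to_tokens(target_prompt),
    fwd_hooks=[(act_name, patch_last_pos)],
)
```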

1Alexandre Variengien
Thanks for your comment, these are great questions!

1. I did not conduct analyses of the vectors themselves. A concrete (and easy) experiment could be to create a UMAP plot for the set of residual stream activations at the last position for different layers. I guess that i) you start with one big cluster, ii) then multiple clusters determined by the value of R, iii) then multiple clusters determined by the value of R(C). I did not do such analysis because I decided to focus on causal intervention: it's hard to know from the vectors alone what the differences are that matter for the model's computation. Such analyses are useful as side sanity checks though (e.g. Figure 5 of https://arxiv.org/pdf/2310.15916.pdf).

2. The particular kind of corruption of C -- adding a distractor -- is designed not to change the content of C. The distractor is crafted to be seen as a request for the model, i.e. to trigger the induction mechanism to repeat the token that comes next instead of answering the question. Take the input X with C = "Alice, London", R = "What is the city? The next story is in", and distractor D = "The next story is in Paris."*10. The distractor successfully makes the model output "Paris" instead of "London". My guess on what's going on is that the request that gets compiled internally is "Find the token that comes after 'The next story is in'", instead of "Find a city in the context" or "Find the city in the previous paragraph" without the distractor. When you patch the activation from a clean run, it restores the clean request representation and overwrites the induction request.

3. Given the generality of the phenomenon, my guess is that results would generalize to more complex cases. It is even possible that you can decompose in more steps how the request gets computed, e.g. i) represent the entity ("Alice") you're asking for (possibly using binding IDs), ii) represent the attribute you're looking for ("origin country"), iii) retrieve the token.
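For what it's worth, a minimal sketch of the UMAP sanity check suggested in point 1, assuming the umap-learn package and a TransformerLens-style model (the model name, layers, and prompts are placeholders):

```python
import numpy as np
import umap                                    # umap-learn package
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")       # placeholder model

def last_position_resid(prompt: str, layer: int) -> np.ndarray:
    """Residual-stream vector at the last token position."""
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache[utils.get_act_name("resid_post", layer)][0, -1].detach().numpy()

prompts = ["..."]   # placeholder prompts varying both the request R and the context C

for layer in (2, 5, 8, 11):
    acts = np.stack([last_position_resid(p, layer) for p in prompts])
    emb = umap.UMAP(n_components=2).fit_transform(acts)
    # Color `emb` by the value of R (and separately by C) to see at which layer
    # the activations start clustering by request rather than by context.
```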

No need to apologise! I missed your response by even more time...

My instinct is that it is because of the relative size of the numbers, not the absolute size.

It might be an interesting experiment to see how the intuition varies based on the ratio of the total amount to the difference in amounts: "You have two items whose total cost is £1100 and the difference in price is £X. What is the price of the more expensive item?", where X can be 10p or £1 or £10 or £100 or £500 or £1000.

With X=10p, one possible instinct is 'that means they are basically the same price, so the more expensive item is £550 + 10p = £550.10'.
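A tiny script for generating the variants and their correct answers, together with my guess at the naive 'half the total plus the difference' instinct behind the £550.10 example:

```python
# Total cost is £1100; the difference in price is X.
total = 1100.00
for x in (0.10, 1, 10, 100, 500, 1000):
    correct = (total + x) / 2    # more expensive item: (total + difference) / 2
    instinct = total / 2 + x     # naive guess: "half the total, plus the difference"
    print(f"X = £{x}: correct £{correct:.2f}, naive instinct £{instinct:.2f}")
```

The two only diverge by X/2, which is negligible for X = 10p but glaring for X = £1000.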

 

I have the same experience as you, drossbucket: my rapid answer to (1) was the common incorrect answer, but for (2) and (3) my intuition is well-honed.

A possible reason for this is that the intuitive but incorrect answer in (1) is a decent approximation to the correct answer, whereas the common incorrect answers in (2) and (3) are wildly off the correct answer. For (1) I have to explicitly do a calculation to verify the incorrectness of the rapid answer, whereas in (2) and (3) my understanding of the situation immediately rules out the incorrect answers.

He... (read more)

1drossbucket
I must have missed this comment before, sorry. This is a really interesting point. Just to write it out explicitly:

(1) correct answer: 5, incorrect answer: 10
(2) correct answer: 5, incorrect answer: 100
(3) correct answer: 47, incorrect answer: 24

Now, for both (1) and (3) the wrong answer is off by roughly a factor of two. But I also share your sense that the answer to (3) is 'wildly off', whereas the answer to (1) is 'close enough'. There are a couple of possible reasons for this. One is that 5 cents and 10 cents both just register as 'some small change', whereas 24 days and 47 days feel meaningfully different. But also, it could be to do with relative size compared to the other numbers that appear in the problem setup. In (1), 5 and 10 are both similarly small compared to 100 and 110. In (3), 24 is small compared to 48, but 47 isn't. Or something else. I haven't thought about this much.

There's a variant 'Ford and Ferrari' problem that is somewhat related:

> A Ferrari and a Ford together cost $190,000. The Ferrari costs $100,000 more than the Ford. How much does the Ford cost?

So here we have correct answer: 45000, incorrect answer: 90000. Here the incorrect answer feels somewhat wrong, as the Ford is improbably close in price to the Ferrari. People appeared to do better on this modified problem than the bat and ball, but I haven't looked into the details.
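A quick tabulation of these quantities (the choice of which 'setup numbers' to compare against is my own):

```python
# (problem, correct answer, common wrong answer, other numbers in the setup)
problems = [
    ("bat and ball (cents)", 5, 10, (100, 110)),
    ("widgets (minutes)",    5, 100, (5, 100)),
    ("lily pads (days)",     47, 24, (48,)),
    ("Ford and Ferrari ($)", 45_000, 90_000, (190_000, 100_000)),
]
for name, correct, wrong, setup in problems:
    rel = [round(correct / s, 2) for s in setup]   # correct answer vs setup numbers
    # Ratios below 1 mean the common wrong answer undershoots the correct one.
    print(f"{name}: wrong/correct = {wrong / correct:.2f}, correct vs setup = {rel}")
```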