All of ojorgensen's Comments + Replies

Strong upvote!

One thing I'd emphasise is that there's a pretty big overhead to submitting a single application (getting recommendation letters, writing a generic statement of purpose), but it doesn't take much effort to apply to more after that (you can rejig your SOP quite easily to fit different universities). Given that the application process is noisy and competitive, if you're submitting one application you should probably submit loads, if you can afford the application fees. Good luck to everyone applying! :))

joec
I'll have to push back on this. I think if there's one specific program that you'd like to go to, especially if there's an advisor you have in mind, it's good to tailor your application to that program. However, this might not apply to the typical reader of this post. I followed a K strategy with my PhD statements of purpose (and recommendations) rather than an r strategy: I tailored my applications to the specific schools, and it seemed to work decently well. I know of more qualified people who spent much less time on each application and were rejected from a much higher proportion of schools. (Disclaimer: this is all anecdotal. Also, I was applying to chemistry programs, not AI.)

Yeah I think we have the same understanding here (in hindsight I should have made this more explicit in the post / title).

I would be excited to see someone empirically try to answer the question you mention at the end. In particular, given some direction d and a LayerNormed vector x, one might try to quantify how smoothly rotating from x towards d changes the output of the MLP layer. This seems like a good test of whether the Polytope Lens is helpful / necessary for understanding the MLPs of Transformers (with smoot... (read more)
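To gesture at the kind of measurement I have in mind, here is a rough sketch (with a randomly initialised stand-in MLP, arbitrary choices of x and d, and spherical interpolation as one particular way of "rotating"; none of these details come from the original post):

```python
import torch

torch.manual_seed(0)
d_model = 64
# Stand-in for a Transformer MLP block (randomly initialised, purely illustrative).
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.GELU(),
    torch.nn.Linear(4 * d_model, d_model),
)

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b."""
    omega = torch.arccos(torch.dot(a, b).clamp(-1 + 1e-6, 1 - 1e-6))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

# Treat the LayerNormed input as living on a sphere (unit radius here for simplicity;
# a real LayerNormed vector has radius sqrt(d_model)).
x = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
d = torch.nn.functional.normalize(torch.randn(d_model), dim=0)

ts = torch.linspace(0, 1, 50)
with torch.no_grad():
    outs = torch.stack([mlp(slerp(x, d, t)) for t in ts])

# How far the MLP output moves per small step of rotation: a smooth curve here
# would suggest the MLP responds smoothly to rotating x towards d.
print((outs[1:] - outs[:-1]).norm(dim=1))
```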

Chris_Leong
Also: it seems like there would be an easier way to arrive at the observation this post makes, i.e. directly showing that kV and V get mapped to the same point by LayerNorm (excluding the epsilon). Don't get me wrong, the circle is cool, but it seems like a bit of a detour.
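For anyone who wants to check this numerically, a quick sketch (arbitrary width, LayerNorm with no affine parameters, default eps):

```python
import torch

torch.manual_seed(0)
ln = torch.nn.LayerNorm(16, elementwise_affine=False)  # pure normalisation

v = torch.randn(16)
for k in [0.5, 2.0, 100.0]:
    diff = (ln(v) - ln(k * v)).abs().max().item()
    print(f"k = {k:6.1f}   max |LN(v) - LN(kv)| = {diff:.2e}")

# The differences are tiny (of order eps) and would be exactly zero with eps = 0.
# This holds for positive k; a negative k flips the sign of the output.
```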

It would save me a fair amount of time if all LessWrong posts had an "export BibTeX citation" button, exactly like the feature on arXiv. This would be particularly useful for Alignment Forum posts!

One central criticism of this post is its pessimism towards enumerative safety (i.e. finding all features in the model, or at least all the important features). I would be interested to hear how the author / others have updated on the potential of enumerative safety in light of recent progress on dictionary learning, and on finding features which appear to correspond to high-level concepts like truth, utility and sycophancy. It seems clear that there should be some positive update here, but I would be interested in understanding issues which these approaches wil... (read more)

But this does not hold for tiny cosine similarities (e.g. 0.01, which gives a lower bound of 2 using the formula above). I'm not aware of a lower bound better than d for tiny angles.

Unless I'm misunderstanding, a better lower bound for almost-orthogonal vectors when the cosine similarity is approximately 0 is just d (the dimension), by taking an orthogonal basis for the space.

My guess for why the formula doesn't give this is that it is derived by covering a sphere with non-intersecting spherical caps, which is sufficient for ... (read more)
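As a concrete illustration of the gap between the two bounds, a small sketch (d = 1000 and the 0.01 threshold are just illustrative numbers, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

# d exactly orthogonal unit vectors: every pairwise cosine similarity is 0,
# so the trivial lower bound of d vectors already beats a bound of 2.
basis = np.eye(d)
assert np.abs(basis @ basis.T - np.eye(d)).max() == 0.0

# For comparison: d random unit vectors are typically ~1/sqrt(d) ~ 0.03 apart
# in cosine similarity, i.e. they don't even meet a 0.01 threshold.
x = rng.standard_normal((d, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
cos = x @ x.T
np.fill_diagonal(cos, 0.0)
print("max |cos sim| among random unit vectors:", np.abs(cos).max())  # well above 0.01
```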

Fabien Roger
You made me curious, so I ran a small experiment. Using the sum of absolute cosine similarities as the loss, initializing randomly on the unit sphere, and optimizing until convergence with L-BFGS (with strong Wolfe line search), here are the maximum cosine similarities I get (average and stds over 5 runs, since there was a bit of variation between runs): It seems consistent with the exponential trend, but it also looks like you would need dim >> 1000 to get any significant boost in the number of vectors you can fit with cosine similarity < 0.01, so I don't think this happens in practice. My optimization might have failed to converge to the global optimum, though; this is not a nicely convex optimization problem (but the fact that there is little variation between runs is reassuring).
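Something in the spirit of that experiment can be sketched as follows (the dimensions, optimizer settings, and step count here are illustrative guesses, not the exact setup used):

```python
import torch

def max_cos_sim(n_vectors: int, dim: int, steps: int = 200, seed: int = 0) -> float:
    """Optimize n_vectors unit vectors in R^dim to minimize the sum of absolute
    pairwise cosine similarities, then return the worst remaining pair."""
    torch.manual_seed(seed)
    x = torch.randn(n_vectors, dim, requires_grad=True)
    opt = torch.optim.LBFGS([x], line_search_fn="strong_wolfe", max_iter=steps)

    def closure():
        opt.zero_grad()
        u = x / x.norm(dim=1, keepdim=True)      # project onto the unit sphere
        cos = u @ u.T
        loss = (cos - torch.eye(n_vectors)).abs().sum()  # sum of |cos sim| over pairs
        loss.backward()
        return loss

    opt.step(closure)
    with torch.no_grad():
        u = x / x.norm(dim=1, keepdim=True)
        cos = u @ u.T - torch.eye(n_vectors)
        return cos.abs().max().item()

print(max_cos_sim(n_vectors=200, dim=100))
```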

(Potential spoilers!)

There is some relevant literature which explores this phenomenon, also looking at the cosine similarity between words across layers of transformers. I think the most relevant is Cai et al. (2021), where they also find this higher-than-expected cosine similarity between residual stream vectors in some layer for BERT, D-BERT, and GPT. (Note that they use somewhat confusing terminology: they define inter-type cosine similarity to be the cosine similarity between embeddings of different tokens in the same input; and intra-type cosine simi... (read more)

Problem: we want to make it hard for ML systems (trained via SGD) to perform naive gradient hacking. By naive gradient hacking, I mean "being able to keep some weights of the network constant for an arbitrary step of SGD".

Solution: do "stochastic" regularisation, e.g. randomly sample the amount of regularisation we apply at each step (could use quantum stuff if we want true randomness). This seems like it should make naive gradient hacking almost impossible: in order to keep some target weights unchanged, you'd have to match their positive contribution to the loss to the... (read more)
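Concretely, the idea is something like the following toy sketch (uniformly resampling an L2 penalty coefficient each step; the distribution and scale are arbitrary choices of mine):

```python
import torch

def training_step(model, x, y, loss_fn, opt, max_coeff=1e-2):
    """One SGD step where the strength of the L2 penalty is resampled at random,
    so a would-be gradient hacker can't predict (and cancel) its contribution."""
    coeff = torch.rand(()).item() * max_coeff  # fresh random regularisation strength
    loss = loss_fn(model(x), y)
    loss = loss + coeff * sum(p.pow(2).sum() for p in model.parameters())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage:
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 4), torch.randn(32, 1)
training_step(model, x, y, torch.nn.functional.mse_loss, opt)
```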

Just a nit-pick, but to me "AI growth-rate" suggests economic growth due to progress in AI, as opposed to simply technical progress in AI. I think "Excessive AI progress yields little socio-economic benefit" would make the argument more immediately clear.

Rando et al. (2022)

This link is broken btw!

scasper
Thanks! Fixed.  https://arxiv.org/abs/2210.04610 

Didn't get that impression from your previous comment, but this seems like a good strategy!

This seems like a bad rule of thumb. If your social circle is largely composed of people who have chosen to remain within the community, ignoring information from "outsiders" seems like a bad strategy for understanding issues with the community.

Ben Pace
Yeah, but that doesn't sound like my strategy. I've talked many times to people who are leaving or have left, and interviewed them about what they didn't like and their reasons for leaving.

Even if OpenAI don't have the option to stop Bing Chat being released now, this would surely have been discussed during investment negotiations. It seems very unlikely that this is being released without approval from decision-makers at OpenAI within the last month or so. If they somehow didn't foresee that something could go wrong, and had no mitigations in place in case Bing Chat started going weird, that's pretty terrible planning.

This seems very similar to work that has come out of the Stanford AI Lab recently, linked here.

Buck
It’s a pretty different algorithm, though obviously it’s trying to solve a related problem.

Great post! This helps to clarify and extend lots of fuzzy intuitions I had around gradient hacking, so thanks! If anyone is interested in a different perspective / set of intuitions for how some properties of gradient descent affect gradient hacking, I wrote a small post about this here: https://www.lesswrong.com/posts/Nnb5AqcunBwAZ4zac/extremely-naive-gradient-hacking-doesn-t-work

I'd expect this to mainly be of use if the properties of gradient descent labelled 1, 4, and 5 were not immediately obvious to you.

Hey! I'm not currently working on anything related to this, but I'd be excited to read anything you're writing on it :))

I went through the paper for a reading group the other day, and I think the video really helped me to understand what is going on in it. The parts I found most useful were the indications of which parts of the paper / maths were most important to understand, and which were not (e.g. tensor products).

I had made some effort to read the paper before with little success, but now feel like I understand its overall results pretty well. I'm very positive about this video, and about similar things being made in the future!

Personal context: I found the intro-to-IB video series similarly useful. I'm an AI master's student with some pre-existing knowledge of AI alignment and a maths background.

Neel Nanda
Thanks for the feedback! Glad to hear it was useful :)

Firstly, thanks for reading the post! I think you're referring mainly to realisability here, which I'm not that clued up on tbh, but I'll give you my two cents because why not.

I'm not sure to what extent we should focus on unrealisability when aligning systems. I think I have a similar intuition to you, that the important question is probably "how can we get good abstractions of the world, given that we cannot perfectly model it?". However, I think better arguments than the ones I have laid out for why unrealisability is a core problem in alignment probably do e... (read more)

I'm not sure if this is what you're looking for, but Hofstadter gives a great analogy using record players, which I find useful for thinking about how changing the situation changes our results (paraphrased here).

  • A (hi-fi) record player that tries to play every possible sound can't actually play its own self-breaking sound, so it is incomplete by virtue of its strength.
  • A (lo-fi) record player that refuses to play all sounds (in order to avoid destruction from its self-breaking sound) is incomplete by virtue of its weakness.

We may ... (read more)

Viliam
What exactly is the aspect of natural numbers that makes them break math, as opposed to other types of values? Intuitively, it seems to be the fact that they can be arbitrarily large but not infinite. Like, if you invented another data type that only has a finite number of values, it would not allow you to construct something equivalent to Gödel numbering. But if it allows an infinite number of (finite) values, it would. (Not sure about an infinite number of values, including infinite ones; that would probably also break math.)

It seems like you cannot precisely define the natural numbers using first-order logic. Is that the reason for all of this, or is it a red herring? Would the situation be somehow better with second-order logic?

(These are the kinds of questions that I assume would be obvious to me if I grokked the situation. So the fact that they are not obvious suggests that I do not see the larger picture.)

I found this post really interesting, thanks for sharing it!

It doesn’t seem obvious to me that the methods of understanding a model given a high path-dependence world become significantly less useful if we are in a low path-dependence world. I think I see why low path-dependence would give us the opportunity to use different methods of analysis, but I don’t see why the high path-dependence ones would no longer be useful.

For example, here is the reasoning behind "how likely is deceptive alignment" in a high path-dependence world (quoted from the slide)... (read more)

[anonymous]
Not sure, but here's how I understand it: if we are in a low path-dependence world, the fact that SGD takes a certain path doesn't say much about what type of model it will eventually converge to. Even if these steps occurred to produce a deceptive model, SGD could still "find a path" to the corrigibly aligned version. The question of whether it would find these other models depends on things like "how big is the space of models with this property?", which corresponds to a complexity bias.

I really like this post! I can't see whether you've already cross-posted this to the EA Forum, but it seems valuable to have it there too (as it is focussed on the EA community).

Sam F. Brown
I'm happy for it to be cross-posted there, but I'm not sure how to do that myself. If anyone else wants to, feel free. (Edit: I'm confused by the downvote. Is this advising against cross-posting? Or suggesting that I should work out how to and then do it myself?)