All of Gurkenglas's Comments + Replies

It sounds like you're trying to define unfair as evil.

I just meant the "guts of the category theory" part. I'm concerned that anyone says that it should be contained (aka used but not shown), and hope it's merely that you'd expect to lose half the readers if you showed it. I didn't mean to add to your pile of work, and if there is no available action (like snapping a photo) that takes less time than writing the reply I'm replying to did, then disregard me.

1Lorxus
The phrasing I got from the mentor/research partner I'm working with is pretty close to the former but closer in attitude and effective result to the latter. Really, the major issue is that string diagrams for a flavor of category and commutative diagrams for the same flavor of category are straight-up equivalent, but explicitly showing this is very very messy, and even explicitly describing Markov categories - the flavor of category I picked as likely the right one to use, between good modelling of Markov kernels and their role doing just that for causal theories (themselves the categorification of "Bayes nets up to actually specifying the kernels and states numerically") - is probably too much to put anywhere in a post but an appendix or the like.

There is not, but that's on me. I'm juggling too much and having trouble packaging my research in a digestible form. Precarious/lacking funding and consequent binding demands on my time really don't help here either. I'll add you to the long long list of people who want to see a paper/post when I finally complete one.

I guess a major blocker for me is - I keep coming back to the idea that I should write the post as a partially-ordered series of posts instead. That certainly stands out to me as the most natural form for the information, because there's three near-totally separate branches of context - Bayes nets, the natural latent/abstraction agenda, and (monoidal category theory/)string diagrams - of which you need to somewhat understand some pair in order to understand major necessary background (causal theories, motivation for Bayes net algebra rules, and motivation for string diagram use), and all three to appreciate the research direction properly. But I'm kinda worried that if I start this partially-ordered lattice of posts, I'll get stuck somewhere. Or run up against the limits of what I've already worked out yet. Or run out of steam with all the writing and just never finish. Or just plain "no one will want to

What if you say that when it was fully accurate?

2Joseph Miller
Then it will often confabulate a reason why the correct thing it said was actually wrong. So you can never really trust it; you have to think about what makes sense and test your model against reality. But to some extent that's true for any source of information. LLMs are correct about a lot of things and you can usually guess which things they're likely to get wrong.
2Mateusz Bagiński
Not OP but IME it might (1) insist that it's right, (2) apologize, think again, generate code again, but it's mostly the same thing (in which case it might claim it fixed something or it might not), (3) apologize, think again, generate code again, and it's not mostly the same thing.

give me the guts!!1

don't polish them, just take a picture of your notes or something.

1Lorxus
I guess? I mean, there's three separate degrees of "should really be kept contained"-ness here:

* Category theory -> string diagrams, which pretty much everyone keeps contained, including people who know the actual category theory
* String diagrams -> Bayes nets, which is pretty straightforward if you sit and think for a bit about the semantics you accept/are given for string diagrams generally and maybe also look at a picture of generators and rules - not something anyone needs to wrap up nicely but it's also a pretty thin
* [Causal theory/Bayes net] string diagrams -> actual statements about (natural) latents, which is something I am still working on; it's turning out to be pretty effortful to grind through all the same transcriptions again with an actually-proof-usable string diagram language this time.

I have draft writeups of all the "rules for an algebra of Bayes nets" - a couple of which have turned out to have subtleties that need working out - and will ideally be able to write down and walk through proofs entirely in string diagrams while/after finishing specifications of the rules. So that's the state of things. Frankly I'm worried and generally unhappy about the fact that I have a post draft that needs restructuring, a paper draft that needs completing, and a research direction to finish detailing, all at once. If you want some partial pictures of things anyway all the same, let me know.

Congratulations on changing your mind!

It’s sorta suspicious that I only realized those now, after I officially dropped the project

You should try dropping your other idea and seeing if you come up with reasons that one is wrong too! And/or pick this one up again, then come up with reasons it's a good idea after all. In the spirit of "You can't know if something is a good idea until you resolve to do it"!

In general, I wish this year? (*checks* huh, only 4 months.) of planning this project had involved more empiricism. For example, you could've just checked whether a language model trained on ocean sounds can say what the animals are talking about.

1Towards_Keeperhood
Nah, I didn't lose that much time. I already quit the project at the end of January; I just wrote the post now. Most of the technical work was also pretty useful for understanding language, which is a useful angle on agent foundations. I had previously expected working on that angle to be 80% as effective as my previous best plan, but it was even better, around similarly good I think. That was like 5-5.5 weeks and that was not wasted. I guess I spent like 4.5 weeks overall on learning about orcas (including first seeing whether I might be able to decode their language and thinking about how, and also coming up with the whole "teach language" idea), and like 3 weeks on organizational stuff for trying to make the experiment happen.

Hmm. Sounds like it was not enough capsaicin. Capsaicin will drive off bears, I hear. I guess you'd need gloves for food, or permanent gloves without the nail polish. Could you use one false nail as a chew toy?

2DirectedEvolution
Unfortunately the level of physical restraint I’d need to stop biting is too costly to be worth it to me.
2DirectedEvolution
It actually did contain capsaicin IIRC. Sort of a bitter spicy mix. The other issue is it gets on things you touch, including food if you’re preparing or eating it by hand.
2DirectedEvolution
I’ve tried that, but it’s not enough to stop me. Makes my mouth taste disgusting for no benefit.
1Rafka
Yeah I thought about that, but (I didn't expand on that) the habit also included picking skin around my cuticles with my fingers, so that would've only half worked at best.

Link an example, along with how cherry-picked it is?

To prepare for abundant cognition you can install a keylogger.

2Raemon
Do you have existing ones you recommend? I'd been working on a keylogger / screenshot-parser that's optimized for a) playing nicely with LLMs while b) being unopinionated about what other tools you plug it into. (In my search for existing tools, I didn't find keyloggers that actually did the main thing I wanted, and the existing LLM-tools that did similar things were walled-garden ecosystems that didn't give me much flexibility on what I did with the data.)

As a kid, I read about vacuum decay in a book and told the other kids at school about it. A year? later, one kid asked me how anyone knows about it. Mortified that I didn't think of that, I told him that I made it up. ("I knew it >:D!") It is the one time outside of games that I remember telling someone something I disbelieve so that they'll believe it, and ever since remembering the scene as an adult I've been failing to track down that kid :(.

Oh, you're using AdamW everywhere? That might explain the continuous training loss increase after each spike, with AdamW needing time to adjust to the new loss landscape...

Lower learning rate leads to more spikes? Curious! I hypothesize that... it needs a small learning rate to get stuck in a narrow local optimum, and then when it reaches the very bottom of the basin, you get a ~zero gradient, and then the "normalize gradient vector to step size" step is discontinuous around zero.
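A toy sketch of that last step (my own illustration, not the post's actual optimizer): in one dimension, "normalize the gradient to the step size" reduces the update to lr * sign(grad), which is discontinuous at grad = 0, so the iterate can never settle at the bottom of the basin:

```python
# Toy model of the hypothesized mechanism (an illustration, not the post's
# optimizer): in 1-D, normalizing the gradient to a fixed step size reduces
# the update to lr * sign(grad), which is discontinuous at grad = 0. On
# f(x) = x^2 the iterate marches to the basin floor and then bounces
# forever: once |x| < lr, every step overshoots the minimum.
def normalized_gd(x0, lr, steps):
    x = x0
    losses = []
    for _ in range(steps):
        grad = 2 * x
        if grad != 0:
            x -= lr * (1 if grad > 0 else -1)  # grad / |grad| in 1-D
        losses.append(x * x)
    return losses

losses = normalized_gd(x0=1.0, lr=0.03, steps=200)
# Early losses shrink steadily; late losses oscillate instead of converging.
print(losses[0], losses[-2], losses[-1])
```

The smaller the learning rate, the narrower the basins the iterate can descend into before this bouncing kicks in, which is consistent with the hypothesis above.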

Experiments springing to mind are:
1. Do you get even fewer spikes if you incr...

1Rareș Baron
Your hypothesis seems reasonable, and I think the following proves it. 1. This is for 5e-3, giving no spikes and faster convergences: 2. Gradient descent failed to converge for multiple LRs, from 1e-2 to 1e-5. However, decreasing the LR by 1.0001 when the training error increases gave this: It's messy, and the decrease seems to turn the jumps of the slingshot effect into causes for getting stuck in sub-optimal basins, but the trajectory was always downwards. Increasing the rate of reduction decreased spikes but convergence no longer appeared. An increase to 2. removed the spikes entirely.

My eyes are drawn to the 120 or so downward tails in the latter picture; they look of a kind with the 14 in https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/2c6249da0e8f77b25ba007392087b76d47b9a16f969b21f7.png/w_1584. What happens if you decrease the learning rate further in both cases? I imagine the spikes should get less tall, but does their number change? Only dot plots, please, with the dots drawn smaller, and red dots too on the same graph.

I don't see animations in the drive folder or cached in Grokking_Demo_additional_2.ipynb (the most recent...

1Rareș Baron
I have uploaded html files of all the animation so they can be interactive. The corresponding training graphs are in the associated notebooks. The original learning rate was 1e-3. For 5e-4, it failed to converge: For 8e-4, it did converge, and the trajectory was downwards this time:

Can a eat that -1?

1Jerdle
It could do, but a represents the amount of utility remaining. Maybe the more natural thing would be to have a be the effective tax rate, and have it be (z/x)^a.

What is x and why isn't it cancelling?

1Jerdle
x is the initial income, and I forgot to cancel it. Good point. Turns out, it's far simpler than I had it as.

When splitting the conjunction, Bob should only have to place $4 in escrow, since that is the furthest in the red that Bob could end up. (Unless someone might privately prove P&Q to collect Alice's bounty before collecting both of Bob's? But surely Bob first bought exclusive access to Alice's bounty from Alice.)

Mimicking homeostatic agents is not difficult if there are some around. They don't need to constantly decide whether to break character, only when there's a rare opportunity to do so.

If you initialize a sufficiently large pile of linear algebra and stir it until it shows homeostatic behavior, I'd expect it to grow many circuits of both types, and any internal voting on decisions that only matter through their long-term effects will be decided by those parts that care about the long term.

3faul_sname
Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.

Having apparently earned some cred, I will dare give some further quick hints without having looked at everything you're doing in detail, expecting a lower hit rate.

  1. Have you rerun the experiment several times to verify that you're not just looking at initialization noise?
  2. If that's too expensive, try making your models way smaller and see if you can get the same results.
  3. After the spikes, training loss continuously increases, which is not how gradient descent is supposed to work. What happens if you use a simpler optimizer, or reduce the learning rate?
  4. Some o...
1Rareș Baron
For 1 and 2 - I have. Everything is very consistent. For 3, I have tried several optimizers, and they all failed to converge. Tweaking the original AdamW to reduce the learning rate led to very similar results: For 4, I have done animations for every model (besides the 2 GELU variants). I saw pretty much what I expected: a majority of relevant developments (fourier frequencies, concentration of singular values, activations and attention heads) happened quickly, in the clean-up phase. The spikes seen in SiLU and SoLU_LN were visible, though not lasting. I have uploaded the notebooks to the drive folder, and have updated the post to reflect these findings. Thank you very much, again!

I'm glad that you're willing to change your workflow, but you have only integrated my parenthetical, not the more important point. When I look at https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/tzkakoG9tYLbLTvHG/lelcezcseu001uyklccb, I see interesting behavior around the first red dashed line, and wish I saw more of it. You ought to be able to draw 25k blue points in that plot, one for every epoch - your code already generates that data, and I advise that you cram as much of your code's data into the pictures you look at as you reasonably can.

1Rareș Baron
I am sorry for being slow to understand. I hope I will internalise your advice and the linked post quickly. I have re-done the graphs, to be for every epoch. Very large spikes for SiLU were hidden by the skipping. I have edited the post to rectify this, with additional discussion. Again, thank you (especially your patience).

The forgetful functor FiltSet to Set does not have a left adjoint, and egregiously so - you have added just enough structure to rule out free filtered sets, and may want to make note of where this is important.

(S⊗-) has a right adjoint, suggesting the filtered structure to impose on function sets: The degree of a map f:S->T would be how far it falls short of being a morphism, as this is what makes S⊗U->T one-to-one with U->(S->T).
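For concreteness, here is one way the degree could go (my guess from the adjunction, not a definition stated in the source): a map fails to be a morphism by however much it raises degree,

```latex
\deg(f) \;=\; \sup_{s \in S}\, \max\bigl(0,\ \deg_T(f(s)) - \deg_S(s)\bigr)
```

so that an element u of U of degree d corresponds under currying to a map S->T of degree at most d, exactly when (s, u) ↦ f(s, u) respects degrees in S⊗U->T.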

...what I meant is that plots like this look like they would have had more to say if you had plotted the y value after e.g. every epoch. No reason to throw away perfectly good data, you want to guard against not measuring what you think you are measuring by maximizing the bandwidth between your code and your eyes. (And the lines connecting those data points just look like more data while not actually giving extra information about what happened in the code.)

1Rareș Baron
Apologies for misunderstanding. I get it now, and will be more careful from now on. I have re-run the graphs where such misunderstandings might appear (for this and a future post), and added them here. I don't think I have made any mistakes in interpreting the data, but I am glad to have looked at the clearer graphs. Thank you very much!

Some of these plots look like they ought to be higher resolution, especially when Epoch is on the x axis. Consider drawing dots instead of lines to make this clearer.

1Rareș Baron
I will keep that in mind for the future. Thank you! I have put all high-quality .pngs of the plots in the linked Drive folder.

All we need to create is a Ditto. A blob of nanotech wouldn't need 5 seconds to take the shape of the surface of an elephant and start mimicking its behavior; is it good enough to optionally do the infilling later if it's convenient?

2Owain_Evans
It's on our list of good things to try.

Buying at 12% and selling at 84% gets you 2.8 bits.

Edit: Hmm, that's if he stakes all his cred, by Kelly he only stakes some of it so you're right, it probably comes out to about 1 bit.
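The arithmetic behind that figure, as a quick sketch (my own, not anyone's actual scoring code): moving a binary market's probability from p_buy to p_sell on an event that resolves true earns log2(p_sell / p_buy) bits of log score:

```python
from math import log2

# Bits of log-score gained by moving a market's probability on an event
# that turns out true from p_buy up to p_sell.
def bits_gained(p_buy, p_sell):
    return log2(p_sell / p_buy)

print(bits_gained(0.12, 0.84))  # log2(7), about 2.8
```

Kelly betting stakes only a fraction of bankroll proportional to the edge, which is why the realized score ends up below this full-stake figure.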

The convergent reason to simulate a world is to learn what happens there. When to intervene with letters depends on, uh. Why are you doing that at all?

(Edit: I suppose a congratulatory party is in order when they simulate you back with enough optimizations that you can talk to each other in real time using your mutual read access.)

I deferred my decision to after visiting the Learning Theory course. At the time, the timing had made them seem vaguely affiliated with this programme.

Can you just give every thief a body camera?

Re first, yep, I missed that :(. M does sound like a more worthy barrier than U. Do you have a working example of a (U,M) where some state machine performs well in a manner that's hard to detect?

Re second, I realized that this only allows discrete utilities but didn't think to therefore try a π' that does an exhaustive search over policies ^^. (I assume you are setting "uncomputable to measure performance because that involves the Solomonoff prior" aside here.) Even so, undecidability of whether 000... and 111... get the same utility sounds like a bug. Wha...

2Vanessa Kosoy
I don't think that undecidability of exact comparison (as opposed to comparison within any given margin of error) is necessarily a bug. However, if you really want comparison for periodic sequences, you can insist that the utility function is defined by a finite state machine. This is in any case already a requirement in the bounded compute version.

Regarding 17.4.Open:

Consider π' which try all state machines up to a size and imitate the one that performs best on (U,M); this would tighten the O(nlogn) bound to O(BB^-1(n)).

This fails because your utility functions return constructive real numbers, which don't implement comparison. I suggest that you make it possible to compare utilities.[1]

In which case we get: Within every decidable machine class where every member halts, agents are uncomputably smol.

 

  1. ^

    Such as by

    making P(s,s') return the order of U(s) and U(s').
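To illustrate why constructive reals don't implement comparison (a toy model of mine, with a real given as a function returning rational approximations within 2^-n of the true value): refining precision decides an inequality exactly when the two numbers differ, and equality is never certified:

```python
from fractions import Fraction

# A constructive real is modeled as approx(n) -> rational within 2**-n of
# the true value. Comparison by refining precision is only semi-decidable:
# it terminates when the true values differ by more than the combined error
# bound, and would loop forever on equal inputs, so we cap the precision.
def compare(approx_a, approx_b, max_precision=50):
    for n in range(1, max_precision):
        a, b = approx_a(n), approx_b(n)
        if a - b > Fraction(2, 2**n):   # a > b for certain
            return 1
        if b - a > Fraction(2, 2**n):   # b > a for certain
            return -1
    return None  # undecided: equal, or too close to tell at this precision

third = lambda n: Fraction(1, 3)                     # exactly 1/3
above = lambda n: Fraction(1, 3) + Fraction(1, 100)  # 1/3 + 1/100

print(compare(above, third))   # 1: the gap eventually exceeds the error bound
print(compare(third, third))   # None: equality is never certified
```

Making P(s,s') return the order of U(s) and U(s'), as the footnote suggests, is exactly what upgrades this semi-decision procedure to a total one.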

2Vanessa Kosoy
First, it's uncomputable to measure performance because that involves the Solomonoff prior. You can approximate it if you know some bits of Chaitin's constant, but that brings a penalty into the description complexity. Second, I think that saying that comparison is computable means that the utility is only allowed to depend on a finite number of time steps, it rules out even geometric time discount. For such utility functions, the optimal policy has finite description complexity, so g is upper bounded. I doubt that's useful.

If you didn't feel comfortable running it overnight, why did you publish the instructions for replicating it?

2niplav
I had a conversation with Claude 3.6 Sonnet about this, and together we concluded that the worry was overblown. I should've added that in, together with a justification.
4kave
Looks like the base url is supposed to be niplav.site. I'll change that now (FYI @niplav)

I'm hoping more for some stepping stones between the pre-theoretic concept of "structural" and the fully formalized 99%-clause. If we could measure structuralness more directly we should be able to get away with less complexity in the rest of the conjecture.

7Eric Neyman
Thanks, this is a good question. My suspicion is that we could replace "99%" with "all but exponentially small probability in n". I also suspect that you could replace it with 1−ε, with the stipulation that the length of π (or the running time of V) will depend on ε. But I'm not exactly sure how I expect it to depend on ε -- for instance, it might be exponential in 1/ε. My basic intuition is that the closer you make 99% to 1, the smaller the number of circuits that V is allowed to say "look non-random" (i.e. are flagged for some advice π). And so V is forced to do more thorough checks ("is it actually non-random in the sort of way that could lead to P being true?") before outputting 1.   99% is just a kind-of lazy way to sidestep all of these considerations and state a conjecture that's "spicy" (many theoretical computer scientists think our conjecture is false) without claiming too much / getting bogged down in the details of how the "all but a small fraction of circuits" thing depends on n or the length of π or the runtime of V.

Ultimately, though, we are interested in finding a verifier that accepts or rejects  based on a structural explanation of the circuit; our no-coincidence conjecture is our best attempt to formalize that claim, even if it is imperfect.

Can you say more about what made you decide to go with the 99% clause? Did you consider any alternatives?

3Alibi
Reading the post, I also felt like 99% was kind of an arbitrary number. I would have expected it to be something like: for all $\epsilon > 0$ there exists a $V$ such that ... $1-\epsilon$ of random circuits satisfy ...

This does go in the direction of refuting it, but they'd still need to argue that linear probes improve with scale faster than they do for other queries; a larger model means there are more possible linear probes to pick the best from.
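The selection effect mentioned above can be simulated (an illustrative toy of mine, with numbers that are not from the paper): on pure-noise labels, the best of k random probes looks better as k grows, so probe quality must improve with scale faster than this baseline to be evidence of a real representation:

```python
import random

# On pure-noise binary labels, the best of k random "probes" looks better
# as k grows, purely by selection. Any claim that larger models represent
# a feature must beat this baseline, not just show better best-probe scores.
random.seed(0)
n = 30
labels = [random.randint(0, 1) for _ in range(n)]

def best_accuracy(num_probes):
    best = 0.0
    for _ in range(num_probes):
        preds = [random.randint(0, 1) for _ in range(n)]  # a random probe
        acc = sum(p == y for p, y in zip(preds, labels)) / n
        best = max(best, acc, 1 - acc)  # a probe can be read with either sign
    return best

for k in (1, 10, 1000):
    print(k, best_accuracy(k))
```

In a larger model the space of candidate linear probes is higher-dimensional, which plays the role of a larger k here.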

3Matrice Jacobine
I don't see why it should improve faster. It's generally held that the increase in interpretability in larger models is due to larger models having better representations (that's why we prefer larger models in the first place), why should it be any different in scale for normative representations?

I had that vibe from the abstract, but I can try to guess at a specific hypothesis that also explains their data: Instead of a model developing preferences as it grows up, it models an Assistant character's preferences from the start, but their elicitation techniques work better on larger models; for small models they produce lots of noise.

7Matrice Jacobine
This interpretation is straightforwardly refuted (insofar as it makes any positivist sense) by the success of the parametric approach in "Internal Utility Representations" being also correlated with model size.

Ah, oops. I think I got confused by the absence of L_2 syntax in your formula for FVU_B. (I agree that FVU_A is more principled ^^.)

2StefanHex
Oops, fixed!

https://github.com/jbloomAus/SAELens/blob/main/sae_lens/evals.py#L511 sums the numerator and denominator separately, if they aren't doing that in some other place probably just file a bug report?

2StefanHex
I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later in this line which happens after the division metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item() Edit: And to clarify, my impression is that people think of this as alternative definitions of FVU and you got to pick one, rather than one being right and one being a bug. Edit2: And I'm in touch with the SAEBench authors about making a PR to change this / add both options (and by extension probably doing the same in SAELens); though I won't mind if anyone else does it!
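For readers following along, a plain-Python sketch of the two aggregation orders under discussion (scalar "samples" for brevity, which is a simplification; the real code operates on activation vectors): FVU_A divides the summed squared error by the summed variance, while FVU_B averages per-sample ratios, and they generally disagree:

```python
# Two aggregation orders for "fraction of variance unexplained" over a
# batch. FVU_A sums squared error and variance across all samples before
# dividing; FVU_B divides per sample, then averages the ratios. Samples
# whose value sits near the batch mean dominate FVU_B.
def fvu_a(xs, xhats):
    mean = sum(xs) / len(xs)
    resid = sum((x - xh) ** 2 for x, xh in zip(xs, xhats))
    total = sum((x - mean) ** 2 for x in xs)
    return resid / total

def fvu_b(xs, xhats):
    mean = sum(xs) / len(xs)
    ratios = [(x - xh) ** 2 / (x - mean) ** 2 for x, xh in zip(xs, xhats)]
    return sum(ratios) / len(ratios)

xs = [1.0, 2.0, 3.0, 10.0]
xhats = [1.1, 1.8, 3.3, 9.5]
print(fvu_a(xs, xhats), fvu_b(xs, xhats))  # different values
```

Here the sample near the mean (3.0) contributes a large ratio to FVU_B while barely moving FVU_A, which is the crux of the disagreement between the two definitions.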

Thanks, edited. If we keep this going we'll have more authors than users x)

2MondSemmel
You're making a very generous offer of your time and expertise here. However, to me your post still feels way, way more confusing than it should be. Suggestions & feedback:

* Title: "Get your math consultations here!" -> "I'm offering free math consultations for programmers!" or similar.
  * Or something else entirely. I'm particularly confused how your title (math consultations) leads into the rest of the post (debuggers and programming).
* First paragraph: As your first sentence, mention your actual, concrete offer (something like "You screenshare as you do your daily tinkering, I watch for algorithmic or theoretical squiggles that cost you compute or accuracy or maintainability." from your original post, though ideally with much less jargon). Also your target audience: math people? Programmers? AI safety people? Others?
* "click the free https://calendly.com/gurkenglas/consultation link" -> What you mean is: "click this link for my free consultations". What I read is a dark pattern à la: "this link is free, but the consultations are paid". Suggested phrasing: something like "you can book a free consultation with me at this link"
* Overall writing quality
  * Assuming all your users would be as happy as the commenters you mentioned, it seems to me like the writing quality of these posts of yours might be several levels below your skill as a programmer and teacher. In which case it's no wonder that you don't get more uptake.
  * Suggestion 1: feed the post into an LLM and ask it for writing feedback.
  * Suggestion 2: imagine you're a LW user in your target audience, whoever that is, and you're seeing the post "Get your math consultations here!" in the LW homepage feed, written by an unknown author. Do people in your target audience understand what your post is about, enough to click on the post if they would benefit from it? Then once they click and read the first paragraph, do they understand what it's about and click on the link if they would benefit f

Account settings let you set mentions to notify you by email :)

The action space is too large for this to be infeasible, but at a 101 level, if the Sun spun fast enough it would come apart, and angular momentum is conserved so it's easy to add gradually.
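The 101-level claim can be checked on the back of an envelope (my own estimate with standard solar constants, not from the source): equatorial material becomes unbound when centrifugal acceleration matches surface gravity, i.e. when omega^2 * R = G*M / R^2:

```python
from math import pi, sqrt

# Breakup spin for the Sun: material at the equator is unbound once
# omega^2 * R = G*M / R^2. Standard solar values below.
GM_SUN = 1.327e20   # G * M_sun, m^3/s^2
R_SUN = 6.96e8      # solar radius, m

omega = sqrt(GM_SUN / R_SUN**3)          # breakup angular velocity, rad/s
period_hours = 2 * pi / omega / 3600
print(round(period_hours, 1))  # a rotation period of roughly 2.8 hours
```

For comparison, the Sun's actual rotation period is around 25 days, so the required spin-up is enormous; conservation of angular momentum just means the momentum can be delivered incrementally.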

Can this program that you've shown to exist be explicitly constructed?
