All of metawrong's Comments + Replies

Finetuned a SAE on deceptive/non deceptive reasoning traces from  Gemma 9b

 

If you generate synthetic deceptive trajectories, how can you be sure the SAE is going to generalise to 'real' deceptive trajectories? Also, in those cases, why do you need to use SAEs? Can you use probes instead?

1 Gerard Boxo
Hey, thanks for the comment! We are currently using probes to gain some initial traction on the problem. However, the unsupervised nature of SAE labels seems to better align with the broader application of activation space methods for AI control. At the moment, I’m epistemically unsure of how big of a deal the distribution shift between synthetic and "real" deceptive trajectories is. Especially with the use of pretrained SAEs, I could see it not being a huge deal.  

How does this explain the Decoy effect [1]?

  1. ^ I am not sure how real and how well researched the 'decoy effect' is.

So you would expect Claude Opus 3 to be harder to interpret than Claude Sonnet 3.5?

My intuition is that larger models of the same capability would exhibit less superposition and thus be easier to interpret?

> The news is good, and there are now seven shows in my tier 1

@Zvi Which are the other shows in your tier 1?

LASR (https://www.lasrlabs.org/) is giving an £11,000 stipend for a 13-week program; assuming 40h/week, that works out to ~$27/hour.

3 Ryan Kidd
Updated figure with LASR Labs and Pivotal Research Fellowship at current exchange rate of 1 GBP = 1.292 USD.
2 Ryan Kidd
That seems like a reasonable stipend for LASR. I don't think they cover housing, however.
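
For anyone who wants to check the ~$27 LASR figure, here is a rough sketch of the arithmetic. It assumes 13 weeks at 40 hours/week and borrows the 1 GBP = 1.292 USD rate quoted in the reply above; the original comment does not say which rate it used, so treat the result as approximate. The function name is just illustrative.

```python
# Rough check of the LASR hourly stipend estimate.
# Assumptions (not all stated in the original comment):
#   - 13 weeks at 40 hours/week
#   - 1 GBP = 1.292 USD, the rate quoted in the reply
STIPEND_GBP = 11_000
WEEKS = 13
HOURS_PER_WEEK = 40
GBP_TO_USD = 1.292


def hourly_rate_usd(stipend_gbp, weeks, hours_per_week, gbp_to_usd):
    """Convert a lump-sum stipend in GBP into an hourly rate in USD."""
    total_hours = weeks * hours_per_week
    return stipend_gbp / total_hours * gbp_to_usd


rate = hourly_rate_usd(STIPEND_GBP, WEEKS, HOURS_PER_WEEK, GBP_TO_USD)
print(f"~${rate:.2f}/hour")
```

This only covers the stipend itself; as noted above, housing may not be included.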

It's difficult to produce a counterargument when the main argument is 'trust me bro'

The main argument seems to be (except for the all caps BUY STONKS NOW):

> It is just very obvious that the market is erring

This is not obvious to me

Thank you for doing this!

A few random questions; of course, feel free to say as much or as little as you want:

Have you personally been to the Gaza Strip (or the West Bank)? If yes, what are your impressions? How do people live there? Is it common for regular (non-military) Jewish people to hang around those places? How common is it for Palestinians to hang around outside of Gaza and the West Bank? How common is it for Jewish and Palestinian people to mix in everyday life and interact?

I have just recently learned about the Gilad Shalit prisoner exchange (a single ...

3 Yovel Rom
I was actually born in Netzarim in the Gaza Strip, but my parents left when I was one month old for job reasons. I hadn't been there again until 2005, when The Separation happened, all Jews were transferred out of the Strip, and Jewish entry was prohibited, after which I couldn't visit anymore.
There are 150,000 Palestinians from the West Bank who work in Israel, and around 15,000 from the Gaza Strip who used to do the same until last Saturday. Israelis and Israeli Arabs mix all the time (for instance, I have some Israeli Arab friends), Palestinians less so. Access to and from the Gaza Strip is very restricted, and in the West Bank the settlements are separate. I lived for a few years in the West Bank, and I talked to Palestinians when I met them while hiking and so on. My dad and I actually saved a Palestinian goat that fell into a water canal during one of our hikes. It's unrelated to your question, just a funny anecdote I remembered while writing this.
We shouldn't have paid so much, but I don't know if it had a long-term strategic impact. We definitely gave a lot of talent back to Hamas, who used it well.
Basically zero.
My theory would rely on classified information, so I won't give it, sorry. I imagine there will be some kind of public commission of inquiry after the war. I'll try to remember to post a link to its conclusions for you when that happens.