All of Curt Tigges's Comments + Replies

I'm not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible and something that decision-makers find worth acting on.

My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:

  1. Internals/mech interp researchers will, as they have been so far at least in model organisms, find examples of concerning cognition tha
...
6ryan_greenblatt
I'm not claiming that internals-based techniques aren't useful, just that internals-based techniques probably aren't that useful for specifically producing legible evidence of misalignment. Detecting misalignment with internals-based techniques could be useful for other reasons (which I list in the post) and internals-based techniques could be used for applications other than detecting misalignment (e.g. better understanding some misaligned behavior). If internals-based techniques are useful for further investigating misalignment, that seems good. And I think I agree that if we first find legible evidence of misalignment behaviorally and internals-based methods pick this up (without known false positives), then this will make future evidence with internals-based techniques more convincing. However, I think it might not end up being that much more convincing in practice unless this happens many times with misalignment which occurs in production models.
2Yash Shirsath
Agreed. A concrete example of this might be: multi-agent systems trained to communicate via chain-of-thought. Or even simpler, RL pressure inducing steganography. We already have toy examples of this. In cases like these, "fancy internals-based methods" might be the only way to flag steganography. These can then be reproduced in sandbox environments to create legible examples.
Answer by Curt Tigges30

I have perhaps 1000-1500 hours of meditation experience and have done a decent amount of psychedelics as well. I don't think meditation has given me any understanding of the hard problem of consciousness. Meditation has helped me to see different possibilities in terms of content, shape, and phenomena within the conscious space, and perhaps helped me to understand the shape of it better, but I don't really see it helping much to bridge the scientific/philosophical gap. Best I can say is that "yeah, it sort of feels like what Epistemic Depth Theory and prob...

I use Freedom and Limit on my computer and Stay Focused on my Android phone. The former two allow for a combination of complete blocking during certain time windows and time limits (for any website, even across browsers and even if you open an incognito window). The latter does both for my phone.

I block all social media and content during prime working hours and implement a 30-minute limit outside of that. It works pretty well. I may make it more strict because I sometimes find myself looking at Twitter, etc. when watching a TV show in the eve...

I find this argument quite compelling, and this is also why I find the idea of "AI girl/boyfriends" largely uninteresting. Without actual connection to another mind (that has experiences and phenomenal consciousness), any of these things--art, deep conversations about thoughts/feelings, what have you--eventually falls flat. (That includes one-way connection through art.)

I quite enjoyed reading this. Very evocative.

Welcome to San Francisco.

1Recurrented
thank you :) 

Domain: Software engineering, mech interp

Bryce Meyer (primary maintainer of TransformerLens, and a software engineer with many years of experience) runs a weekly coding stream where he does live coding on TransformerLens--resolving bugs, adding features and tests, etc. I've found it to be useful!

You can find it in the Open Source Mechanistic Interpretability Slack, under the "code-sessions" channel (feel free to DM for an invite).

Great post, but there is one part I'd like to push back on:

Iterators are also easier to identify, both by their resumes and demonstrated skills. If you compare two CVs of postdocs that have spent the same amount of time in academia, and one of them has substantially more papers (or GitHub commits) to their name than the other (controlling for quality), you’ve found the better Iterator. Similarly, if you compare two CodeSignal tests with the same score but different completion times, the one completed more quickly belongs to the stronger Iterator.

This seems...

2Ryan Kidd
Yeah, I basically agree with this nuance. MATS really doesn't want to overanchor on CodeSignal tests or publication count in scholar selection.

Perhaps more important than these details: How do you curate input to take notes on, and what is the purpose you take the notes for? How do you use the notes once written? (This latter point seems to be one of the biggest reasons many people have dropped PKM systems.)

Very kind of you to say. :) I think for me, though, the source of the emotion I felt when reading this series was something like: "Ah, so in addition to ensuring we are dateable ourselves, we must fix society, capitalism (at least the dating part of it), culture, etc. in order to have a Good Dating Universe." Which in retrospect was a bit overblown of me, so I think I no longer endorse the strong version of what I said in that comment.

I think this list may successfully convince some to stay off the dating market indefinitely. Who in the world has time to work on all of this? At best, this is just a massive set of to-dos; at worst, it's an enormous list of all the ways the dating world sucks and reasons why you'll fail. 

Upon reflection: This is a good collection of information, even if it is rather discouraging to read. May we all find exceptions to the unfortunate trends that seem to characterize the modern dating landscape.

2romeostevensit
People should focus way more on things that make them better partners, because those things make you a healthier, more rounded person, and way less on idiosyncratic dating market dynamics imo. When you climb the health hill you meet others also climbing the health hill. When you climb fake hills you meet others climbing fake hills.
5Vanessa Kosoy
FWIW, from glancing at your LinkedIn profile, you seem very dateable :)

I actually went through the same process as what you describe here, but it didn't remove my "transhumanist" label. I was a big fan of Humanity+, excited about human upgrading, etc. etc. I then became disillusioned about progress in the relevant fields, started to understand nonduality and the lack of a persistent or independent self, and realized AI was the only critical thing that actually was in the process of happening.

In that sense, my process was similar but I still consider myself a transhumanist. Why? Because for me, solving death or trying to make ...

Yes, tuned lens is an excellent tool and generally superior to the original logit lens. In this particular case, however, I don't think it would show very different results (and in any case the logit lens is only a small part of the analysis). Still, I think it would be interesting to have some kind of integration with TransformerLens that enabled training and using a tuned lens as well.
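
For readers unfamiliar with the two tools, here is a minimal toy sketch of how they relate. It is not the TransformerLens API or the analysis discussed here. It assumes the usual description of the logit lens (decode an intermediate residual vector with the model's final normalization and unembedding) and of the tuned lens (apply a learned per-layer affine "translator" before that same decoding). All names, sizes, and weights below are hypothetical, and the layer norm omits learned scale and shift for brevity.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(matrix, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def logit_lens(resid, unembed):
    """Decode an intermediate residual vector with the final norm and unembedding."""
    return matvec(unembed, layer_norm(resid))


def tuned_lens(resid, translator_w, translator_b, unembed):
    """Apply a per-layer affine translator, then decode as the logit lens does."""
    translated = [t + b for t, b in zip(matvec(translator_w, resid), translator_b)]
    return logit_lens(translated, unembed)


# Toy setup: sizes and random weights are made up purely for illustration.
random.seed(0)
d_model, vocab = 4, 6
unembed = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]
resid = [random.gauss(0, 1) for _ in range(d_model)]

# With an identity translator and zero bias, the tuned lens reduces to the logit lens.
identity = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]
zero_bias = [0.0] * d_model

logit_out = logit_lens(resid, unembed)
tuned_out = tuned_lens(resid, identity, zero_bias, unembed)
```

In a real tuned lens the translator for each layer is trained rather than fixed, so the identity case above only shows the structural relationship between the two lenses, not the benefit of training.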

This is a cool idea, and I have no doubt it helped somewhat, but IMO it falls prey to the same mistake I see made by the makers of almost every video series/online course/list of resources for ML math: assuming that math is mostly about concepts and facts.

It's only about 5% that. Maybe less. I and many others in ML have seen the same videos and remembered the concepts for a while too. And forgotten them, in time. More than once! On the other hand, I've seen how persistently and operationally fluent (especially in ML and interpretability) people become when...