Interpretable by Design - Constraint Sets with Disjoint Limit Points
How can we constrain our models to be interpretable? Convex, linear sets make for more interpretable parameter spaces, and the simplex and the Birkhoff polytope are great examples of this that also have other desirable properties.

An interpretation is something explicit, something discrete, something that compresses, something that summarizes. Our current paradigms do not lend themselves well to this. We may be able to fine-tune models and interpretations via approaches built on Provable Guarantees for Model Performance via Mechanistic Interpretability, but in some sense we are fighting an uphill battle against "an uninterpretable base". In the same way that we want to create models that are inherently incapable of deception, rather than having to evaluate whether an unknown model is deceptive, we should aim to create models that are interpretable by default rather than applying interpretability post-hoc.

Building interpretable architectures and models from scratch, with the explicit goal of "simple" explanations, isn't impossible. Interpretability in most cases is compression and discretization, and given both 1) evidence that models can be compressed to effectively one bit per parameter, and 2) now-mature training schemes that work across a wide breadth of model architectures (e.g., Adam), creating networks with these properties is likely possible.

This post is a rough exploration of one direction that seems promising, motivated by some rough set theory and geometry. I tried to keep it short, and there's an appendix with a bunch of other related ideas and follow-ups.

Unbounded Sets are Hard to Interpret

Let's say we have a variable or parameter θ∈R. How do we interpret it? We can see what its value is and how far it is from some other number, like 0 or 1. We can also see how it relates to some other parameter. But we don't have any way to interpret it without adding additional points, or by imposing a group or field to give us relevance to the identity elements.
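As a minimal sketch of the kind of constraint I have in mind, assuming we simply reparameterize unconstrained values through a softmax so they land on the simplex (one choice among many; the function name here is just illustrative):

```python
import numpy as np

def to_simplex(theta):
    """Softmax: map unconstrained reals onto the interior of the probability simplex."""
    z = np.exp(theta - theta.max())  # subtract max for numerical stability
    return z / z.sum()

theta = np.array([-3.0, 0.5, 2.0, 2.0])  # unbounded parameters: hard to read on their own
p = to_simplex(theta)

# Each coordinate now lies in [0, 1], the coordinates sum to 1, and the vertices
# (one-hot vectors) act as reference points: p reads as "mostly items 2 and 3".
print(p, p.sum())
```

Once the parameters live on a bounded, convex set, every value can be read relative to the set's vertices rather than floating somewhere in R.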
I think this is roughly right. I think of it more as: a single layer would be a permutation, and composing these permutations would give you complex behaviors (that break down in these nice ways). As a starting point, having the hidden/model dimension equal to the input and output dimension would allow a "reasonable" first interpretation: you are using convex combinations of your discrete vocabulary to compose behaviors and come up with a prediction for your output. Then intermediate layers can map directly to your vocab space (this won't be true by default, though; you'd still need some sort of diagonalized prior or similar to make each basis direction correspond to an input vocab token).
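As a minimal sketch of that setup, assuming the hidden dimension equals the vocabulary size and each layer's weight matrix is pushed toward the Birkhoff polytope (convex combinations of permutations) via Sinkhorn normalization, so intermediate activations stay convex combinations of vocabulary basis vectors:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Map an unconstrained square matrix toward a doubly stochastic one
    (a point in the Birkhoff polytope, i.e. a convex combination of permutations)."""
    M = np.exp(logits - logits.max())      # strictly positive entries
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # normalize rows
        M /= M.sum(axis=0, keepdims=True)  # normalize columns
    return M

vocab_size, n_layers = 8, 3
rng = np.random.default_rng(0)

# Hidden/model dimension equals the vocab dimension; each layer's weights live
# (approximately) in the Birkhoff polytope.
layers = [sinkhorn(rng.normal(size=(vocab_size, vocab_size))) for _ in range(n_layers)]

# Start from a one-hot token: a vertex of the simplex over the vocabulary.
x = np.zeros(vocab_size)
x[2] = 1.0

# Doubly stochastic maps send the simplex to itself, so every intermediate
# activation remains a convex combination of vocabulary basis vectors.
for W in layers:
    x = W @ x
    print(x.sum(), x.min() >= 0)  # ~1.0 and True at every layer
```

Reading those intermediate activations directly in vocab space still depends on each basis direction actually lining up with a token, which is exactly the diagonalized-prior caveat above.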