In that post, you say that you have a graph of  vertices with a particular structure. In that scenario, where is that structured graph of  vertices coming from? Presumably there's some way you know the graph looks like this

[image: a sparse, highly structured graph]

rather than looking like this

[image: an arbitrary, unstructured graph]

If you know that your graph is a nice sparse graph that has lots of symmetries, you can take advantage of those properties to skip redundant parts of the computation. For example, when each of your  nodes has at most 100 inbound edges and 100 outbound edges, you only have on the order of a trillion distinct nodes, if we consider e.g. (0, 0, 0, ..., 0, 0, 1) to be identical to (0, 0, 0, ..., 0, 1, 0, ..., 0, 0, 0).
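
To make the symmetry trick concrete, here is a minimal sketch, assuming node labels are fixed-length tuples and that the relevant symmetry is permutation of coordinates (as the example of treating (0, ..., 0, 1) and (0, ..., 1, ..., 0) as identical suggests); the function names are placeholders:

```python
from functools import lru_cache

def canonical(label):
    """Map every permutation-equivalent label to the same representative."""
    return tuple(sorted(label))

def count_distinct_nodes(labels):
    """Count distinct nodes once permutation-equivalent labels are merged."""
    return len({canonical(label) for label in labels})

@lru_cache(maxsize=None)
def process_class(representative):
    # Stand-in for the expensive per-node computation; thanks to the cache it
    # runs once per equivalence class rather than once per node.
    return sum(representative)

def process_node(label):
    return process_class(canonical(label))
```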

It's probably worth looking at the process which is generating this graph, and figuring out if we can translate the output of that process directly to a coordinate in our 100-dimensional space without going through the "translate the output to a graph, and then embed that graph" intermediate step.

I think the probability of getting the exact continuation "a a a a a ..." is genuinely higher than the probability of getting the exact continuation "little girl who was born with a very special gift...", though getting a continuation in the class of "a a a a a..." is much less probable than getting a continuation in the class of "little girl who was born with a very special gift...", because the latter class has a much larger possibility space than the former. So there might be 1e4 different low-entropy length-32 completions with an average probability of 1e-10 each, and 9.999999e15 different high-entropy length-32 completions with an average probability of 1e-16. This adds up to normality, in that if you were to randomly sample this distribution, you'd get a weird low-entropy output one time in a million, and a normal high-entropy output the other 999999 times in a million. But if you try to do something along the lines of "take the best K outputs and train the model on those", you'll end up with almost entirely weird low-entropy outputs.
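
As a quick check of the arithmetic above (the counts and probabilities are the illustrative figures from this paragraph, not measurements):

```python
n_low, p_low = 1e4, 1e-10            # low-entropy completions, average probability each
n_high, p_high = 9.999999e15, 1e-16  # high-entropy completions, average probability each

total = n_low * p_low + n_high * p_high
print(total)                  # ~1.0: the two classes cover essentially the whole distribution
print(n_low * p_low / total)  # ~1e-6: a random sample is a weird low-entropy output one time in a million
print(p_low / p_high)         # 1e6: but each individual low-entropy completion is far more probable
```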

But yeah, I think I misunderstood your proposal as something along the lines of "take the k most probable n-token outputs" rather than "take the k% most probable n-token outputs" or "randomly sample a bunch of n-token outputs".

  • And I think one way to create a 2-token reasoner is to generate all plausible completions of 2 tokens, and then propagate the joint loss of the log-probs of those two tokens.

I think this just doesn't work very well, because it incentivizes the model to output a token which makes subsequent tokens easier to predict, as long as the benefit in predictability of the subsequent token(s) outweighs the cost of the first token. Concretely, let's say you have the input "Once upon a time, there was a" and you want 32 tokens. Right now, davinci-002 will spit out something like [" little"," girl"," who"," was"," born"," with"," a"," very"," special"," gift","."," She"," could"," see"," things"," that"," others"," could"," not","."," She"," could"," see"," the"," future",","," and"," she"," could"," see"," the"," past"], with logprobs of [-2.44, -0.96, -0.90, ..., -0.28, -0.66, -0.26], summing to -35.3. But if instead it returned [" a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"," a"], it would have logprobs like [-9.32, -7.77, -1.51, ..., -0.06, -0.05, -0.05], summing to -23.5. And indeed, if you could somehow ask a couple quadrillion people "please write a story starting with Once upon a time, there was a", I suspect that at least 1 in a million people would answer with low-entropy completions along the lines of "a a a a ..." (and there just aren't that many low-entropy completions). But "Once upon a time there was a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a" is not a very good completion, despite being a much higher-probability completion.
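
If you want to check this sort of claim yourself, here is a rough sketch of scoring a continuation by its summed per-token logprob. It uses GPT-2 via Hugging Face transformers as a stand-in, since davinci-002 logprobs are only available through the OpenAI API, so the exact numbers will differ from the ones quoted above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def summed_logprob(prompt: str, continuation: str) -> float:
    """Sum of log p(token | preceding tokens) over the continuation's tokens."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab_size)
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(prompt_len, full_ids.shape[1]):
        # The token at position i is predicted by the logits at position i - 1.
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = "Once upon a time, there was a"
print(summed_logprob(prompt, " little girl who was born with a very special gift."))
print(summed_logprob(prompt, " a" * 32))
```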

You could use a more sophisticated loss function than "sum of individual-token logprobs", but I think that road leads towards PPO (nothing says that your criterion has to be "helpful/harmless/honest as judged by a human rater", though).

Do you need to store information about each object? If so, do you need to do so before or after the operation?

If you need to store information about each object before processing (let's say 1 bit per object, for simplicity), the Landauer limit says you need  of mass to store that information (at the current cosmic microwave background temperature of 2.7 K). That's about a factor of  more than the mass of the observable universe, so the current universe could not even store  bits of information for you to perform your operation on in the first place.
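
For concreteness, here is the rough arithmetic as a sketch. The 2.7 K figure is from the paragraph above; the ~1.5e53 kg mass of the observable universe is my own round number, so treat the outputs as order-of-magnitude estimates:

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 2.7              # cosmic microwave background temperature, K
c = 2.998e8          # speed of light, m/s
M_universe = 1.5e53  # rough mass of the observable universe, kg (assumed)

energy_per_bit = k_B * T * math.log(2)  # Landauer limit: minimum energy per bit
bits_per_kg = c**2 / energy_per_bit     # bits per kilogram of mass-energy
max_bits = M_universe * bits_per_kg     # upper bound using the whole universe

print(f"{energy_per_bit:.2e} J per bit")  # ~2.6e-23 J
print(f"{max_bits:.2e} bits")             # ~5e92 bits
```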

I think if you're willing to use all the mass in the universe and wait a trillion billion years or so for the universe to cool off, you might be able to store one bit of output per operation for  operations, assuming you can do some sort of clever reversible computing thing to make the operations themselves approximately free.

Is there some specific computation you are thinking of that is useful if you can do it  times but not useful if you can only do it  times? 

  • Ideally one would want to be able to compute the logical correlation without having to run the program.

I think this isn't possible in the general case. Consider two programs, one of which is "compute the sha256 digest of all 30-byte sequences and halt if the result is 9a56f6b41455314ff1973c72046b0821a56ca879e9d95628d390f8b560a4d803" and the other of which is "compute the md5 digest of all 30-byte sequences and halt if the result is 055a787f2fb4d00c9faf4dd34a233320".
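
Spelled out as Python (using the digests quoted above), the two programs look something like the following; each loop is over 256**30 candidates, so neither will come close to finishing in practice:

```python
import hashlib
from itertools import product

SHA256_TARGET = "9a56f6b41455314ff1973c72046b0821a56ca879e9d95628d390f8b560a4d803"
MD5_TARGET = "055a787f2fb4d00c9faf4dd34a233320"

def program_1():
    # Enumerate every 30-byte sequence and halt as soon as one hashes to the target.
    for candidate in product(range(256), repeat=30):
        if hashlib.sha256(bytes(candidate)).hexdigest() == SHA256_TARGET:
            return

def program_2():
    for candidate in product(range(256), repeat=30):
        if hashlib.md5(bytes(candidate)).hexdigest() == MD5_TARGET:
            return
```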

Any method that was able to compute the logical correlation between those would also be a program which, at a minimum, reverses all cryptographic hash functions.

  • Your POS system exports data that your inventory software imports and uses.

But I strongly suspect that this is often not possible in practice.

This sounds like exactly the sort of problem that a business might pay for a solution to, particularly if there is one particular pair of POS system / inventory software that is widely used in the industry in question, where those pieces of software don't natively play well together.

The other baseline would be to compare one L1-trained SAE against another L1-trained SAE -- if you see a similar approximate "1/10 have cossim > 0.9, 1/3 have cossim > 0.8, 1/2 have cossim > 0.7" pattern, that's not definitive proof that both approaches find "the same kind of features", but it would strongly suggest as much, at least to me.

  • With that in mind, the real hot possibility is the inverse of what Shai and his coresearchers did. Rather than start with a toy model with some known nice latents, start with a net trained on real-world data, and go look for self-similar sets of activations in order to figure out what latent variables the net models its environment as containing. The symmetries of the set would tell us something about how the net updates its distributions over latents in response to inputs and time passing, which in turn would inform how the net models the latents as relating to its inputs, which in turn would inform which real-world structures those latents represent.

Along these lines, I wonder whether you get similar scaling laws by training on these kinds of hidden Markov processes as you do by training on real-world data, and, if so, whether there is some simple relationship between the underlying structure generating the data and the coefficients of those scaling laws. That might be informative for the question of what level of complexity you should expect in the self-similar activation sets in real-world LLMs. And if the scaling laws are very different, that would also be interesting.
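
A minimal sketch of what the data-generation side of that experiment could look like, assuming a small discrete hidden Markov process; the state counts and the Dirichlet-sampled transition/emission matrices are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_tokens = 3, 5
transition = rng.dirichlet(np.ones(n_states), size=n_states)  # P(next state | state)
emission = rng.dirichlet(np.ones(n_tokens), size=n_states)    # P(token | state)

def sample_sequence(length):
    """Sample one token sequence from the hidden Markov process."""
    state, tokens = 0, []
    for _ in range(length):
        tokens.append(int(rng.choice(n_tokens, p=emission[state])))
        state = int(rng.choice(n_states, p=transition[state]))
    return tokens

# e.g. a corpus of length-64 sequences to train differently sized models on
dataset = [sample_sequence(64) for _ in range(100_000)]
```

One could then fit the usual loss-versus-compute curves to models trained on corpora like this and compare the fitted coefficients to those reported for natural text.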

This is really cool!

  • I did some tests on random features for interpretability, and found them to be interpretable. However, one would need to do a detailed comparison with SAEs trained on an L1 penalty to properly understand whether this loss function impacts interpretability. For what it’s worth, the distribution of feature sparsities suggests that we should expect reasonably interpretable features.

One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. "900 of the 2048 features detected by the  -trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model"). I'd also be interested to see individual examinations of some of the features which consistently appear across multiple training runs in the -trained model but don't appear in an L1-trained SAE on the training dataset.
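
Here is a minimal sketch of that cheap-and-lazy check, assuming you have the decoder feature directions of both SAEs as (n_features, d_model) arrays; the variable names (including the W_dec attributes in the usage comment) are placeholders rather than anything from the post:

```python
import numpy as np

def fraction_matched(decoder_a, decoder_b, threshold=0.9):
    """Fraction of features in A whose best cosine match in B exceeds the threshold."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    cos_sim = a @ b.T                 # (n_a, n_b) pairwise cosine similarities
    best_match = cos_sim.max(axis=1)  # best counterpart in B for each feature in A
    return float((best_match > threshold).mean())

# e.g. fraction_matched(new_sae.W_dec, l1_sae.W_dec, threshold=0.9)
```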

Not just "some robots or nanomachines" but "enough robots or nanomachines to maintain existing chip fabs, and also the supply chains (e.g. for ultra-pure water and silicon) which feed into those chip fabs, or make its own high-performance computing hardware".

If useful self-replicating nanotech is easy to construct, this is obviously not that big of an ask. But if that's a load-bearing part of your risk model, I think it's important to be explicit about that.
