J Bostock

Comments

A paper I'm doing mech interp on used a random split when the dataset they used already has a non-random canonical split. They also validated on their test data (the dataset has a three-way split) and used the original BERT architecture (sinusoidal embeddings which are added to the feedforward input, post-norming, no μP) in a paper that came out in 2024. The training batch size is so small it could be 4xed and still fit on my 16GB GPU. People trying to get into ML from the science end have got no idea what they're doing. It was published in Bioinformatics.

sellers auction several very similar lots in quick succession and then never auction again

This is also extremely common in biochem datasets. You'll get results in groups of very similar molecules, and families of very similar protein structures. If you do a random train/test split your model will look very good but actually just be picking up on coarse features.
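For illustration, here is a minimal sketch of a grouped split (the `family_id` column and the synthetic data are hypothetical; the point is just that closely related items should never be split across train and test):

```python
# Minimal sketch of a grouped train/test split, so that similar molecules /
# members of the same protein family never appear on both sides of the split.
# `family_id` is a hypothetical label for scaffold or family membership.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))             # stand-in features
y = rng.integers(0, 2, size=1000)           # stand-in labels
family_id = rng.integers(0, 50, size=1000)  # 50 families of similar items

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=family_id))

# No family appears in both sets, so a model can't look good just by
# memorising coarse family-level features.
assert set(family_id[train_idx]).isdisjoint(family_id[test_idx])
```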

I think the LessWrong community and particularly the LessWrong elites are probably too skilled for these games. We need a harder game. After checking the diplomatic channel as a civilian I was pretty convinced that there were going to be no nukes fired, and I ignored the rest of the game based on that. I also think the answer "don't nuke them" is too deeply-engrained in our collective psyche for a literal Petrov Day ritual to work like this. It's fun as a practice of ritually-not-destroying-the-world though.

Isn't Les Mis set around the June Rebellion of 1832 (the novel opens in 1815), not the Revolution that led to the Reign of Terror (which was in the 1790s)?

I have an old hypothesis about this which I might finally get to see tested. The idea is that the feedforward networks of a transformer create little attractor basins. The reasoning is twofold. First, the QK circuit only passes very limited information to the OV circuit about what information is present in other streams, which introduces noise into the residual stream during attention layers. Second, I guess that basins might also arise from inferring concepts from limited information:

Consider that the prompts "The German physicist with the wacky hair is called" and "General relativity was first laid out by" will both lead to "Albert Einstein". They will likely land in different parts of the same attractor basin, which will then converge.

You can measure which parts of the network are doing the compression using differential optimization, in which we take d[OUTPUT]/d[INPUT] as normal, and compare to d[OUTPUT]/d[INPUT] when the activations of part of the network are "frozen". Moving from one region to another you'd see a positive value while in one basin, a large negative value at the border, and then another positive value in the next region.
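A minimal sketch of what that comparison could look like (PyTorch; the toy residual-MLP stack and names like `Block` and `input_grad` are mine, and I'm reading "frozen" as a stop-gradient on one sub-layer's output):

```python
# Compare d[OUTPUT]/d[INPUT] with and without "freezing" one MLP block,
# by detaching its output so no gradient flows through it (the residual
# path is untouched). The model here is an illustrative toy, not a real
# transformer.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ln = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(self.ln(x))  # residual MLP sub-layer only

d = 16
model = nn.Sequential(Block(d), Block(d), nn.Linear(d, 1))

def input_grad(model, x, freeze_module=None):
    """Gradient of the scalar output w.r.t. the input, optionally with one
    module's activations frozen (treated as constants)."""
    handle = None
    if freeze_module is not None:
        # Forward hook that detaches the module's output from the graph.
        handle = freeze_module.register_forward_hook(lambda m, inp, out: out.detach())
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    if handle is not None:
        handle.remove()
    return x.grad

x = torch.randn(1, d)
g_full = input_grad(model, x)
g_frozen = input_grad(model, x, freeze_module=model[0].mlp)
# How much of the input-output sensitivity runs through block 0's MLP:
print((g_full - g_frozen).norm() / g_full.norm())
```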

Yeah, I agree we need improvement. I don't know how many people it's important to reach, but I'm willing to believe you that this will hit maybe 10%. I expect that 10% to be people with above-average impact on the future, but I don't know what percentage of people is enough.

90% is an extremely ambitious goal. I would be surprised if 90% of the population can be reliably convinced by logical arguments in general.

I've posted it there. I had to use a linkpost, because I didn't have an existing account there, you can't crosspost without 100 karma (presumably to prevent spam), and you can't funge LW karma for EAF karma.

Only after seeing the headline success vs test-time-compute figure did I bother to check it against my best estimates of how this sort of thing should scale. If we assume:

  1. A set of questions of increasing difficulty (in this case 100), such that:
  2. The probability of getting question $n$ correct on a given "run" is an s-curve like $p_n = \frac{1}{1 + e^{k(n - n_0)}}$ for constants $k$ and $n_0$
  3. The model does $N$ "runs"
  4. If any are correct, the model finds the correct answer 100% of the time
  5. $N = 1$ gives a score of 20/100

Then, depending on $k$ ($n_0$ is uniquely defined by $k$ in this case), we get the following chance of success vs question difficulty rank curves:

Higher values of $k$ make it look like a sharper "cutoff", i.e. more questions are correct ~100% of the time, but more are also wrong ~100% of the time. Lower values of $k$ make the curve less sharp, so the easier questions are gotten wrong more often, and the harder questions are gotten right more often.

Which gives the following best-of-$N$ sample curves, which are roughly linear in $\log(N)$ in the region between 20/100 and 80/100. The smaller the value of $k$, the steeper the curve.
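Here is a quick numerical sketch of this toy model (the symbols $k$, $n_0$, $N$ follow my reconstruction above, and the specific $k$ values are arbitrary illustrations rather than the ones in the plots):

```python
# Toy best-of-N model: question n is solved on one run with probability
# p_n = 1 / (1 + exp(k * (n - n_0))), and with N runs it is solved if any
# run succeeds. n_0 is tuned so that N = 1 scores 20/100.
import numpy as np

n = np.arange(1, 101)  # question difficulty ranks 1..100

def expected_score(k, n0, N):
    p = 1.0 / (1.0 + np.exp(k * (n - n0)))
    return np.sum(1.0 - (1.0 - p) ** N)

def calibrate_n0(k, target=20.0):
    # Bisect on n_0 so that a single run (N = 1) gives the target score.
    lo, hi = -200.0, 300.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_score(k, mid, 1) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for k in (0.05, 0.2, 1.0):  # arbitrary illustrative sharpness values
    n0 = calibrate_n0(k)
    scores = [round(expected_score(k, n0, N), 1) for N in (1, 10, 100, 1000)]
    print(f"k={k}: n0={n0:.1f}, scores at N=1,10,100,1000: {scores}")
# Smaller k gives a steeper, roughly log-linear score-vs-N curve between
# ~20/100 and ~80/100.
```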

Since the headline figure spans around 2 orders of magnitude of compute, the model appears to be performing on AIME similarly to best-of-N sampling for one of the smaller values of $k$.

If we allow the model to split the task up into $m$ subtasks (assuming this creates no overhead and each subtask's solution can be verified independently and accurately), then we get a steeper gradient, roughly proportional to $m$, and a small amount of curvature.
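One way to make this concrete (my own construction, not necessarily the model behind the original plots): treat a question with per-run success probability $p$ as $m$ independent subtasks, each with per-run probability $p^{1/m}$, split the $N$-run budget evenly, and require all $m$ subtasks to be solved.

```python
# Rough m-subtask variant of the toy model above (one guess at how to
# formalize "splitting the task into m subtasks with no overhead").
def solve_prob_split(p, N, m):
    per_subtask = 1.0 - (1.0 - p ** (1.0 / m)) ** (N / m)
    return per_subtask ** m

p = 0.01  # a hard question: 1% chance of success per unsplit run
for N in (10, 100, 1000):
    unsplit = 1.0 - (1.0 - p) ** N
    print(N, round(unsplit, 3), round(solve_prob_split(p, N, m=4), 3))
# The split version's success probability rises much faster with N,
# i.e. a steeper slope against log(N).
```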

Of course this is unreasonable, since this requires correctly identifying the shortest path to success with independently-verifiable subtasks. In reality, we would expect the model to use extra compute on dead-end subtasks (for example, when doing a mathematical proof, verifying a correct statement which doesn't actually get you closer to the answer, or when coding, correctly writing a function which is not actually useful for the final product) so performance scaling from breaking up a task will almost certainly be a bit worse than this.

Whether or not the model is literally doing best-of-N sampling at inference time (probably it's doing something at least a bit more complex), it seems to scale similarly to best-of-N under these conditions.

Overall it looked a lot like the other arguments, so that's a bit of a blow to the model under which e.g. we can communicate somewhat adequately, 'arguments' are more compelling than random noise, and this can be recognized by the public.

Did you just ask people "how compelling did you find this argument?" Because this is a pretty good argument that AI will contribute to music production. I would rate it highly on compellingness, just not as a compelling argument for X-risk.

I was surprised by the "expert opinion" case causing people to lower their P(doom), but then I saw that the argument itself tells people that experts put P(doom) at around 5%. If most people give a number > 5% (as in the open response and slider cases), then of course they're going to update downwards on average!

I would be interested to see what effect a specific expert's opinion (e.g. Geoffrey Hinton, Yoshua Bengio, Elon Musk, or Yann LeCun as a negative control) would have, given that those individuals have more extreme P(doom)s.

My update on the choice of measurement is that "convincingness" is effectively meaningless.

I think the values of update probability are likely to be meaningful. The top two arguments are very similar, as they both play off of humans misusing AI (which I also find to be the most compelling argument when talking to individuals). Then there is a cluster of arguments about how powerful AI is or could be and how it could compete with people.
