Mark Xu

I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about

Sequences

Intermittent Distllations
Training Regime

Wiki Contributions

Comments

Mark Xu3mo20

It's important to distinguish between:

  • the strategy of "copy P2's strategy" is a good strategy
  • because P2 had a good strategy, there exists a good strategy for P1

Strategy stealing assumption isn't saying that copying strategies is a good strategy, it's saying the possibility of copying means that there exists a strategy P1 can take that is just a good as P2.

Mark Xu3mo122

You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn't the particle that's randomly perturbed.

My guess is that a random 1 angstrom perturbation is enough so that p0's location after 20s is ~uniform. This question seems easier to answer, and I wouldn't really be surprised if the answer is no?

Here's a really rough estimate: This says 10^{10} s^{-1} per collision, so 3s after start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more collisions, each of which add's ~1 angstrom of uncertainty to p0. 1 angstrom is 10^{-10}m, so the total uncertainty is on the order of 10m, which means it's probably uniform? This actually came out closer than I thought it would be, so now I'm less certain that it's uniform.

This is a slightly different question than the total # of particles on each side, but it becomes intuitively much harder to answer # of particles if you have to make your prediction via higher order effects, which will probably be smaller.

Mark Xu9mo40

The bounty is still active. (I work at ARC)

Mark Xu9mo30

Humans going about their business without regard for plants and animals has historically not been that great for a lot of them.

Mark Xu10moΩ470

Here are some things I think you can do:

  • Train a model to be really dumb unless I prepend a random secret string. The goverment doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

  • I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.

  • I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

Mark Xu10moΩ240

You have to specify your backdoor defense before the attacker picks which input to backdoor.

I think Luke told your mushroom story to me. Defs not a coincidence.

If you observe 2 pieces of evidence, you have to condition the 2nd on seeing the 1st to avoid double-counting evidence

Mark Xu1y177

A human given finite time to think also only performs O(1) computation, and thus cannot "solve computationally hard problems".

Load More