ryan_greenblatt

I work at Redwood Research.

Comments

Compute for doing inference on the weights if you don't have LoRA fine-tuning set up properly.

My implicit claim is that there maybe isn't that much fine-tuning infrastructure internally.

I get it if you're worried about leaks, but I don't get how it could be a hard engineering problem — just share API access early, with fine-tuning.

Fine-tuning access can be extremely expensive if implemented naively, and it's plausible that cheap (LoRA) fine-tuning isn't even implemented internally for new models for a while at AI labs. If you make the third-party groups pay for it, then I suppose this isn't a problem, but the costs could be vast.
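To give a sense of the asymmetry (a minimal sketch of my own, not a claim about how any lab actually implements this): naive full fine-tuning needs gradients and optimizer state for every weight, while a LoRA-style setup only trains small low-rank adapters on top of frozen weights.

```python
# Minimal illustration of why LoRA-style fine-tuning is cheap relative to
# naive full fine-tuning: only small low-rank adapter matrices get gradients
# and optimizer state, while the base weights stay frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer with a trainable low-rank (LoRA-style) update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the full weight matrix stays frozen
        # Trainable low-rank update: effective weight is W + B @ A
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

full = nn.Linear(4096, 4096)
lora = LoRALinear(nn.Linear(4096, 4096), rank=8)

trainable = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(trainable(full), trainable(lora))  # ~16.8M vs ~66K trainable params per layer
```

The same few-hundred-fold gap shows up in gradient and optimizer memory, which is most of what makes naive fine-tuning access expensive to offer.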

I agree that inference access is cheap.

Thanks! I feel dumb for missing that section. Interesting that this is so different from random.

Have you compared this method (which, based on my understanding, finds vectors that change downstream activations as much as possible) with just using random vectors? (I didn't see this in the post, but I might have just missed it.)

In particular, does that yield qualitatively similar results?

Naively, I would expect the results to be qualitatively similar for random vectors of some norm. So, I'd be interested in some ablations of the technique.

If random vectors work, that would simplify the story somewhat: you can see salient and qualitatively distinct behaviors via randomly perturbing activations.

(Probably random vectors have to have a somewhat higher norm to yield results qualitatively as large as those from vectors optimized for changing downstream activations. However, I currently don't see a particular a priori (non-empirical) reason to think that there isn't some norm at which the results are similar.)
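For concreteness, here's a minimal sketch of the ablation I have in mind, assuming a standard PyTorch/transformers-style setup; `model`, `tokenizer`, `layer_idx`, `prompt`, and the optimized vector `v_opt` are placeholders for whatever the post's method produces, not its actual code:

```python
# Sketch of the ablation: compare an optimized perturbation vector against a
# norm-matched random vector by adding each to a layer's activations via a
# forward hook and looking at the resulting generations.
import torch

def steer_with(model, tokenizer, layer_module, vector, prompt, max_new_tokens=64):
    """Add `vector` to the given layer's output on every forward pass and generate."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + vector.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = layer_module.register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tokenizer.decode(out[0], skip_special_tokens=True)

# v_opt: a vector optimized to change downstream activations (placeholder).
# v_rand: a random direction rescaled to the same norm; worth sweeping the scale.
v_rand = torch.randn_like(v_opt)
v_rand = v_rand / v_rand.norm() * v_opt.norm()

for name, v in [("optimized", v_opt), ("random (norm-matched)", v_rand)]:
    print(name, steer_with(model, tokenizer, model.model.layers[layer_idx], v, prompt))
```

If the norm-matched random vector produces qualitatively similar behavioral shifts at some scale, that would support the simpler story that random perturbations already surface salient, distinct behaviors.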

I'm also worried about unaligned AIs as competitors to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs "out there" that can be manipulated/taken over via acausal means; unaligned AIs could compete with us (and with others whose values are better from our perspective) in the race to manipulate them.

This seems like a reasonable concern.

My general view is that it seems implausible that much of the value from our perspective comes from extorting other civilizations.

It seems unlikely to me that >5% of the usable resources (weighted by how much we care) are extorted. I would guess that marginal gains from trade are bigger (10% of the value of our universe?). (I think the units work out such that these percentages can be directly compared as long as our universe isn't particularly well suited to extortion rather than trade, or vice versa.) Thus, competition over who gets to extort these resources seems less important than gains from trade.

I'm wildly uncertain about both marginal gains from trade and the fraction of resources that are extorted.

Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse.

Naively, acausal influence should be in proportion to how much others care about what a lightcone-controlling civilization does with our resources. So, being a small fraction of the value hits both sides of the equation equally (direct value and acausal value).
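A rough way to formalize this proportionality (the notation here is mine, not from the original argument): let \(f\) be our lightcone's fraction of total multiverse measure, \(u\) the per-resource value to whoever controls our lightcone, and \(w\) the per-resource weight other civilizations place on what's done with it. Then

\[
\text{direct value} \propto f\,u, \qquad \text{acausal value} \propto f\,w, \qquad \frac{\text{acausal value}}{\text{direct value}} = \frac{w}{u},
\]

and the ratio is independent of \(f\): being small scales both terms equally.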

Of course, civilizations elsewhere might care relatively more about what happens in our universe than whoever controls it does. (E.g., their measure puts much higher relative weight on our universe than the measure of whoever controls our universe does.) This can imply that acausal trade is extremely important from a value perspective, but this is unrelated to being "small" and seems better described as large gains from trade due to different preferences over different universes.

(Of course, it does need to be the case that our measure is small relative to the total measure for acausal trade to matter much. But surely this is true?)

Overall, my guess is that it's reasonably likely that acausal trade is indeed where most of the value/disvalue comes from due to very different preferences of different civilizations. But, being small doesn't seem to have much to do with it.

(Surely cryonics doesn't matter given a realistic action space? Usage of cryonics is extremely rare, and I don't think there are plausible (cheap) mechanisms to increase uptake to >1% of the population. I agree that simulation arguments and similar considerations maybe imply that "helping current humans" is either incoherent or unimportant.)

But I do think, intuitively, GPT-5-MAIA might, e.g., make 'catching AIs red-handed' using methods like those in this comment significantly easier/cheaper/more scalable.

Notably, the mainline approach for catching doesn't involve any internals usage at all, let alone labeling a bunch of internals.

I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
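As a concrete (hypothetical) example of the kind of input/output experiment I mean, here's a sketch that ablates pieces of the suspicious input and checks how often the suspicious behavior recurs; `query_model` and `looks_suspicious` are made-up placeholders, not real APIs:

```python
# Sketch of an input/output experiment for investigating a suspicious action:
# drop each piece of the original input in turn and check whether the
# suspicious behavior persists, to localize what triggered it.
# `query_model` and `looks_suspicious` are hypothetical placeholders.

def localize_trigger(query_model, looks_suspicious, segments, n_samples=8):
    """For each input segment, remove it, resample the model's output several
    times, and record how often the suspicious behavior reappears."""
    results = {}
    for i in range(len(segments)):
        ablated = segments[:i] + segments[i + 1:]
        prompt = "\n".join(ablated)
        hits = sum(looks_suspicious(query_model(prompt)) for _ in range(n_samples))
        results[segments[i]] = hits / n_samples
    return results  # segments whose removal drives the rate near zero are candidate triggers
```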
