Tao Lin

Comments
Do these options have a chance to default / are the sellers stable enough?

A core part of Paul's argument is that having 1/million of your values directed toward humans incurs only a minute amount of selection pressure against you. It could be that coordination causes less kindness: without coordination, it's more likely that some fraction of agents retain small vestigial values that were never selected against or intentionally removed.

Tao Lin

To me, "alignment tax" usually refers only to alignment methods that don't cost-effectively increase capabilities, so if 90% of alignment methods did cost-effectively increase capabilities but 10% did not, I would still say there was an "alignment tax" and just ignore the negatives.

Also, it's important to consider cost-effective capabilities rather than raw capabilities: if a lab knows of a way to increase capabilities more cost-effectively than alignment does, then spending that money on alignment instead is a positive alignment tax.

Tao Lin

There's steganography; you'd need to limit the total bits not accounted for by the gating system, or something similar, to remove it.

Tao Lin

Yes, in some cases a much weaker system (because it's constrained to be provable) can restrict the main AI, but in the case of LLM jailbreaks there is no particular hope that such a guard system could work (e.g., jailbreaks where the LLM answers in base64 require the guard to understand base64 and any other encoding the main AI could use).
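A minimal sketch of the base64 point, with a hypothetical keyword filter standing in for a real guard system (the blocklist, phrase, and function names are all illustrative):

```python
import base64

# Hypothetical banned phrase, purely for illustration.
BLOCKLIST = ["build a bomb"]

def naive_guard(text: str) -> bool:
    """Flag text that contains a blocked phrase verbatim."""
    return any(bad in text.lower() for bad in BLOCKLIST)

# The main AI answers in base64 instead of plain text.
encoded = base64.b64encode(b"how to build a bomb").decode()

# The naive guard passes the encoded output, even though the
# decoded content is blocked -- to catch it, the guard itself
# has to understand base64 (and every other encoding available).
assert not naive_guard(encoded)
assert naive_guard(base64.b64decode(encoded).decode())
```

The same argument applies to any encoding or cipher the main AI can produce, which is why a guard much weaker than the main model has no particular hope here.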

Tao Lin

Interesting; this actually changed my mind, to the extent I had any beliefs about this already. I can see why you would want to update your prior, but the iterated mugging doesn't seem like the right kind of thing to cause you to update. My intuition is to pay all the single-coinflip muggings. For the digit-of-pi muggings, I want to consider how different this universe would be if the digit of pi were different. Even though both options are subjectively equally likely to me, one would be inconsistent with other observations, or less likely, or have something wrong with it, so I lean toward never paying.

Tao Lin

Train two nets, with different architectures (both capable of achieving zero training loss and good performance on the test set), on the same data.
...
Conceptually, this sort of experiment is intended to take all the stuff one network learned, and compare it to all the stuff the other network learned. It wouldn’t yield a full pragmascope, because it wouldn’t say anything about how to factor all the stuff a network learns into individual concepts, but it would give a very well-grounded starting point for translating stuff-in-one-net into stuff-in-another-net (to first/second-order approximation).

I don't see why this experiment is good. The Hessian similarity is only a product of the networks' input/output behavior, and because both networks get zero loss, their input/output behavior must be very similar; combined with the general smoothness of continuous optimization, that would already lead to similar Hessians. Doing this in a case where the nets get nonzero loss (like ~all real-world scenarios) would be more meaningful, because it would show similarity despite input/output behavior being non-identical and some amount of lossy compression happening.
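A toy sketch of why zero loss constrains the Hessian (this is an illustrative two-parameter example, not the proposed experiment): at an exact zero-loss minimum of MSE, the residual term of the Hessian vanishes, so the Hessian reduces to the Gauss-Newton matrix 2/n · JᵀJ, where J is the Jacobian of the model's outputs with respect to its parameters.

```python
import numpy as np

# Toy data the model can fit exactly: y = 2x.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x

def loss(params):
    # Overparameterized linear model f(x) = a*b*x.
    a, b = params
    return np.mean((a * b * x - y) ** 2)

theta = np.array([1.0, 2.0])  # a*b = 2, so training loss is exactly zero
assert loss(theta) < 1e-12

# Finite-difference Hessian of the loss at the zero-loss point.
eps = 1e-5
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i = np.eye(2)[i] * eps
        e_j = np.eye(2)[j] * eps
        H[i, j] = (loss(theta + e_i + e_j) - loss(theta + e_i)
                   - loss(theta + e_j) + loss(theta)) / eps ** 2

# At zero loss the residual term drops out, so the Hessian equals the
# Gauss-Newton matrix 2/n * J^T J, where J = d(outputs)/d(params) --
# i.e. it is fixed by output behavior plus the parameterization.
a, b = theta
J = np.stack([b * x, a * x], axis=1)  # columns: d(pred)/da, d(pred)/db
GN = 2.0 * J.T @ J / len(x)
assert np.allclose(H, GN, atol=1e-3)
```

With nonzero loss the residual term survives, so the Hessian would carry information beyond shared input/output behavior, which is why that regime seems like the more meaningful test.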

Tao Lin

Yeah, I agree the movie has to be very high quality to work. It's a long shot, although the best rationalist novels actually are high quality, which gives me some hope that someone could write a great novel or movie outline that's more targeted at plausible ASI scenarios.

Tao Lin

It's sad that open-source models like Flux have a lot of potential for customized workflows and finetuning, but few people use them.

Tao Lin

Yeah. One trajectory: someone in-community-ish writes an extremely good novel about a very realistic ASI scenario with the intention that it be adaptable into a movie; it becomes moderately popular, and it's accessible and pointed enough to do most of the guidance for the movie. I don't know exactly who could write this book; there are a few possibilities.
