I don't understand why Chollet thinks the smart child and the mediocre child are doing categorically different things. Why can't the mediocre child be GPT-4, and the smart child GPT-6? The analogies Chollet and others draw to explain away the success of deep learning seem to me sufficient to explain what the human brain does too, and it's not clear a different category of mind will or can ever exist (I'm not asserting that it can't, just that Chollet's distinction is not evidenced).
Chollet points to real shortcomings of modern deep learning systems, but these are often exacerbated by factors not directly relevant to problem-solving ability, such as tokenization, so I tend to weigh them less heavily than I estimate he does.
That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones.
This is why the neural redshift paper says something different from SLT. It says neural nets that generalize well don't just have a simplicity bias; they have a bias for functions with complexity similar to the target function's. This calls mesaoptimization into question, because although mesaoptimization is favored by a simplicity bias, it is not necessarily favored by a bias toward the same complexity as the target function.
I think the predictions SLT makes are different from the results in the neural redshift paper. For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this? Maybe you meant that SLT predicts that good generalization occurs when an architecture's preferred complexity matches the target function's complexity?
The explanation you give sounds like a different claim, however.
If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count
This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.
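For what it's worth, here is a rough sketch (my own, not from SLT or the paper) of how one might probe the "large region implementing the same behaviour" claim numerically: pick a random small network, perturb its weights along random unit directions, and count how many directions leave its outputs on a fixed batch essentially unchanged. The architecture, perturbation size, and threshold below are arbitrary choices for illustration.

```python
# Rough numerical probe (my own sketch) of how many weight-space directions
# around a random point leave a small MLP's behaviour essentially unchanged.
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    W1, b1, W2, b2 = params
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def init_params(d_in=2, width=32, d_out=1):
    return [rng.normal(0, 1 / np.sqrt(d_in), (d_in, width)), np.zeros(width),
            rng.normal(0, 1 / np.sqrt(width), (width, d_out)), np.zeros(d_out)]

def perturb(params, direction, eps):
    return [p + eps * d for p, d in zip(params, direction)]

params = init_params()
x = rng.normal(size=(256, 2))            # fixed batch of inputs
base = mlp(params, x)

eps, threshold, flat = 1e-2, 1e-3, 0     # arbitrary illustrative values
n_directions = 200
for _ in range(n_directions):
    direction = [rng.normal(size=p.shape) for p in params]
    norm = np.sqrt(sum((d ** 2).sum() for d in direction))
    direction = [d / norm for d in direction]   # unit direction in weight space
    delta = mlp(perturb(params, direction, eps), x) - base
    if np.abs(delta).mean() < threshold:
        flat += 1

print(f"{flat}/{n_directions} random directions left behaviour ~unchanged")
```

A probe like this can't settle the disagreement, but it makes clear that the SLT-style claim is about degeneracy of random points in weight space, while the neural redshift claim is about which architectures prefer which complexities.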
We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn't work.
We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.
So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don't generalise out of distribution.
The paper doesn't just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
"Neural Redshift: Random Networks are not Random Functions" shows that even randomly initialized neural nets tend to be simple functions (measured by frequency, polynomial order, and compressibility), and that this bias can be partially attributed to ReLUs. Previous speculation about simplicity biases focused mostly on SGD, but SGD is now clearly not the only contributor.
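As a rough illustration of the kind of measurement involved (this is my own minimal sketch, not the authors' code): sample randomly initialized MLPs with ReLU vs tanh activations, evaluate them on a 1D grid, and compare how much spectral energy sits above some frequency cutoff. Under a frequency-based complexity measure, less high-frequency energy means a simpler function.

```python
# Minimal sketch (my own, not the paper's code): compare the frequency content
# of randomly initialized ReLU vs tanh MLPs on a 1D input grid.
import numpy as np

rng = np.random.default_rng(0)

def random_mlp(x, activation, width=256, depth=4):
    """Evaluate a randomly initialized MLP on a column of 1D inputs."""
    h, fan_in = x, 1
    for _ in range(depth):
        W = rng.normal(0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        h = activation(h @ W)
        fan_in = width
    W_out = rng.normal(0, np.sqrt(2.0 / fan_in), size=(fan_in, 1))
    return h @ W_out

def high_freq_fraction(y, cutoff=10):
    """Fraction of spectral energy above an (arbitrary) frequency cutoff."""
    spectrum = np.abs(np.fft.rfft(y.ravel() - y.mean())) ** 2
    return spectrum[cutoff:].sum() / spectrum.sum()

x = np.linspace(-1, 1, 1024).reshape(-1, 1)
relu = lambda z: np.maximum(z, 0)

relu_scores = [high_freq_fraction(random_mlp(x, relu)) for _ in range(20)]
tanh_scores = [high_freq_fraction(random_mlp(x, np.tanh)) for _ in range(20)]
print("ReLU mean high-frequency fraction:", np.mean(relu_scores))
print("tanh mean high-frequency fraction:", np.mean(tanh_scores))
```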
The authors propose that good generalization occurs when an architecture's preferred complexity matches the target function's complexity. We should think about how compatible this is with our projections for how future neural nets might behave. For example: if this proposition were true and a significant determinant of generalization ability, would this make mesaoptimization less likely? More likely?
As an aside: research on inductive biases could be very impactful. My impression is that far fewer resources are spent studying inductive biases than interpretability, but inductive bias research could be feasible on small compute budgets and could tell us a lot about what to expect as we scale neural nets.
I dropped out one month ago. I don't know anyone else who has dropped out. My comment recommends students consider dropping out on the grounds that it seemed like the right decision for me, but it took me a while to realize this was even a choice.
So far my experience has been pleasant. I am ~twice as productive, and the total time available to me is ~2.5-3x what I had before. The excess time lets me get a healthy amount of sleep and play videogames without sacrificing my most productive hours. I would make the decision again, and earlier if I could.
More people should consider dropping out of high school, particularly if they:
In most places, compulsory schooling ends at an age below the typical graduation age, so past that point you are not legally obligated to attend. Many continue because it's normal, but even a brief analysis could reveal that graduating is not worth the investment for you.
Some common objections I heard:
Why finish?
The mistake in this objection is thinking there was a single reason I wanted to leave school. I was increasing my free time, not making a bet on a particular technology.
In some cases this is true. But if you demonstrate long-term commitment, you might be surprised by your ability to get financial support.
Leaving high school is not the right decision for everyone, but many students won't even consider it. At least make the option available to yourself.
This should be an equivalent problem, yes.
Yes, your description of my hypothetical is correct. I think it's plausible that approximating things that happened in the past is computationally easier than breaking some encryption, especially if the information about the past is valuable even if it's noisy. I strongly doubt my hypothetical will materialize, but I think it is an interesting problem regardless.
My concern with approaches like the one you suggest is that they're restricted to small parts of the universe, so with enough data it might be possible to fill in the gaps.
Present cryptography becomes redundant when the past can be approximated. Simulating the universe at an earlier point and running it forward to extract information before it was encrypted is a basic but difficult way to do this. For some information the approximation could even be fuzzy and still cause damage if made public. How can you protect information when your adversary can simulate the past?
The information must never exist as plaintext in the past. A bad way to ensure this is to make the information future-contingent: perhaps it could be acausally inserted into the past by future agents, but you probably could not act on future-contingent information in useful ways. A better method is to run many homomorphically encrypted instances of a random function that might output programs doing computations that yield sensitive information (e.g., an uploaded human). You would then publish a plaintext description of the random function, including a proof that it probably output a program doing computations a likely adversary would want. This incentivizes the adversary not to destroy the program the random function output, because destroying it and replacing it with something that is certainly doing better computations may not be worth the cost. A toy version of that incentive calculation is sketched after the desiderata below.
This method satisfies the following desiderata:
1. The adversary does not know the output of the encrypted random function, or the outputs of the program the random function output
2. There is an incentive to not destroy the program output by the random function
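To make the incentive argument concrete, here is a toy expected-value calculation. Every number is made up for illustration, and nothing here is real cryptography: the adversary only sees the published distribution of the random function and compares keeping the encrypted blob running against destroying and replacing it.

```python
# Toy expected-value sketch of the incentive argument above. All numbers are
# invented for illustration; this is not a security analysis.

p_valuable = 0.6        # published probability the random function output a
                        # program the adversary itself values (e.g. useful compute)
value_if_valuable = 1.0 # adversary's utility if the blob is useful to it
value_if_not = 0.0      # utility if the blob is only doing our protected computation
replacement_value = 1.0 # utility of the computation the adversary would run instead
destruction_cost = 0.5  # cost of destroying the blob and provisioning a replacement

ev_keep = p_valuable * value_if_valuable + (1 - p_valuable) * value_if_not
ev_destroy = replacement_value - destruction_cost

print(f"expected utility of keeping the blob:     {ev_keep:.2f}")
print(f"expected utility of destroying/replacing: {ev_destroy:.2f}")
print("adversary prefers to keep it" if ev_keep >= ev_destroy
      else "adversary prefers to destroy it")
```

On these made-up numbers the adversary keeps the blob; the protection only holds while the published distribution keeps the expected value of keeping above the expected value of destroying.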
One problem with this is that your adversary might be superintelligent, and might prove incorrect the assumptions that made your encryption appear strong. To avoid this, you could base your cryptography on something other than computational hardness.
My first thought was to require computations that would make an adversary incur massive negative utility when verifying the output of a random function. It's hard to predict an adversary's preferences in advance, so the punishment for verifying the output would need to be generically bad, such as forcing the adversary to expend massive amounts of computation on useless problems. This idea is bad for obvious reasons, and would probably end up resting on the same or equally bad assumptions about the inseparability of the punishment and the verification.
I don't think the point of RLHF ever was value alignment, and I doubt this is what Paul Christiano and others intended RLHF to solve. RLHF might be useful in worlds without capabilities and deception discontinuities (plausibly ours), because there we are less worried about sudden ARA (autonomous replication and adaptation) and more interested in getting useful behavior from models before we go out with a whimper.
This theory of change isn't perfect. There is an argument that RLHF was net-negative, and this argument has been had.
My point is that you are assessing RLHF using your model of AI risk, so the disagreement here might actually be unrelated to RLHF and dissolve if you and the RLHF progenitors shared a common position on AI risk.