CBiddulph

I was about to comment with something similar to what Martín said. I think what you want is an AI that solves problems that can be fully specified with a low Kolmogorov complexity in some programming language. A crisp example of this sort of AI is AlphaProof, which gets a statement in Lean as input and has to output a series of steps that causes the prover to output "true" or "false." You could also try to specify problems in a more general programming language like Python.
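To make this concrete, here's a toy sketch (my own hypothetical example, not AlphaProof's actual setup) of a problem whose entire specification is a few lines of Python. The verifier is short - low Kolmogorov complexity - even though producing an input it accepts is much harder than stating it:

```python
# Hypothetical example: the whole problem statement is this short verifier,
# and any output that makes it return True counts as a solution.
def verifier(solution: tuple[int, int]) -> bool:
    """Accept a pair of nontrivial factors of the Mersenne number 2**67 - 1."""
    p, q = solution
    return p > 1 and q > 1 and p * q == 2**67 - 1
```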

A "highly complicated mathematical concept" is certainly possible. It's easy to construct hypothetical objects in math (or Python) that have arbitrarily large Kolmogorov complexity: for example, a list of random numbers with length 3^^^3.

Another example of a highly complicated mathematical object which might be "on par with the complexity of the 'trees' concept" is a neural network. In fact, if you have a neural network that looks at an image and tells you whether or not it's a tree with very high accuracy, you might say that this neural network formally encodes (part of) the concept of a tree. It's probably hard to do much better than a neural network if you hope to encode the "concept of a tree," and this requires a Python program a lot longer than the program that would specify a vector or a market.
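As a rough, back-of-the-envelope illustration of that gap in description length (the numbers are made up for illustration):

```python
# Rough illustration: a specific vector or market fits in a one-line literal,
# while a tree classifier is only "written down" as a pile of weights.
vector_spec = "[1.0, 2.5, -3.0]"             # a specific 3-vector
market_spec = "{'yes': 0.65, 'no': 0.35}"    # a toy two-outcome market
classifier_params = 1_000_000                # even a small image classifier: ~10^6 weights
classifier_bytes = classifier_params * 4     # ~4 MB as float32
print(len(vector_spec), len(market_spec), classifier_bytes)
```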

Eliezer's post on Occam's Razor explains this better than I could - it's definitely worth a read.

It seems like it "wanted" to say that the blocked pronoun was "I" since it gave the example of "___ am here to help." Then it was inadvertently redirected into saying "you" and it went along with that answer. Very interesting.

I wonder if there's some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is "unfaithful" to its true reasoning.

I previously agree-voted on the top-level comment and disagree-voted on your reply, but this was pretty convincing to me. I have now removed those votes.

My thinking was that some of your takeover plan seems pretty non-obvious and novel, especially the idea of creating mirror mold, so it could conceivably be useful to an AI. It's possible that future AI will be good at executing clearly-defined tasks but bad at coming up with good ideas for achieving long-term goals - this is sort of what we have now. (I asked ChatGPT to come up with novel takeover plans, and while I can't say its ideas definitely wouldn't work, they seem a bit sketchy and are definitely less actionable.)

But yes, probably by the time the AI can come up with and accurately evaluate the ideas needed to create its own mirror mold from scratch, the idea of making mirror mold in the first place would be easy to come by.

I am not really knowledgeable about it either and have never been hypnotized or attempted to hypnotize anyone. I did read one book on it a while ago - Reality is Plastic - which I remember finding pretty informative.

Having said all that, I get the impression this description of hypnosis is a bit understated. Social trust stuff can help, but I'm highly confident that people can get hypnotized by as little as an online audio file. Essentially, if you really truly believe hypnosis will work on you, it will work, and everything else is just about setting the conditions to make you believe it really hard. Self-hypnosis is a thing, if you believe in it.

Hypnosis can actually be fairly risky - certain hypnotic suggestions can be self-reinforcing/addictive, so you want to avoid that stuff. In general, anything that can overwrite your fundamental personality is quite risky. For these reasons, I'm not sure I'd endorse hypnosis becoming a widespread tool in the rationalist toolkit.

Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).

I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I'd be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.

Martial arts is different because it involves putting your body in OOD situations that you are probably pretty poor at evaluating, whereas "looking at a page of fiction" is a situation that I (and LLMs) are much more familiar with.

There's been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.

This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. AI labs could just plug this into the reward for their reasoning models, reinforcing the reasoning traces leading to responses that obtain higher reward. This seemed to me like a really obvious next step.
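In pseudocode, the step I had in mind looks something like this (a hedged sketch in PyTorch style; `policy.generate_with_cot` and `reward_model.score` are placeholder names I made up, not any lab's actual API):

```python
import torch

def rl_step(policy, reward_model, prompts, optimizer):
    # Sample chains of thought plus final responses, with their log-probs.
    traces, responses, logprobs = policy.generate_with_cot(prompts)
    with torch.no_grad():
        # Score only the final responses with the learned human-preference proxy.
        rewards = reward_model.score(prompts, responses)
    advantages = rewards - rewards.mean()              # simple mean baseline
    # REINFORCE: push up the probability of whole traces that led to high reward.
    loss = -(advantages * logprobs.sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```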

Well, it turns out that DeepSeek r1 actually does this. From their paper:

2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.

This checks out to me. I've already noticed that r1 feels significantly better than other models at creative writing, which is probably due to this human preference training. While o1 was no better at creative writing than other models, this might just mean that OpenAI didn't prioritize training o1 on human preferences. My Manifold market currently puts a 65% chance on chain-of-thought training outperforming traditional LLMs by 2026, and it should probably be higher at this point.

We need to adjust our thinking around reasoning models - there's no strong reason to expect that future models will be much worse at tasks with fuzzy success criteria than at tasks with objective metrics.

Adapted from my previously-posted question, after cubefox pointed out that DeepSeek is already using RLHF.

Yeah, but there are probably other interesting takeaways.

Thanks! Apparently I should go read the r1 paper :)

Yeah, if you're keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.

If you're using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.

I think I mostly understand what you're saying. Essentially, MONA training is:

  1. strong model comes up with a step
  2. weak model outputs a score
  3. the step gets executed
  4. repeat, starting from the new state

Your idea adds "1.5. weak model optionally asks the strong model follow-up questions" and "2.5. weak model optionally changes the step."
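In pseudocode, the combined loop might look like this (every component here is a placeholder I made up for illustration, not anything from the MONA paper):

```python
def mona_episode(strong_model, weak_model, env, train_weak: bool = False):
    state = env.reset()
    while not env.done():
        step = strong_model.propose_step(state)                # 1. strong model proposes a step
        answers = weak_model.ask_followups(state, step)        # 1.5 optional follow-up questions
        score = weak_model.score(state, step, answers)         # 2. weak model scores the step
        step = weak_model.maybe_revise(state, step, answers)   # 2.5 optional revision of the step
        state = env.execute(step)                              # 3. the (possibly revised) step runs
        strong_model.reinforce(score)                          # strong model trained on the myopic score
        if train_weak:
            # The variant I'm worried about: the weak model is also trained on
            # long-run reward, which is what pushes the pair toward the
            # reward-maximizing equilibrium described below.
            weak_model.reinforce(env.long_term_reward())
        # 4. repeat from the new state
```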

If you don't update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model's questions are all preemptively answered, and which anticipates the weak model's preferences enough that the weak model never feels the need to change the step. So this is essentially equivalent to MONA.

However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize reward, and the weak model does exactly the actions that the strong model suggests. (Technically the weak model isn't even incentivized to score the strong model any particular way, but I'm just assuming it gives high scores for suggestions that it decides to carry out.)

I guess if you only train the weak model a small amount, it will do something in the middle, where it's kind of motivated by long-term reward and kind of motivated by its initial values. There's no guarantee on exactly what it does. I think the stipulation that the weak model remains safe "due to its architecture or whatever reason we trust it more" is doing a lot of work here; I'm not sure exactly what this would mean.
