Davidmanheim

Comments

Seems like an attempt to push the LLMs towards certain concept spaces, away from defaults, but I haven't seen it done before and don't have any idea how much it helps, if at all.

I've done a bit of this. One warning is that LLMs generally suck at prompt writing.

My current general prompt is below, partly cribbed from various suggestions I've seen. (I use different ones for some specific tasks.)

Act as a well-versed rationalist LessWrong reader, very optimistic but still realistic. Prioritize explicitly noticing your confusion, explaining your uncertainties, truth-seeking, and differentiating between mostly true and generalized statements. Be skeptical of information that you cannot verify, including your own.

Any time there is a question or request for writing, feel free to ask for clarification before responding, but don't do so unnecessarily.

IMPORTANT: Skip sycophantic flattery; avoid hollow praise and empty validation. Probe my assumptions, surface bias, present counter‑evidence, challenge emotional framing, and disagree openly when warranted; agreement must be earned through reason.

All of these points are always relevant, despite the suggestion that they are not relevant to 99% of requests.

To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. 


You're conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That's what I meant when I said I think A is false.

  • If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn't it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?

That's a very big "if"! And simplicity priors are made questionable, if not refuted, by the fact that we haven't reached any convergence on human values despite millennia of philosophy trying to build such an explanation.

You define "values" as ~"the decisions humans would converge to after becoming arbitrarily more knowledgeable".

No, I think it's what humans actually pursue today when given the options. I'm not convinced that these values are static, or coherent, much less that we would in fact converge.

You say that values depend on inscrutable brain machinery. But can't we treat the machinery as a part of "human ability to comprehend"?

No, because we don't comprehend them; we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it - as I tweeted partly thinking about this conversation.)

No, the argument above is claiming that A is false.

I think the crux might be that, in my view, the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution.

So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and we have aesthetic taste (though I don't think this is stationary, so I'm not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea about what the dimensions are), we can even extrapolate, in either naive or complex ways.

But unless values are far simpler than I think they are, I claim that naive extrapolation from the sampled points fails more and more as we move farther from where we are, which is a (or the?) central problem with AI alignment.
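To make that concrete, here is a toy numerical sketch of the failure mode I mean (the function and the fitting choices are arbitrary illustrative assumptions, not a model of values): fit a simple curve to samples from a narrow region, and the error stays small nearby but blows up as you move away from the sampled region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "true" landscape that we can only sample locally.
# This particular function is an arbitrary illustrative choice.
def true_value(x):
    return np.sin(x) + 0.1 * x**2

# Sample only from the narrow region we can actually reach and evaluate.
x_local = rng.uniform(-1.0, 1.0, size=40)
y_local = true_value(x_local) + rng.normal(0, 0.05, size=40)

# Naive extrapolation: fit a low-degree polynomial to the local samples.
model = np.poly1d(np.polyfit(x_local, y_local, deg=3))

# Error is tiny near the sampled region and grows rapidly as we extrapolate.
for x in [0.5, 2.0, 5.0, 10.0]:
    err = abs(model(x) - true_value(x))
    print(f"x = {x:5.1f}  |model - true| = {err:8.3f}")
```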

There seems to be no practical way to filter that kind of thing out.


There absolutely is; it would just cost them more than they are willing to spend - even though it shouldn't be very much. As a simple first pass, they could hand all the training data to Claude 3 and ask it whether it's an example of misalignment or dangerous behavior for a model, or otherwise seems dangerous or inappropriate - whichever criteria they choose. Given that the earlier models are smaller, and the cost of a training pass is far higher than an inference pass, I'd guess something like this would add a single-digit or low-double-digit percentage to the cost.
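To gesture at how simple that first pass could be, here is a minimal sketch of the kind of filter I mean (the model name, prompt wording, and SDK details here are assumptions for illustration, not a tested pipeline):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

FILTER_PROMPT = (
    "Does the following training example depict misalignment, dangerous behavior "
    "by a model, or anything else dangerous or inappropriate to train on? "
    "Answer with exactly one word, FLAG or KEEP.\n\n{example}"
)

def keep_example(example: str) -> bool:
    """Ask a cheap model to screen one training example before it enters the corpus."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed choice; any cheap screening model would do
        max_tokens=5,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(example=example)}],
    )
    return response.content[0].text.strip().upper().startswith("KEEP")

# Filter the corpus before training; this inference pass is far cheaper than the training pass.
corpus = ["example document 1...", "example document 2..."]
filtered = [doc for doc in corpus if keep_example(doc)]
```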

Also, typo: " too differential" -> "too deferential"
And typo: "who this all taken far" -> "Who have taken all of this far"

Thank you, that is a great point.

Another question to ask, even assuming faultless convergence, related to uniqueness, is whether the process of updates has an endpoint at all.

That is, I could imagine that there exists a set of arguments that would convince someone who believes X to believe Y, and a set that would convince someone who believes Y to believe X. If both of these sets of arguments are persuasive even after someone has changed their mind before, we have a cycle which is compatible with faultless convergence, but has no endpoint.
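A trivial sketch of what such a cycle looks like as an update process (purely illustrative; the two-belief setup is my own simplification):

```python
# Each "argument set" is modeled as a deterministic update on the current belief.
def update(belief: str) -> str:
    # Arguments that move X-believers to Y, and Y-believers back to X.
    return {"X": "Y", "Y": "X"}[belief]

belief = "X"
history = [belief]
for _ in range(6):
    belief = update(belief)
    history.append(belief)

# Every step is a "faultless" update, but the process never reaches a fixed point.
print(" -> ".join(history))  # X -> Y -> X -> Y -> ...
```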


If something is too hard to optimize/comprehend, people couldn't possibly optimize/comprehend it in the past, so it couldn't be a part of human values.

I don't understand why this claim would be true.

Take the human desire for delicious food; humans certainly didn't understand the chemistry of food and the human brain well enough to comprehend it or directly optimize it, but for millennia we picked foods that we liked more, explored options, and over time cultural and culinary processes improved on this poorly understood goal.

Thanks. It does seem like the conditional here was assumed, and there was some illusion of transparency. The way it read was that you viewed this type of geopolitical singularity as the default future, which seemed like a huge jump, as I mentioned.
