An epistemic status is a statement of how confident the writer/speaker is in what they are saying, and why; see, e.g., this post about the use of epistemic status on LessWrong. Google's definition of epistemic is "relating to knowledge or to the degree of its validation".
This branch of research is aimed at finding a (nearly) objective way of thinking about the universe. When I imagine the end result, I imagine something that receives a distribution across a bunch of data, and finds a bunch of useful patterns within it. At the moment that looks like finding patterns in data via find_natural_latent(get_chunks_of_data(data_distribution))
or perhaps showing that
find_top_n(n,
           ((chunks, natural_latent(chunks)) for chunks in all_chunked_subsets_of_data(data_distribution)),
           key=lambda pair: usefulness_metric(pair[1]))
is a (convergent sub)goal of agents. As such, the notion that the donuts' data is simply poorly chunked - which needs to be solved anyway - makes a lot of sense to me.
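To pin down the shape of that second snippet, here is a toy, runnable sketch. Every helper in it (the chunking, natural_latent, usefulness_metric) is a hypothetical placeholder I made up for illustration, not a real implementation of natural latent discovery:

```python
import heapq
from itertools import combinations

# Toy stand-ins for the hypothetical helpers above; real versions would
# implement natural-latent discovery rather than these placeholders.

def all_chunked_subsets_of_data(data):
    """Yield candidate chunkings: here, just every way of splitting the
    variables into two groups of column indices (illustrative only)."""
    indices = list(range(len(data[0])))
    for k in range(1, len(indices)):
        for left in combinations(indices, k):
            right = tuple(i for i in indices if i not in left)
            yield (left, right)

def natural_latent(chunks):
    """Placeholder: a real version would return a latent that (approximately)
    mediates and is redundantly encoded across the chunks."""
    return {"chunks": chunks}

def usefulness_metric(latent):
    """Placeholder scoring; e.g. how much downstream prediction the latent buys."""
    return len(latent["chunks"][0])  # arbitrary toy score

def find_top_n(n, pairs, key):
    """Keep the n highest-scoring (chunks, latent) pairs."""
    return heapq.nlargest(n, pairs, key=key)

if __name__ == "__main__":
    data_distribution = [(0, 1, 1), (1, 1, 0), (0, 0, 1)]  # toy samples of (X1, X2, X3)
    top = find_top_n(
        3,
        ((chunks, natural_latent(chunks))
         for chunks in all_chunked_subsets_of_data(data_distribution)),
        key=lambda pair: usefulness_metric(pair[1]),
    )
    for chunks, latent in top:
        print(chunks, usefulness_metric(latent))
```

The real work is of course in making natural_latent and usefulness_metric mean something; the sketch only pins down the overall shape of the search.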
I don't know how to think about the possibilities when it comes to decomposing the X_i. Why would it always be possible to decompose random variables to allow for a natural latent? Do you have an easy example of this?
Also, what do you mean by mutual information between the X_i, given that there are at least 3 of them? And why would just extracting said mutual information be useless?
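Part of my confusion: I know of at least two standard ways to extend mutual information to three or more variables, and they behave quite differently, so I'm not sure which (if either) you have in mind:
I(X_1; X_2; X_3) = I(X_1; X_2) - I(X_1; X_2 | X_3)   (interaction information)
C(X_1, X_2, X_3) = H(X_1) + H(X_2) + H(X_3) - H(X_1, X_2, X_3)   (total correlation)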
If you get the chance to point me towards good resources about any of these questions, that would be great.
Let's say every day at the office, we get three boxes of donuts, numbered 1, 2, and 3. I grab a donut from each box, plunk them down on napkins helpfully labeled X1, X2, and X3. The donuts vary in two aspects: size (big or small) and flavor (vanilla or chocolate). Across all boxes, the ratio of big to small donuts remains consistent. However, Boxes 1 and 2 share the same vanilla-to-chocolate ratio, which is different from that of Box 3.
Does the correlation between X1 and X2 imply that there is no natural latent? Is this the desired behavior of natural latents, despite the presence of the common size ratio? (and the commonality that I've only ever pulled out donuts; there has never been a tennis ball in any of the boxes!)
If so, why is this what we want from natural latents? If not, how does a natural latent arise despite the internal correlation?
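To check that I'm picturing the setup right, here is a toy simulation of how I understand it. The key assumption is mine, not something you said: boxes 1 and 2's shared vanilla-to-chocolate ratio is unknown and varies from day to day, which is what would make X1 and X2 correlated across days.

```python
import random

# Toy simulation of the donut setup. Assumption (mine, not stated above):
# the shared vanilla/chocolate ratio of boxes 1 and 2 is unknown and varies
# day to day, which is what makes X1 and X2 correlated across days.

random.seed(0)
N_DAYS = 100_000
P_BIG = 0.5           # size ratio, the same for every box
P_BOX3_VANILLA = 0.2  # box 3's own, different flavor ratio

def draw_donut(p_vanilla):
    size = "big" if random.random() < P_BIG else "small"
    flavor = "vanilla" if random.random() < p_vanilla else "chocolate"
    return (size, flavor)

both_vanilla = x1_vanilla = x2_vanilla = 0
for _ in range(N_DAYS):
    shared_ratio = random.choice([0.1, 0.9])  # today's ratio for boxes 1 and 2
    x1 = draw_donut(shared_ratio)
    x2 = draw_donut(shared_ratio)
    x3 = draw_donut(P_BOX3_VANILLA)  # drawn but not needed for the X1/X2 check
    x1_vanilla += x1[1] == "vanilla"
    x2_vanilla += x2[1] == "vanilla"
    both_vanilla += (x1[1] == "vanilla") and (x2[1] == "vanilla")

# If X1 and X2 were independent, P(both vanilla) would be ~P(X1 vanilla) * P(X2 vanilla).
p1, p2, p12 = x1_vanilla / N_DAYS, x2_vanilla / N_DAYS, both_vanilla / N_DAYS
print(f"P(X1 vanilla)={p1:.3f}  P(X2 vanilla)={p2:.3f}  P(both)={p12:.3f}  product={p1*p2:.3f}")
```

Under that assumption, P(both vanilla) comes out well above P(X1 vanilla) * P(X2 vanilla), i.e. X1 and X2 are correlated across days even though each individual draw is independent once the ratios are fixed.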
We could remove information from [the variable]. For instance, [it] could be a bit indicating whether the temperature is above 100°C.
I don't understand how this is less information than a bit indicating whether the temperature is above 50°C. Specifically, given a bit telling you whether the temperature is above 50°C, how do you know whether the temperature is above 100°C or between 50°C and 100°C?
As to the definition of short term goal: any goal that can be achieved (fully, e.g. without an "and keep it that way" clause) in a finite short time (for instance, in a few seconds), with the resources the system already has at hand. Equivalently, I think: any goal that doesn't push instrumental power seeking. As to how we know a system has a short term goal: if we could argue that systems prefer short term goals by default, then we still wouldn't know the goals of a particular system, but we could hazard a guess that they are short term. Perhaps we could expect short term goals by default if they were, for instance, easier to specify, and thus to have. As pointed out by others, if we try to give systems long term goals on purpose, they will probably end up with long term goals.
So long term goals aren't a default; market pressure will put them there as humans slowly cede more and more control to AIs, simply because the latter are making decisions that work out better. Presumably this would start with lower level decisions (e.g. how exactly to write this line of code; which employee to reward based on performance), and AIs would then slowly be given higher level decisions to make. In particular, we don't die the first time someone creates an AI with the ability to (escape, self-improve, and then) kill the competing humans, because that AI is likely focused on a much smaller, more near term goal. That way, if we're careful and clever, we have a chance to study a smarter-than-human general intelligence without dying. Is that an accurate description of how you see things playing out?
I'm not sure I understand; are you saying that given these, we have high P(Doom), or that these are necessary to be safe even if GIs have only short term goals? Or something else entirely?
I am using Wikipedia's definition: "Ensuring that emergent goals match the specified goals for the system is known as inner alignment."
Inner alignment is definitely a problem. In the case you described, the emergent goal was long term (ensure I remember the answer to 1+1), and I'm still wondering whether, by default, short term specified goals do or do not lead to strange long term goals like the one in your example.
I wrote the following in custom instructions, under "How would you like ChatGPT to respond?":
After every answer, give a short reflection, like "I looked at this answer again and it looks good" or "I thought about this again and I think it could be better if ..." or "After looking this over again, this makes no sense because...".
I then had the following conversation:
Q: What is denser, gold or uranium?
A: Uranium is denser than gold. The density of gold is about 19.32 grams per cubic centimeter, while the density of uranium is about 19.1 grams per cubic centimeter.
Upon reflection, I see that I made an error in my initial response. The correct densities are indeed 19.32 g/cm³ for gold and 19.1 g/cm³ for uranium, but these values indicate that gold is denser than uranium, not the other way around. I apologize for the confusion, and I appreciate your patience.
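For what it's worth, the same behavior should be reproducible programmatically; here is a rough sketch using the OpenAI Python client rather than the ChatGPT custom-instructions UI (the model name is just an example, and an API system message isn't guaranteed to behave identically to the UI's custom instructions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same instruction as above, passed as a system message instead of the
# ChatGPT "custom instructions" box.
REFLECT = (
    "After every answer, give a short reflection, like "
    "'I looked at this answer again and it looks good' or "
    "'I thought about this again and I think it could be better if ...' or "
    "'After looking this over again, this makes no sense because...'."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any chat model works here
    messages=[
        {"role": "system", "content": REFLECT},
        {"role": "user", "content": "What is denser, gold or uranium?"},
    ],
)
print(response.choices[0].message.content)
```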
The fact of the matter is that humans communicate. They learn to communicate on the basis of some combination of their internal similarities (in terms of goals and perception) and their shared environment. The natural abstraction hypothesis says that the shared environment accounts for more rather than less of it. I think of the NAH as a result of instrumental convergence - the shared environment ends up having a small number of levers that control a lot of the long term conditions in the environment, so the (instrumental) utility functions and environmental pressures are similar for beings with long term goals - they want to control the levers. The claim then is exactly that a shared environment provides most of the above.
Additionally, the operative question is what exactly it means for an LLM to be alien to us: does it converge to using enough human concepts for us to understand it, and if so, how quickly?