Q Home

Discord: LemonUniverse (lemonuniverse). Reddit: u/Smack-works. About my situation: here.

I wrote some bad posts before 2024 because I was very uncertain about how events might develop.

I do philosophical/conceptual research and have no mathematical or programming skills, but I do know a bunch of mathematical and computer science concepts.


Comments


I have a couple of questions/points. Might stem from not understanding the math.

1) The very first example shows that absolutely arbitrary things (e.g. arbitrary green lines) can be "natural latents". Does this mean that "natural latents" don't capture the intuitive idea of "natural abstractions"? That is, all natural abstractions are natural latents, but not all natural latents are natural abstractions? You seem to be confirming this interpretation, but I just want to be sure:

So we view natural latents as a foundational tool. The plan is to construct more expressive structures out of them, rich enough to capture the type signatures of the kinds of concepts humans use day-to-day, and then use the guarantees provided by natural latents to make similar isomorphism claims about those more-complicated structures. That would give a potential foundation for crossing the gap between the internal ontologies of two quite different minds.

Is there any writing about what those "more expressive structures" could be?

2) Natural latents can describe both things which propagate through very universal, local physical laws (e.g. heat) and any commonalities in made-up categories (e.g. "cakes"). Natural latents seem very interesting in the former case, but I'm not sure about the latter, and I'm not sure the analogy between the two gives any insight. I'm still not seeing any substantial similarity between cakes and heat or Ising models. I.e. I see that an analogy can be made, but I don't feel that this analogy is "grounded" in important properties of reality (locality, universality, low-levelness, stability, etc.). Does this make sense?

3) I don't understand what "those properties can in-principle be well estimated by intensive study of just one or a few mai tais" (from here) means. To me, a natural latent is something like ~"all words present in all of 100 books", which is impossible to know unless you read every single book.

If I haven't missed anything major, I'd say core insights about abstractions are still missing.

I have an idea about interpretability and defining search/optimization.

Trivial Property

Finite algorithms can solve infinite (classes of) problems. For example, the algorithm for adding two numbers has a finite description, yet can solve an infinity of examples.

This is a basic truth of computability theory.

Intuitively, it means that algorithms can exploit regularities in problems. But "regularity" here can only be defined tautologically (any smaller thing which solves/defines a bigger thing).
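A minimal toy illustration of this in Python (my own example, just restating the addition case above):

```python
# A finite description (a few lines of code) that handles infinitely many
# instances: any pair of integers, however large.
def add(a: int, b: int) -> int:
    return a + b

print(add(2, 3))            # 5
print(add(10**50, 7) % 10)  # 7 -- arbitrarily large inputs still work
```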

Less Trivial Property

Many algorithms have the following property:

  1. The algorithm computes a function f by computing another function g many times. Computing g is useful for computing f, and g is computed significantly more times than f.
  2. Computing g is significantly easier than computing f.
  3. g gives important information about f. This information is easy to compute and (maybe) trivial to prove.

Intuitively, it means that algorithms can exploit regularities in problems. However, here "regularity" has a stricter definition than in the trivial property (TP). A "regularity" is an easily computable thing which gives you important, easily computable information about a hard-to-compute thing.

Now, the question is: what classes of algorithms have the less trivial property (LTP)?

LTP includes a bunch of undefined terms:

  • "Significantly simpler", "easily computable", and "useful for computing". Those could be defined in terms of computational complexity.
  • "Significantly more times". Could be defined asymptotically.
  • "Important information". I have little idea how it should be defined. "Important information" may mean upper and lower bounds of a function, for example.

Probably giving fully general definitions right away is not important. We could try starting with overly restrictive definitions and see if we can prove anything about those.
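As one example of what an overly restrictive starting point might look like (my own sketch; T_f, T_g are running times, N_f, N_g are how many times each function is computed on inputs of size n, and none of these symbols come from the definition above):

$$T_g(n) = o\!\left(T_f(n)\right), \qquad N_g(n) = \omega\!\left(N_f(n)\right), \qquad g(x) \le f(x) \le c \cdot g(x)$$

Here the first condition cashes out "significantly easier", the second "significantly more times", and the third is one possible reading of "important information": the cheap value g(x) gives upper and lower bounds on the expensive value f(x) up to a constant factor c.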

Relation to AI safety

If neural networks implement algorithms with LTP, we could try finding those algorithms by looking for g (which is much easier than looking for f).

Furthermore, LTP seems very relevant to defining search / optimization.

Examples of LTP

Examples of algorithms with LTP:

  • Any greedy algorithm is an example. The whole solution to the problem is f; the greedy choice is g. (A runnable sketch follows this list.)
  • Many dynamic programming algorithms are an example. The solution to the whole problem is f; the solution to a smaller subproblem (which we need to reuse many times) is g.
  • Consider a strong chess engine. f: "given a position, find how to e.g. win material within an ~N-move horizon". g: "given a position, find how to e.g. win material within a ~K-move horizon", where K << N. Simply put, the N-moves-deep engine will often punish K-moves-deep mistakes. That's the reason why you can understand that you're losing long before you get checkmated. This wouldn't be true in a game much more chaotic than chess.
  • Consider simulated annealing: "given a function, find the global optimum". AFAICT, simulated annealing is greedy search + randomness. The greedy part is g.
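A minimal runnable sketch of the greedy bullet above (interval scheduling in Python; the names solve and greedy_choice are my own, standing in for f and g):

```python
# f = solve: the whole solution (a maximum set of non-overlapping intervals).
# g = greedy_choice: the cheap step that f computes many times.

def greedy_choice(intervals, current_end):
    """g: among intervals starting at or after current_end, pick the one that
    finishes earliest. Far cheaper than computing the whole schedule, yet its
    output is guaranteed to be part of some optimal schedule for the remaining
    intervals -- easily computable "important information" about f."""
    candidates = [iv for iv in intervals if iv[0] >= current_end]
    return min(candidates, key=lambda iv: iv[1]) if candidates else None

def solve(intervals):
    """f: build the full solution by calling g once per selected interval."""
    schedule, current_end = [], float("-inf")
    while True:
        choice = greedy_choice(intervals, current_end)
        if choice is None:
            return schedule
        schedule.append(choice)
        current_end = choice[1]

# Intervals as (start, finish) pairs.
print(solve([(1, 3), (2, 5), (4, 7), (6, 8), (8, 9)]))  # [(1, 3), (4, 7), (8, 9)]
```

Each call to greedy_choice is a single linear scan, while solve needs many such calls (and a naive search over all subsets would be exponential); that's the f/g asymmetry the property points at.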

Got around to interrogating Gemini for a bit.

Seems like KSF talks about programs generating sets. It doesn't say anything about the internal structure of the programs (but that's where objects such as "real diamonds" live). So let's say the data is a very long video about dogs doing various things. If I apply KSF, I get programs (aka "codes") generating sets of videos. But it doesn't help me identify "the most dog-like thing" inside each program. For example, one of the programs might be an atomic model of physics, where "the most dog-like things" are stable clouds of atoms. But KSF doesn't help me find those clouds. A similarity metric between videos doesn't help either.
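For reference, here is the standard Kolmogorov structure function as I understand it (the notation is the textbook one, not something from the post): for data x and a complexity budget α,

$$h_x(\alpha) = \min_{S}\ \{\ \log_2 |S| \ :\ x \in S,\ K(S) \le \alpha\ \}$$

It scores a finite set S containing x only by its size and its description complexity K(S); nothing in the definition looks inside the program that specifies S, which is exactly the gap described above.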

My conceptual solution to the above problem, proposed in the post: if you have a simple program with a special internal structure describing simple statistical properties of "dog-shaped pixels" (such a program is guaranteed to exist), there also exists a program with a very similar internal structure describing "valuable physical objects causing dog-shaped pixels" (if such a program doesn't exist, then "valuable physical objects causing dog-shaped pixels" don't exist either).[1] Finding "the most dog-like things" in such a program is trivial. Therefore, we should be able to solve ontology identification by heavily restricting the internal structure of programs (to structures which look similar to simple statistical patterns in sensory data).

So, to formalize my "conceptual solution" we need models which are visually/structurally/spatially/dynamically similar to the sensory data they model. I asked Gemini about it, multiple times, with Deep Research. The only interesting reference Gemini found is Agent-based models (AFAIU, "agents" just means "any objects governed by rules").

  1. ^

    This is not obvious; it requires analyzing basic properties of human values.

Then you can use the three dot points in my comment to construct source code for a new agent that does the same thing, but is not nicely separated.

This is the step I don't get (how we make the construction), because I don't understand SGD well. What does "sample N world models" mean?

My attempt to understand: We have a space of world models (W) and a space of plans (P). We pick points from P (using SGD) and evaluate them on the best points of W (we got those best points by trying to predict the world and applying SGD).

My thoughts/questions: To find the best points of W, we still need to do modelling independently from planning? While the world model is not stored in memory, some pointer to the best points of W is stored? We at least have "the best current plan" stored independently from the world models?

First things first: yes, defining a special language which creates a safe but useful AGI is, more or less, just a restatement of the problem. But the post doesn't just restate the problem: it describes the core principle of the language (the comprehension/optimization metric) and makes arguments for why the language should be provably sufficient for solving a big part of alignment.

You're saying that the simple AI can tell if the more complex AI's plans are good, bad, or unnecessary--but also the latter "can know stuff humans don't know". How?

This section deduces the above from claims A and B. What part of the deduction do you disagree with/confused about? Here's how the deduction would apply to the task "protect a diamond from destruction":

  1. The simpler AI cares about an ontologically fundamental diamond. The more complex AI models the world as clouds of atoms.
  2. According to the principle, we can automatically find what object in the more complex AI's model corresponds to the "ontologically fundamental diamond".
  3. Therefore, we can know which of the more complex AI's plans would preserve the diamond. We also can know if applying any weird optimization to the diamond is necessary for preserving it. Checking for necessity is probably hard and might require another novel insight. But "necessity" is a simple object-level property.

The automatic finding of the correspondence (step 2) between an important comprehensible concept and an important incomprehensible concept resolves the apparent contradiction.[1]

  1. ^

    Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.

    If my principle is hard to agree with, please try to assume that it's true and see if you can follow how it solves some alignment problems.

So, somehow you're able to know when an AI is exerting optimization power in "a way that flows through" some specific concepts?

Yes, we're able to tell if the AI optimizes through a specific class of concepts. In most/all sections of the post I'm assuming the AI generates concepts in a special language (i.e. it's not just a trained neural network), a language which makes it possible to measure the complexity of concepts. The claim is that if you're optimizing through concepts of a certain complexity, then you can't fulfill a task in a "weird" way. If the claim is true and the AI doesn't think in arbitrary languages, then it's supposed to be impossible to create a harmful Doppelganger.

But I don't get if, or why, you think that adds up to anything like the above.

Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.

Anyway, is the following basically what you're proposing?

Yes. The special language is supposed to have the property that the simpler AI can automatically learn if the more complex AI plans good, bad, or unnecessary actions. The more complex AI can't be arbitrarily smarter than humans, but it's a general intelligence which doesn't imitate humans and can know stuff humans don't know.

Could you ELI15 the difference between Kolmogorov complexity (KC) and Kolmogorov structure function (KSF)?

Here are some of the things needed to formalize the proposal in the post:

  1. A complexity metric defined for different model classes.
  2. A natural way to "connect" models. So we can identify the same object (e.g. "diamond") in two different models. Related: multi-level maps.

I feel something like KSF could tackle 1, but what about 2?

Thanks for clarifying! Even if I still don't fully understand your position, I now see where you're coming from.

No, I think it's what humans actually pursue today when given the options. I'm not convinced that these values are static, or coherent, much less that we would in fact converge.

Then those values/motivations should be limited by the complexity of human cognition, since they're produced by it. Isn't that trivially true? I agree that values can be incoherent, fluid, and not converging to anything. But building Task AGI doesn't require building an AGI which learns coherent human values. It "merely" requires an AGI which doesn't affect human values in large and unintended ways.

No, because we don't comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.

This feels like arguing over definitions. If you have an oracle for solving certain problems, this oracle can be defined as a part of your problem-solving ability, even if it's not transparent compared to your other problem-solving abilities. Similarly, the machinery which calculates a complicated function from sensory inputs to judgements (e.g. from Mona Lisa to "this is beautiful") can be defined as a part of our comprehension ability. Yes, humans don't know (1) the internals of the machinery or (2) some properties of the function it calculates — but I think you haven't given an example of how human values depend on knowledge of 1 or 2. You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but that function having maxima is not an unknown property; it's a trivial property (some foods are worse than others, therefore some foods have the best taste).

That's a very big "if"! And simplicity priors are made questionable, if not refuted, by the fact that we haven't gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.

I agree that ambitious value learning is a big "if". But Task AGI doesn't require it.

But:

  • To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. At least before humans start unlimited self-modification. I think this logically can't be false.
  • Eliezer Yudkowsky is a core proponent of complexity of value, but in Thou Art Godshatter and Protein Reinforcement and DNA Consequentialism he basically makes the point that human values arose from complexity limitations, including complexity limitations imposed by brainpower limitations. Some famous alignment ideas (e.g. NAH, Shard Theory) kinda imply that human values are limited by human ability to comprehend, and that doesn't seem controversial. (The ideas themselves are controversial, but for other reasons.)
  • If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn't it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?

Based on your comments, I can guess that something below is the crux:

  1. You define "values" as ~"the decisions humans would converge to after becoming arbitrarily more knowledgeable". But that's a somewhat controversial definition (some knowledge can lead to changes in values) and even given that definition it can be true that "past human ability to comprehend limits human values" — since human values were formed before humans explored unlimited knowledge. Some values formed when humans were barely generally intelligent. Some values formed when humans were animals.
  2. You say that values depend on inscrutable brain machinery. But can't we treat the machinery as a part of "human ability to comprehend"?
  3. You talk about ontology. Humans can care about real diamonds without knowing what physical things the diamonds are made from. My reply: I define "ability to comprehend" based on ability to comprehend functional behavior of a thing under normal circumstances. Based on this definition, a caveman counts as being able to comprehend the cloud of atoms his spear is made of (because the caveman can comprehend the behavior of the spear under normal circumstances), even though the caveman can't comprehend atomic theory.

Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?

Are you talking about value learning? My proposal doesn't tackle advanced value learning. Basically, my argument is "if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values — so we can define safe impact measures and corrigibility". My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is "if A and B hold, then we can draw a box around human values and tell the AI to not mess up the contents of the box — without making the AI useless; yet the AI might not know what exact contents of the box count as 'human values'".[1]

The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than human general ability to comprehend things. I interpreted you as making this counterargument in the top level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I'm not talking about extrapolating something out of distribution. Unless I'm missing your point.

  1. ^

    Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might've failed.
