Discord: LemonUniverse (lemonuniverse). Reddit: u/Smack-works. About my situation: here.
I wrote some bad posts before 2024 because I was very uncertain about how events might develop.
I do philosophical/conceptual research and have no mathematical or programming skills, but I do know a bunch of mathematical and computer science concepts.
I have an idea about interpretability and defining search/optimization.
Finite algorithms can solve infinite (classes of) problems. For example, the algorithm for adding two numbers has a finite description, yet can solve an infinity of examples.
This is a basic truth of computability theory.
Intuitively, it means that algorithms can exploit regularities in problems. But "regularity" here can only be defined tautologically (any smaller thing which solves/defines a bigger thing).
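To make the finite-description point concrete, here's a toy sketch of my own (not from the post): grade-school addition is one short procedure, yet it handles inputs of any length.

```python
def add(a: str, b: str) -> str:
    """Grade-school addition on decimal strings: a finite description
    that handles an infinite class of inputs."""
    a, b = a.zfill(len(b)), b.zfill(len(a))   # pad to equal length
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

assert add("999", "1") == "1000"   # works for arbitrarily long inputs
```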
Many algorithms have the following property:
Intuitively, it means that algorithms can exploit regularities in problems. However, here "regularity" has a stricter definition than in the trivial property (TP). A "regularity" is an easily computable thing which gives you important, easily computable information about a hard-to-compute thing.
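A toy illustration of "easily computable information about a hard-to-compute thing" (my own example, just to pin down the intuition): residues mod 9 are cheap to compute from the digits, yet they constrain the result of a huge multiplication without performing it ("casting out nines").

```python
def digit_residue(n: str) -> int:
    """Residue mod 9, cheaply computed from the digits ("casting out nines")."""
    return sum(int(d) for d in n) % 9

# The regularity: residue(a * b) == (residue(a) * residue(b)) % 9.
# Mismatched residues refute a claimed product without doing the full multiplication.
a, b = "123456789", "987654321"
claimed_product = "121932631112635268"   # a (deliberately wrong) candidate answer
if (digit_residue(a) * digit_residue(b)) % 9 != digit_residue(claimed_product):
    print("The claimed product is definitely wrong.")
```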
Now, the question is: what classes of algorithms have the less trivial property (LTP)?
LTP includes a bunch of undefined terms:
Probably giving fully general definitions right away is not important. We could try starting with overly restrictive definitions and see if we can prove anything about those.
If neural networks implement algorithms with LTP, we could try finding those algorithms by looking for the regularities they exploit (which is much easier than looking for the algorithms themselves).
Furthermore, LTP seems very relevant to defining search / optimization.
Examples of algorithms with LTP:
Got around to interrogating Gemini for a bit.
Seems like KSF talks about programs generating sets. It doesn't say anything about the internal structure of the programs (but that's where objects such as "real diamonds" live). So let's say our data is a very long video of dogs doing various things. If I apply KSF, I get programs (aka "codes") generating sets of videos. But it doesn't help me identify "the most dog-like thing" inside each program. For example, one of the programs might be an atomic model of physics, where "the most dog-like things" are stable clouds of atoms. But KSF doesn't help me find those clouds. A similarity metric between videos doesn't help either.
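For reference, the standard definitions I have in mind (roughly stated; the structure-function form is the usual Vereshchagin–Vitányi one). Note that the structure function only constrains which *set* the program describes, not the program's internal structure:

```latex
% Kolmogorov complexity of a string x (length of the shortest program for x):
K(x) = \min \{\, |p| \;:\; U(p) = x \,\}

% Kolmogorov structure function of x: the log-size of the smallest set
% containing x that is describable within complexity budget \alpha.
h_x(\alpha) = \min_{S \ni x} \{\, \log_2 |S| \;:\; K(S) \le \alpha \,\}
```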
My conceptual solution to the above problem, proposed in the post: if you have a simple program with special internal structure describing simple statistical properties of "dog-shaped pixels" (such a program is guaranteed to exist), there also exists a program with very similar internal structure describing "valuable physical objects causing dog-shaped pixels" (if such a program doesn't exist, then "valuable physical objects causing dog-shaped pixels" don't exist either).[1] Finding "the most dog-like things" in such a program is trivial. Therefore, we should be able to solve ontology identification by heavily restricting the internal structure of programs (to structures which look similar to simple statistical patterns in sensory data).
So, to formalize my "conceptual solution" we need models which are visually/structurally/spatially/dynamically similar to the sensory data they model. I asked Gemini about it, multiple times, with Deep Research. The only interesting reference Gemini found is Agent-based models (AFAIU, "agents" just means "any objects governed by rules").
This is not obvious; it requires analyzing basic properties of human values.
Then you can use the three dot points in my comment to construct source code for a new agent that does the same thing, but is not nicely separated.
This is the step I don't get (how we make the construction), because I don't understand SGD well. What does "sample N world models" mean?
My attempt to understand: we have a space of world models and a space of plans. We pick points from the plan space (using SGD) and evaluate them on the best points of the world-model space (we got those best points by trying to predict the world and applying SGD).
My thoughts/questions: To find the best points of the world-model space, don't we still need to do modelling independently from planning? While the world model is not stored in memory, is some pointer to the best points of the world-model space stored? Do we at least have "the best current plan" stored independently from the world models?
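To check my reading, here's a minimal toy sketch of the picture in my head (all names and numbers are hypothetical stand-ins, not from your comment): fit several world models by SGD on prediction loss, keep the best ones, then score sampled plans only against those.

```python
import random

random.seed(0)

# Toy stand-ins: a "world model" is a guess at a hidden number, a "plan" is
# another number, and observations are noisy readings of the hidden number.
TRUE_STATE = 7.0
observations = [TRUE_STATE + random.gauss(0, 0.1) for _ in range(50)]

def prediction_loss(world_model):
    return sum((obs - world_model) ** 2 for obs in observations)

def sgd_fit(steps=200, lr=0.05):
    """'Modelling': improve a world model by gradient steps on prediction loss."""
    w = random.uniform(-10, 10)
    for _ in range(steps):
        grad = sum(2 * (w - obs) for obs in observations) / len(observations)
        w -= lr * grad
    return w

# "Sample N world models": run the fitting procedure several times, keep the best.
world_models = sorted((sgd_fit() for _ in range(5)), key=prediction_loss)[:3]

def plan_score(plan, world_model):
    return -(plan - world_model) ** 2   # toy: a good plan matches the hidden state

# "Planning": evaluate candidate plans only against the best world models.
plans = [random.uniform(-10, 10) for _ in range(100)]
best_plan = max(plans, key=lambda p: min(plan_score(p, w) for w in world_models))
```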
First things first: defining a special language which creates a safe but useful AGI absolutely is, more or less, just a restatement of the problem. But the post doesn't just restate the problem: it describes the core principle of the language (the comprehension/optimization metric) and makes arguments for why the language should be provably sufficient for solving a big part of alignment.
You're saying that the simple AI can tell if the more complex AI's plans are good, bad, or unnecessary--but also the latter "can know stuff humans don't know". How?
This section deduces the above from claims A and B. Which part of the deduction do you disagree with or find confusing? Here's how the deduction would apply to the task "protect a diamond from destruction":
The automatic finding of the correspondence (step 2) between an important comprehensible concept and an important incomprehensible concept resolves the apparent contradiction.[1]
Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.
If my principle is hard to agree with, please try to assume that it's true and see if you can follow how it solves some alignment problems.
So, somehow you're able to know when an AI is exerting optimization power in "a way that flows through" some specific concepts?
Yes, we're able to tell if the AI optimizes through a specific class of concepts. In most/all sections of the post I'm assuming the AI generates concepts in a special language (i.e. it's not just a trained neural network), a language which allows measuring the complexity of concepts. The claim is that if you're optimizing through concepts of certain complexity, then you can't fulfill a task in a "weird" way. If the claim is true and the AI doesn't think in arbitrary languages, then it's supposed to be impossible to create a harmful Doppelganger.
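A crude sketch of the kind of check I have in mind (purely illustrative; string length here is just a stand-in for whatever the special language actually measures):

```python
# Hypothetical stand-ins: pretend each concept in the special language has a
# measurable complexity (here just string length, standing in for the real metric).
def concept_complexity(concept: str) -> int:
    return len(concept)

def plan_is_allowed(plan_concepts: list[str], task_budget: int) -> bool:
    """Reject plans that optimize through concepts above the task's complexity budget."""
    return all(concept_complexity(c) <= task_budget for c in plan_concepts)

assert plan_is_allowed(["diamond", "vault door"], task_budget=20)
assert not plan_is_allowed(["self-replicating nanobot swarm rewriting physics"], task_budget=20)
```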
But I don't get if, or why, you think that adds up to anything like the above.
Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.
Anyway, is the following basically what you're proposing?
Yes. The special language is supposed to have the property that the simpler AI can automatically learn if the more complex AI's plans contain good, bad, or unnecessary actions. The more complex AI can't be arbitrarily smarter than humans, but it's a general intelligence which doesn't imitate humans and can know stuff humans don't know.
Could you ELI15 the difference between Kolmogorov complexity (KC) and Kolmogorov structure function (KSF)?
Here are some of the things needed to formalize the proposal in the post:
I feel something like KSF could tackle 1, but what about 2?
Thanks for clarifying! Even if I still don't fully understand your position, I now see where you're coming from.
No, I think it's what humans actually pursue today when given the options. I'm not convinced that these values are static, or coherent, much less that we would in fact converge.
Then those values/motivations should be limited by the complexity of human cognition, since they're produced by it. Isn't that trivially true? I agree that values can be incoherent, fluid, and not converging to anything. But building Task AGI doesn't require building an AGI which learns coherent human values. It "merely" requires an AGI which doesn't affect human values in large and unintended ways.
No, because we don't comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.
This feels like arguing over definitions. If you have an oracle for solving certain problems, this oracle can be defined as a part of your problem-solving ability, even if it's not transparent compared to your other problem-solving abilities. Similarly, the machinery which calculates a complicated function from sensory inputs to judgements (e.g. from Mona Lisa to "this is beautiful") can be defined as a part of our comprehension ability. Yes, humans don't know (1) the internals of the machinery or (2) some properties of the function it calculates — but I think you haven't given an example of how human values depend on knowledge of (1) or (2). You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but that function having maxima is not an unknown property; it's a trivial property (some foods are worse than others, therefore some foods have the best taste).
That's a very big "if"! And simplicity priors are made questionable, if not refuted, by the fact that we haven't gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
I agree that ambitious value learning is a big "if". But Task AGI doesn't require it.
Based on your comments, I can guess that something below is the crux:
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?
Are you talking about value learning? My proposal doesn't tackle advanced value learning. Basically, my argument is "if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values — so we can define safe impact measures and corrigibility". My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is "if A and B hold, then we can draw a box around human values and tell the AI to not mess up the contents of the box — without making the AI useless; yet the AI might not know what exact contents of the box count as 'human values'".[1]
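To make "draw a box around human values" concrete, here's a toy sketch (entirely my own illustration, with hypothetical variable names): plans are penalized for any change to a coarse set of "box" variables, so the AI doesn't need to know which of them are the real values.

```python
# Hypothetical toy: the "box" is a coarse set of variables that might encode
# human values; plans are penalized for any change to them, so the AI doesn't
# need to know which variables are the real values.
def plan_value(plan_effects: dict, task_reward: float,
               box_variables: set, penalty: float = 10.0) -> float:
    impact = sum(abs(delta) for var, delta in plan_effects.items()
                 if var in box_variables)
    return task_reward - penalty * impact

box = {"art", "friendships", "curiosity"}   # deliberately coarse, may overshoot
cautious = plan_value({"factory_output": 3.0}, task_reward=5.0, box_variables=box)
reckless = plan_value({"factory_output": 9.0, "curiosity": -2.0},
                      task_reward=8.0, box_variables=box)
assert cautious > reckless   # 5.0 > 8.0 - 10.0 * 2.0
```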
The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than human general ability to comprehend things. I interpreted you as making this counterargument in the top level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I'm not talking about extrapolating something out of distribution. Unless I'm missing your point.
Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might've failed.
I have a couple of questions/points; they might stem from me not understanding the math.
1) The very first example shows that absolutely arbitrary things (e.g. arbitrary green lines) can be "natural latents". Does this mean that "natural latents" don't capture the intuitive idea of "natural abstractions"? That is, all natural abstractions are natural latents, but not all natural latents are natural abstractions. You seem to be confirming this interpretation, but I just want to be sure:
Is there any writing about what those "more expressive structures" could be?
2) Natural latents can describe both things which propagate through very universal, local physical laws (e.g. heat) and any commonalities in made-up categories (e.g. "cakes"). Natural latents seem very interesting in the former case, but I'm not sure about the latter, and I'm not sure the analogy between the two gives any insight. I'm still not seeing any substantial similarity between cakes and heat or Ising models. I.e. I see that an analogy can be made, but I don't feel that this analogy is "grounded" in important properties of reality (locality, universality, low-levelness, stability, etc.). Does this make sense?
3) I don't understand what "those properties can in-principle be well estimated by intensive study of just one or a few mai tais" (from here) means. To me a natural latent is something like ~"all words present in all of 100 books"; it seems impossible to know it unless you read every single book.
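A toy version of my intuition (my own illustration, not from the post):

```python
# Toy version of my intuition: the "latent" is whatever is shared by ALL books.
books = [
    {"the", "dog", "ran", "home"},
    {"the", "cat", "ran", "away"},
    {"the", "fox", "ran", "far"},
]
shared_words = set.intersection(*books)   # {"the", "ran"}
# Any single unread book could knock a word out of this set, which is why
# "estimate it from one or a few samples" is the part I don't understand.
```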
If I haven't missed anything major, I'd say core insights about abstractions are still missing.