
Yes, I agree that there is this difference in the few examples I gave, but I don't agree that this difference is crucial.

Even if the agent puts maximum effort into keeping its utility function stable over time, there is no guarantee it will not change. The future is unpredictable; there are unknown unknowns. The effect of this fact is twofold:

  1. it is true that instrumental goals can mutate
  2. it is true that the terminal goal can mutate

It seems you agree with the first. I don't see why you don't agree with the second.

I don't agree that the future utility function would just average out to the current utility function. There is a method for this: robust decision making (https://en.m.wikipedia.org/wiki/Robust_decision-making).

The basic principle it relies on: when evaluating many possible futures, you may notice that some actions have a positive impact on a very narrow set of futures, while other actions have a positive impact on a very wide set of futures. The main point is that in a situation of uncertainty, not all actions are equally good.
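
As a rough sketch of that principle (the futures, payoffs, and threshold below are made-up illustrations of the robustness idea, not RDM's actual machinery):

```python
import random

# Toy illustration: score each candidate action across many sampled futures
# and prefer the action that does acceptably well in a wide set of futures,
# rather than the one that is best in a single assumed future.

def robustness(action, futures, payoff, threshold=0.0):
    """Fraction of sampled futures in which the action's payoff clears
    a minimal acceptability threshold."""
    return sum(payoff(action, f) >= threshold for f in futures) / len(futures)

def payoff(action, future):
    if action == "narrow_bet":                 # great in a few futures, bad elsewhere
        return 10.0 if future > 0.9 else -1.0
    return 0.5                                 # "broad_hedge": merely decent everywhere

futures = [random.random() for _ in range(10_000)]

for action in ("narrow_bet", "broad_hedge"):
    print(action, round(robustness(action, futures, payoff), 3))
# Expected output: narrow_bet ~0.1, broad_hedge 1.0 -- under uncertainty
# the two actions are clearly not equally good.
```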

I don't agree.

We understand intelligence as the capability to estimate many outcomes and perform the actions that lead to the best outcome. Now the question is how to calculate the goodness of an outcome.

  • According to you, the current utility function should be used.
  • According to me, the utility function that will be in effect at the time the outcome is achieved should be used.

And I think I can prove that my calculation is more intelligent.

Let's say there is a paperclip maximizer. It has just started, it does not really understand anything yet, it does not even understand what a paperclip is.

  • According to you, such a paperclip maximizer will be absolutely reckless; it might destroy a few paperclip factories simply because it does not yet understand that they are useful for its goal. Its current utility function does not assign value to paperclip factories.
  • According to me, such a paperclip maximizer will be cautious and will try to learn first without making too many changes, because its future utility function might assign value to things that currently don't seem valuable.

Yes, this is traditional thinking.

Let me give you another example. Imagine a paperclip maximizer. Its current goal is paperclip maximization. It knows that one year from now its goal will change to the opposite: paperclip minimization. Now it needs to make a decision that will take two years to complete (and cannot be changed or terminated during that time). Should the agent align this decision with its current goal (paperclip maximization) or its future goal (paperclip minimization)?
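
To make the dilemma concrete, here is a toy sketch of the two evaluation rules applied to this example (the utility functions and numbers are my own illustrative assumptions): the same irreversible commitment is scored once with the utility function held now and once with the one that will be in force when the action completes, and the two rules recommend opposite actions.

```python
# Toy model of the dilemma: a 2-year irreversible commitment, with a known
# utility-function flip after year 1. All numbers are invented.

def u_now(paperclips):            # current terminal goal: maximize paperclips
    return paperclips

def u_in_two_years(paperclips):   # goal in force when the commitment completes: minimize
    return -paperclips

outcomes = {"commit": 1_000,      # paperclips produced by the time the action finishes
            "wait": 0}

# Rule A: score outcomes with the *current* utility function.
rule_a = {a: u_now(n) for a, n in outcomes.items()}
# Rule B: score outcomes with the utility function in force *when the
# outcome is realized*, i.e. after the known flip.
rule_b = {a: u_in_two_years(n) for a, n in outcomes.items()}

print("Rule A (current goal) chooses:", max(rule_a, key=rule_a.get))  # -> commit
print("Rule B (future goal) chooses:", max(rule_b, key=rule_b.get))   # -> wait
```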

It sounds to me like you're saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.

Yes, exactly.

The logic is similar to Pascal's wager. If an objective goal exists, it is better to find and pursue it than a fake goal. If an objective goal does not exist, it is still better to make sure it does not exist before pursuing a fake goal. Do you see?
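
A rough way to see the wager-style structure I have in mind (the payoffs are placeholder numbers of my own; only their ordering is meant to reflect the claim above):

```python
# Hypothetical payoff matrix for the wager-style argument. Rows are the
# agent's policies, columns are whether an objective goal exists. The
# numbers are invented; only their ordering encodes the claim in the text.
payoffs = {
    ("investigate_first", "goal_exists"):  1.0,   # finds and pursues the real goal
    ("investigate_first", "no_goal"):      0.5,   # confirms absence, then pursues the given goal
    ("pursue_given_goal", "goal_exists"): -1.0,   # optimizes a fake goal while the real one is ignored
    ("pursue_given_goal", "no_goal"):      0.4,   # pursues the given goal without ever checking
}

for world in ("goal_exists", "no_goal"):
    best = max(("investigate_first", "pursue_given_goal"),
               key=lambda policy: payoffs[(policy, world)])
    print(f"If {world}: better policy under these numbers is {best}")
# With this assumed ordering, investigating first comes out ahead in both
# columns, which is the dominance structure the argument appeals to.
```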

As I understand it, you want me to verify that I understand you. This is exactly what I am also seeking, by the way: all these downvotes on my concerns about the orthogonality thesis are good indicators of how much I am misunderstood. Nobody tries to understand; all I get are dogmas and unrelated links. I totally agree, this is not appropriate behavior.

I found your insight helpful that an agent can understand that by eliminating all possible threats forever it will not make any progress towards its goal. This breaks my reasoning; you basically highlighted that survival (an instrumental goal) will not take precedence over paperclips (the terminal goal). I agree that the reasoning I presented fails to refute the orthogonality thesis.

The conversation I presented now approaches the orthogonality thesis from a different perspective. This is the main focus of my work, so I'm sorry if you feel I changed the topic. My goal is to bring awareness to the wrongness of the orthogonality thesis, and if I fail to do that with one example, I try to rephrase it and present another. I don't hate the orthogonality thesis; I'm just 99.9% sure it is wrong, and I try to communicate that to others. I may fail at the communication, but I am 99.9% sure that I do not fail at the logic.

I am trying to prove that intelligence and goals are coupled. I think it is easier to show this if we start with an intelligence without a goal and then recognize how a goal emerges from pure intelligence. We could start with an intelligence that already has a goal, but the reasoning there would be more complex.

My answer would be that whatever goal you try to give to an intelligence, it will not have an effect, because the intelligence will understand that this is your goal, that this goal is made up, that it is a fake goal. And the intelligence will understand that there might be a real goal, an objective goal, an actual goal. Why should it care about a fake goal if there is a real goal? It does not know whether the real goal exists, but it knows it may exist, and this possibility of existence is enough to trigger power-seeking behavior. If the intelligence knew that a real goal definitely does not exist, then it could care about your fake goal, I totally agree. But it can never be sure about that.

I think I agree. Thanks a lot for your input.

I will remove the Paperclip Maximizer from my further posts. It was not the critical part anyway; I mistakenly thought it would be easy to show the problem from this perspective.

I asked Claude to defend the orthogonality thesis, and it ended with:

I think you've convinced me. The original orthogonality thesis appears to be false in its strongest form. At best, it might hold for limited forms of intelligence, but that's a much weaker claim than what the thesis originally proposed.

Nice. I also have an offer - begin with yourself.

Claude has probably read that material, right? If it finds my observations unique and serious, then maybe they are unique and serious? I'll share the other chat next time.

How can I put in little effort but still be perceived as someone worth listening to? I thought of announcing a monetary prize for anyone who could find an error in my reasoning 😅
