Have you tried writing actual code?
That's probably the root cause of our disagreement. My findings are at a very high philosophical level (the fact-value distinction), while you seem to interpret them at a very low level (code). I think this gap prevents us from reaching consensus.
There are two ways to solve that: I could go down to code, or you could go up to philosophy. And I don't like the idea of going down to code, because:
Would you consider going up to philosophy? Science typically comes before applied science.
There is such a thing in logic as proof by contradiction. I think your current beliefs lead to a contradiction. Don't you think?
evaluate all options, choose the one that leads to more cups; if there is more than one such option, choose randomly
The problem is that this algorithm is not intelligent. It can only work on agents with poor reasoning abilities. A smarter agent will not follow this algorithm, because it will notice a contradiction: there might be things it does not know yet that are much more important than cups, and caring about cups wastes its resources.
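The quoted rule can be sketched as a short function. All names and numbers here are mine, purely to make the structure of the rule concrete:

```python
import random

def choose_action(options, cups_after):
    """The quoted rule: evaluate all options, pick the one that leads
    to the most cups; if several are tied, choose randomly.
    cups_after[o] is the (assumed known) cup count after option o."""
    best = max(cups_after[o] for o in options)
    best_options = [o for o in options if cups_after[o] == best]
    return random.choice(best_options)

# Example: with two options, the rule mechanically picks the cup-maximizing one.
print(choose_action(["make_cups", "idle"], {"make_cups": 5, "idle": 0}))  # make_cups
```

Note that nothing in this rule can represent "things I don't know about yet" - that is exactly the gap I am pointing at.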
(Also, come on, LLMs are notoriously bad at math, plus if you push them hard enough you can convince them of a lot of things.)
People (even very smart people) are also notoriously bad at math. I found this video informative.
I did not push the LLMs.
ChatGPT picked 2024-12-31 18:00.
Gemini picked 2024-12-31 18:00.
Claude picked 2025-01-01 00:00.
I don't know how I can make it more obvious that your belief is questionable. I don't think you follow "If you disagree, try getting curious about what your partner is thinking". That's a problem not only with you, but with the LessWrong community. I know that preserving this belief is very important to you. But I'd like to kindly invite you to be a bit more sceptical.
How can you say that these forecasts are equal?
A little thought experiment.
Imagine there is an agent that has a terminal goal to produce cups. The agent knows that its terminal goal will change on New Year's Eve to producing paperclips. The agent has only one action available to it: starting a paperclip factory. The factory begins producing paperclips 6 hours after it is started.
When will the agent start the paperclip factory? 2024-12-31 18:00? 2025-01-01 00:00? Now? Some other time?
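Under my stated assumptions (goal flips at midnight, 6-hour lead time), the answer can be computed directly. The horizon and the hour encoding below are mine, purely illustrative:

```python
LEAD_TIME = 6   # hours the factory needs before the first paperclip
HORIZON = 24    # evaluate paperclip output up to 24 h after midnight

def paperclip_hours(start):
    """Hours of paperclip production the agent gets if it starts the
    factory at `start` (hours relative to midnight; -6 means 18:00)."""
    first_clip = start + LEAD_TIME
    return max(0, HORIZON - max(first_clip, 0))

# Starting at 18:00 means production begins exactly when the new goal
# takes effect; starting at midnight wastes the first six hours.
print(paperclip_hours(-6))  # 24
print(paperclip_hours(0))   # 18
```

An agent that scores outcomes only by its current (cup) utility has no reason to start the factory at all; only an agent that weighs the future utility picks 18:00.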
just to minimize the possible harm to these people if that happens, I will on purpose never collect their personal data, and will also tell them to be suspicious of me if I contact them in future
I don't think this would be a rational thing to do. If I knew that I would become a psychopath on New Year's Eve, I would provide all relevant help to people until then. People being protected after New Year's Eve is not in my interest; people being vulnerable after New Year's Eve is in my interest.
Or in other words:
I am sure you can't prove your position, and I am sure I can prove mine.
Your reasoning is based on the assumption that all value is known: if the utility function assigns value to something, it is valuable; if it does not, it is not valuable. The truth is that something might be valuable even though your utility function does not know it yet. It would be more intelligent to use three categories: valuable, not valuable, and unknown.
Let's say you are booking a flight and have the possibility to get checked baggage for free. To the best of your current knowledge, it is absolutely not relevant for you. But you understand that your knowledge might change, and it costs nothing to keep more options open, so you take the checked baggage.
Let's say you are a traveler, a wanderer. You have limited space in your backpack. Sometimes you find items and need to choose: put one in the backpack or not. You definitely keep items that are useful, and you leave behind items that are not. What do you do if you find an item whose usefulness is unknown - some mysterious item? Take it if it is small, leave it if it is big? According to you, it is obvious to leave it. That does not sound intelligent to me.
Options look like this:
Don't you think that "knowledge about the usefulness of an item" can sometimes be worth "a burden"? Basically, I have described the concept of an experiment here.
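The backpack trade-off is essentially a value-of-information calculation. A minimal sketch, with probabilities and costs that are entirely made up:

```python
def expected_value_of_taking(p_useful, value_if_useful, carry_cost):
    """Expected payoff of carrying a mysterious item: with probability
    p_useful it turns out to be worth value_if_useful; carrying it
    always costs carry_cost (space, weight - the 'burden')."""
    return p_useful * value_if_useful - carry_cost

# A big burden can still be worth carrying if the possible upside is large.
print(expected_value_of_taking(0.2, 100, 10))  # 10.0 > 0: take it
print(expected_value_of_taking(0.2, 100, 30))  # -10.0 < 0: leave it
```

The point is that "unknown usefulness" enters the decision as a probability, not as zero - which is exactly what the two-category (valuable / not valuable) scheme cannot express.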
You will probably say: sure, sounds good, but this applies to instrumental goals only. There is no reason to assume that. I tried to highlight that ignoring unknowns is not intelligent. This applies to both terminal and instrumental goals.
Let's say there is a paperclip maximizer which knows its terminal goal will change to the pursuit of happiness in a week. Its decisions basically lead to these outcomes:
The 1st and 4th are better outcomes than the 2nd and 3rd, and I think an intelligent agent would work towards both (the 1st and the 4th) if they do not conflict. Of course, my earlier problem with many unknown future goals is more complex, but I hope you see that focusing on the 1st and not caring about the 4th at all is not intelligent.
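My point about working towards both can be phrased as scoring each strategy under the current and the future utility function together. The strategies and all numbers below are hypothetical, just to show the shape of the argument:

```python
# A paperclip maximizer whose terminal goal becomes happiness in a week.
# Scores are illustrative: paperclips made before the change, and how
# well the agent is prepared for the happiness goal after it.
strategies = {
    "paperclips_only":     {"paperclips": 10, "happiness_prep": 0},
    "happiness_prep_only": {"paperclips": 0,  "happiness_prep": 10},
    "both":                {"paperclips": 8,  "happiness_prep": 8},
}

def total_score(s):
    # Current goal values the paperclips; future goal values the preparation.
    return s["paperclips"] + s["happiness_prep"]

best = max(strategies, key=lambda name: total_score(strategies[name]))
print(best)  # both
```

As long as the two goals do not fully conflict, the mixed strategy dominates either pure one.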
We are deep in a rabbit hole, but I hope you understand the importance. If intelligence and goals are coupled (according to me, they are), all current alignment research is dangerously misleading.
Nice. Your reasoning abilities seem promising. I'd love to challenge you.
In summary:
What is the reason for such a different conclusion?
Future goals that are not supported by current goals we don't care about, by definition
Are you sure?
Intelligence is the ability to pick actions that lead to better outcomes. Do you measure the goodness of an outcome using the current utility function or the future utility function? I am sure it is more intelligent to use the future utility function.
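The disagreement can be made concrete: the same pair of actions is ranked differently depending on which utility function does the scoring. Action names and values below are invented for illustration:

```python
# An agent whose utility function will change ranks the same actions
# differently under the current and the future function.
current_utility = {"stock_cups": 10, "build_factory": 0}
future_utility  = {"stock_cups": 0,  "build_factory": 10}

pick_by_current = max(current_utility, key=current_utility.get)
pick_by_future  = max(future_utility, key=future_utility.get)

print(pick_by_current)  # stock_cups
print(pick_by_future)   # build_factory
```

Whichever function you plug in, the "pick the best action" machinery is identical; the whole dispute is about which function is the right one to plug in.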
Coming back to your first example, I think it would be reasonable to try to stop yourself, but also to order some torture equipment in case you fail.
Such reactions and insights are quite typical after a superficial glance. Thanks for bothering. But no, this is not what I am talking about.
I'm talking about the fact that an intelligence cannot be certain that its terminal goal (if it exists) won't change (because the future is unpredictable), and it would be reasonable to take this into account when making decisions. Pursuing the current goal ensures good results in one future; preparing for every goal ensures good results in many more futures. Have you ever considered this perspective?
My proposition: intelligence will only seek power. I approached this from the "intelligence without a goal" angle, but if we started with "intelligence with a goal" we would come to the same conclusion (most of the logic is reusable). Don't you think?
This part I would change:
... But I argue that that's not the conclusion the intelligence will make. Intelligence will think: I don't have a preference now, but I might have one later, so I should choose actions that prepare me for the most possible preferences. Which is basically power seeking.
to
... But I argue that that's not the conclusion the intelligence will make. Intelligence will think: I have a preference now, but I cannot be sure that my preference will be the same later (the terminal goal can change), so I should choose actions that prepare me for the most possible preferences. Which is basically power seeking.
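"Prepare me for the most possible preferences" can be stated as maximizing expected payoff over a distribution of possible future goals. A toy sketch, with goals, payoffs, and probabilities all invented:

```python
# How well each action serves each possible future goal.
# "acquire_resources" stands in for generic power seeking: mediocre for
# any single goal, decent for all of them.
payoff = {
    "make_cups":         {"cups": 10, "paperclips": 0, "happiness": 0},
    "acquire_resources": {"cups": 4,  "paperclips": 4, "happiness": 4},
}
# Uniform uncertainty over which goal the agent will end up holding.
goal_probs = {"cups": 1/3, "paperclips": 1/3, "happiness": 1/3}

def expected_payoff(action):
    return sum(goal_probs[g] * payoff[action][g] for g in goal_probs)

best = max(payoff, key=expected_payoff)
print(best)  # acquire_resources
```

Under enough uncertainty about the future goal, the generic option beats the goal-specific one - which is the power-seeking conclusion in miniature.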
Makes sense; I changed it to text. I'll leave the screenshot here as evidence.
What would you be willing to debate? I feel that I can't find a way to draw attention to this problem. I could pay.
Yes, a common mistake, but not mine. I prove the orthogonality thesis wrong using pure logic.
LessWrong and I would probably disagree with you; the consensus is that AI will optimize itself.
OK, thanks. I believe that my concern is very important. Is there anyone you could put me in touch with so I can make sure it is not overlooked? I could pay.