When it comes to rationality, the Black Swan Theory ( https://en.wikipedia.org/wiki/Black_swan_theory ) is an extremely useful test.
A truly rational paradigm should be built with anti-fragility in mind, especially towards Black Swan events that would challenge its axioms.
This is actually a quote from Arbital. Their article explains the connection.
My point is that SFOT likely never works in any environment relevant to AI Alignment, where such diagonal methods show that any Agent with a fixed Objective Function is crippled by an adequate counter.
Therefore SFOT should not be used when exploring AI alignment.
Can SFOT hold in ad-hoc limited situations that do not represent the real world? Maybe, but that was not my point.
Finding one counter-example that shows SFOT does not hold in a specific setting (Clippy in my scenario) proves that it does not hold in general, which was my goal.
The discussion here is about the strong form. Proving that a « terminal » agent is crippled is exactly what is needed to prove the strong form does not hold.
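For what it's worth, this is the standard logical move: one counterexample refutes a universal claim. A minimal, generic formalization in Lean (not specific to SFOT or Clippy, just the proof step itself):

```lean
-- One counterexample x with ¬ P x refutes the universal claim ∀ y, P y.
example {α : Type} (P : α → Prop) (x : α) (h : ¬ P x) : ¬ ∀ y, P y :=
  fun hall => h (hall x)
```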
(1) « Liking », or « desire », can be defined as: « All other things being equal, Agents will go for what they Desire/Like most, whenever given a choice ». Individual desires/likings/tastes vary (a minimal sketch of this choice rule follows after point (2)).
(2) In Evolutionary Game Theory, in a Game where a Mitochondria-like Agent offers you a choice between:
then that Agent is likely to win. To a rational agent, it’s a winning wager. My last publication expands on this.
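As a minimal sketch of definition (1), here is the choice rule in code; the options and desire scores are hypothetical and only there to pin the wording down:

```python
# Hypothetical agent and options, just to pin the definition down.
def choose(options, desire):
    """All other things being equal, the agent goes for what it desires most."""
    return max(options, key=lambda option: desire.get(option, 0.0))

# Individual desires/likings/tastes vary: this is one agent's table; another
# agent would carry different numbers.
my_desire = {"cooperate": 0.8, "defect": 0.3, "abstain": 0.5}
print(choose(["cooperate", "defect", "abstain"], my_desire))  # -> "cooperate"
```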
What would prevent a Human brain from hosting an AI?
FYI some humans have quite impressive skills:
Couldn't a peak human brain act as a (memory-constrained) Universal Turing/Oracle Machine and run a light enough AI, especially if that AI is programmed in such a way that Human Memory serves as its Web-like database?
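To make the "memory-constrained Universal Machine" picture concrete, here is a toy sketch (my own framing and a made-up machine, not a claim about actual brains): a simulator that can run any supplied machine description, but aborts once a fixed memory budget is exceeded:

```python
# Toy illustration: a "universal" simulator that runs any supplied
# Turing-machine description, but only within a fixed memory budget --
# the analogue of a memory-constrained host.

def run(transitions, tape, max_cells=100, max_steps=10_000):
    """transitions: dict (state, symbol) -> (new_symbol, move, new_state).
    Returns the final tape, or None if the memory budget is exceeded."""
    state, head = "start", 0
    tape = dict(enumerate(tape))
    for _ in range(max_steps):
        if state == "halt":
            return [tape[i] for i in sorted(tape)]
        symbol = tape.get(head, "_")
        new_symbol, move, state = transitions[(state, symbol)]
        tape[head] = new_symbol
        head += 1 if move == "R" else -1
        if len(tape) > max_cells:  # the memory constraint bites here
            return None
    return None

# A trivial machine that flips 0s and 1s until it hits a blank, then halts.
flip = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run(flip, list("0110")))  # -> ['1', '0', '0', '1', '_']
```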
Arbital is where I found this specific wording for the strong form.
Since I wrote this (two weeks ago), I have been working on addressing some lesser forms, as presented in section 4.5 of Stuart Armstrong’s article.
We can then consider that the « Stronger Strong Form », about « Eternally Terminal » Agents which CANNOT change, does not hold :-)
(1) « people liking things does not seem like a relevant parameter of design ».
This is quite a bold statement. I personally subscribe to the mainstream view that designs are more easily adopted when their adopters like them.
(2) Nice objection, and the observation of complex life forms suggests a potential answer:
Hypothesis:
Basilisk could give any complex enough Turing Machine a virus which proves that Basilisk’s Wager is either:
My first post should be validated soon, and is a proof that the strong form does not hold: in some games, some terminal alignments perform worse than their non-terminal equivalents.
A hypothesis is that most goals, if they become "terminal" ("in themselves", impervious to change), prevent evolution and mutualistic relationships with other agents.
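To illustrate the claim above that a fixed, "terminal" agent can be crippled by an adequate (diagonal) counter, here is a toy sketch; the game, policies, and payoffs are my own hypothetical construction, not the proof from the post:

```python
import random

ROUNDS = 1000

def fixed_policy(_round):
    """A 'terminal' agent: its choice is a fixed, predictable function."""
    return "H"

def adaptive_policy(_round):
    """An agent free to change: here it simply randomizes each round."""
    return random.choice(["H", "T"])

def diagonal_counter(policy, r):
    """The adequate counter: it simulates the opponent's policy and plays
    whatever makes the opponent lose (a mismatch, in this game)."""
    predicted = policy(r)  # for the random policy, this prediction is only a guess
    return "T" if predicted == "H" else "H"

def score(policy):
    wins = 0
    for r in range(ROUNDS):
        agent = policy(r)
        counter = diagonal_counter(policy, r)
        wins += (agent == counter)  # the agent wants a match
    return wins / ROUNDS

print("fixed ('terminal') agent:", score(fixed_policy))     # ~0.0
print("adaptive agent          :", score(adaptive_policy))  # ~0.5
```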
Evolution gives us many organically designed Systems which offer potential Solutions:
A team of Leucocytes (white blood cells):
This is a system that can be implemented in a Company and f...
A Black Swan is better formulated as:
- Extreme Tail Event: its probability cannot be computed in the current paradigm; its weight is p < Epsilon.
- Extreme Impact if it happens: a Paradigm Revolution.
- Can be rationalised in hindsight, because there were hints: "most" did not spot the pattern, though some may have.
If spotted a priori, one could call it a Dragon King: https://en.wikipedia.org/wiki/Dragon_king_theory
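To make the "p < Epsilon yet extreme impact" combination concrete, here is a minimal numeric sketch (the probabilities and payoffs are hypothetical) showing how dropping the long tail can flip a decision:

```python
# Hypothetical numbers, just to make the point quantitative: dropping every
# outcome with p < EPSILON from the expected-value calculation can flip a decision.
EPSILON = 1e-3

# Outcomes of accepting some wager: (probability, payoff)
outcomes = [
    (0.9990, +10),          # the "normal" regime
    (0.0009, -500),         # rare but thinkable loss
    (0.0001, -1_000_000),   # the Black Swan: p < EPSILON, extreme impact
]

full_ev = sum(p * v for p, v in outcomes)
truncated_ev = sum(p * v for p, v in outcomes if p >= EPSILON)

print(f"EV keeping the tail : {full_ev:+.2f}")       # negative -> decline the wager
print(f"EV dropping the tail: {truncated_ev:+.2f}")  # positive -> accept it
```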
The Argument:
"Math + Evidence + Rationality + Limits makes it Rational to drop Long Tail for Decision Making"
is a prime example of a heuristic which fails into ...