Thank you for your question!
I agree that the simulations need to have sufficient complexity. Indeed, that was one of the main reasons I became interested in creating multi-objective benchmarks in the past. Various AI safety toy problems seemed to me so simplified that they lacked essential objectives and other decisive nuances. This is still very much one of my main driving motivations.
That being said, complexity also has downsides:
1) Complexity introduces confounding factors. When a model fails such a benchmark, it is not clear whether...
I think your own message is also too extreme to be rational, so it seems to me that you are fighting fire with fire. Yes, Remmelt uses some extreme expressions, but you definitely use extreme expressions here too, while having even weaker arguments.
Could we find a golden middle road, a common ground, please? With more reflective thinking and less focus on who is right and who is wrong?
I agree that Remmelt can improve the message. And I believe he will do that.
I may not agree that we are going to die with 99% probability. At the same time I find that his curr...
The following is meant as a question to find out, not a statement of belief.
Nobody seems to have mentioned the possibility that initially they did not intend to fire Sam, but only to warn him or to give him the choice to restrain himself. Possibly he himself escalated the matter to firing, or chose firing over complying with the restraint. He might have done that precisely in order to bring about the consequences that have now taken place, giving him more power.
For example, people in power positions may escalate disagreements because that is territory they are more experienced with than their opponents are.
The paper is now published with open access here:
https://link.springer.com/article/10.1007/s10458-022-09586-2
I propose that blacklists are less useful if they are about proxy measures, and much more useful if they are about ultimate objectives. Some of the ultimate objectives can also be represented in the form of blacklists. For example, listing the many ways to kill a person is less useful, but saying that death or violence is to be avoided is more useful.
I imagine that the objectives which fulfill the human needs for Power (control over the AI), Self-Direction (autonomy, freedom from too much influence by the AI), and maybe others, would also partially work towards ensuring that the AI does not start moving towards wireheading. Wireheading would surely contradict these objectives.
If we consider wireheading as a process, not a black-and-white event, then there are steps along the way. These steps could potentially be detected or even foreseen before the process settles into a new equilibrium.
A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios?
I imagine that an algorithmically more trivial way to make the system both "honest" and "secured" is to make it so heavily secured that almost certainly nobody can access the diamond.
While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
You can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after the transformation. When the transformation is applied to the Q values, the aggregation takes place quite late in the process: as Ben said, during action selection.
Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but they have different philosophical interpretations and also lead to different experimental outcomes in agent behaviour. I think both approaches need more research.
For example, I wou...
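In case it helps, below is a minimal, purely illustrative sketch of the two placements of the transformation in a tabular multi-objective Q-learning setting. The concave transform, the summation-based aggregation, and the update rules are all illustrative assumptions here, not the specific choices from our experiments:

```python
# Illustrative sketch only: comparing where the nonlinear transform is applied
# in a multi-objective tabular Q-learning setup. All names are hypothetical.
import numpy as np

N_STATES, N_ACTIONS, N_OBJECTIVES = 10, 4, 2
ALPHA, GAMMA = 0.1, 0.95

def transform(x):
    # Example concave transform that penalises low values more strongly.
    return -np.exp(-x)

def aggregate(values):
    # Example aggregation across objectives (could also be min, soft-min, ...).
    return values.sum(axis=-1)

# Option A: transform the rewards, then learn a single Q-table over the
# already transformed and aggregated reward signal.
q_a = np.zeros((N_STATES, N_ACTIONS))

def update_a(s, a, rewards, s_next):
    target = aggregate(transform(np.asarray(rewards))) + GAMMA * q_a[s_next].max()
    q_a[s, a] += ALPHA * (target - q_a[s, a])

# Option B: keep one Q-table per objective on raw rewards; transform and
# aggregate only at action-selection time ("late" aggregation).
q_b = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))

def select_action_b(s):
    scores = aggregate(transform(q_b[s]))  # shape: (N_ACTIONS,)
    return int(np.argmax(scores))

def update_b(s, a, rewards, s_next):
    a_next = select_action_b(s_next)  # bootstrap w.r.t. the aggregated policy
    target = np.asarray(rewards) + GAMMA * q_b[s_next, a_next]
    q_b[s, a] += ALPHA * (target - q_b[s, a])
```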
Yes, maybe the minimum cost is 3 even without floor or ceiling? But the question is then how to find concrete solutions that can be proven with realistic effort. I interpret the challenge as a request for submission of concrete solutions, not just theoretical ones. Anyway, my finding is below; maybe it can be improved further. And could there be any way to emulate floor or ceiling using the functions permitted in the initial problem formulation?
By the way, for me the >! works reliably when entered right at the beginning of the message. After a newline it does not work reliably.
ceil(3!! * sqrt(sqrt(5! / 2 + 2)))
If you would allow the ceiling function, then I could give you a solution with a score of 60 for Puzzle 1. Ceiling and floor functions are cool because they add even more branches to the search, and enable involving irrational number computations too. :P Though you might want to restrict the number of ceiling or floor functions permitted per solution.
By the way, please share a hint about how to enter spoilers here?
Submitting my post for early feedback in order to improve it further:
Abstract.
Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and too general approach of naive maximisation strategies. For example, the 100 paperclip scenario is easily solved by the proposed framework...
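Since the concrete formula is not quoted in this excerpt, the following is only a generic, hypothetical illustration of the intended contrast (not the formula proposed in the post): a maximiser's utility keeps growing in the number of paperclips n, while a satisficing utility saturates at the target and does not reward overshooting.

```latex
% Hypothetical illustration only, not the formula proposed in the post.
% Maximising objective: utility grows without bound in the number of paperclips n.
U_{\mathrm{max}}(n) = n
% One possible satisficing shape: utility saturates at the target of 100 paperclips,
% and overshooting is penalised with weight \lambda \ge 0 rather than rewarded.
U_{\mathrm{sat}}(n) = \min(n, 100) - \lambda \, \max(n - 100, 0)
```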
It looks like there is so much information on this page that trying to edit the question kills the browser.
An additional idea: in addition to supporting configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of the shutdown instead of acting deterministically?
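A minimal sketch of what I mean, assuming a hypothetical confirmation callback and a configurable fallback behaviour (all names here are illustrative, not from any existing framework):

```python
# Illustrative sketch only: a shutdown request that is confirmed interactively,
# with the fallback behaviour (comply / refuse / defer) coming from configuration
# rather than being hard-coded. All names are hypothetical.
from enum import Enum

class DefaultBehaviour(Enum):
    COMPLY = "comply"   # shut down if no confirmation arrives
    REFUSE = "refuse"   # keep running if no confirmation arrives
    DEFER = "defer"     # pause and re-ask later

def handle_shutdown_request(confirm, default=DefaultBehaviour.COMPLY, timeout_s=30):
    """confirm() should return True/False, or None if the operator did not answer in time."""
    answer = confirm(f"Shutdown requested. Confirm within {timeout_s}s? [y/n]")
    if answer is True:
        return "shutting down"
    if answer is False:
        return "shutdown cancelled by operator"
    # No answer: fall back to the configured default instead of acting deterministically.
    return {
        DefaultBehaviour.COMPLY: "shutting down (default)",
        DefaultBehaviour.REFUSE: "continuing (default)",
        DefaultBehaviour.DEFER: "pausing and re-asking later (default)",
    }[default]
```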
I have a question about the shutdown button scenario.
Vika has already mentioned that interruptibility is ambivalent and that information about the desirability of enabling interruptions needs to be externally provided.
I think the same observation applies to corrigibility: the agent should accept goal changes only from certain external agents, and even then only in certain situations, and not accept them in other cases. If I break the vase intentionally (to create a kaleidoscope), it should keep this new state as the new desired state. But if I or a child breaks the vase acci...
You might be interested in Prospect Theory:
https://en.wikipedia.org/wiki/Prospect_theory
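For reference, the value function from Tversky and Kahneman's cumulative prospect theory weighs losses more heavily than gains relative to a reference point; the commonly cited parameter estimates are α ≈ β ≈ 0.88 and λ ≈ 2.25:

```latex
% Prospect theory value function over gains and losses x relative to a reference point,
% with diminishing sensitivity (\alpha, \beta) and loss aversion (\lambda).
v(x) =
\begin{cases}
x^{\alpha} & \text{if } x \ge 0, \\
-\lambda \, (-x)^{\beta} & \text{if } x < 0.
\end{cases}
```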
Hello!
Here are my submissions for this time. They are all strategy related.
The first one is a project for popularising AI safety topics. The text itself is not technical in content, but the project it proposes is still technological.
https://medium.com/threelaws/proposal-for-executable-and-interactive-simulations-of-ai-safety-failure-scenarios-7acab7015be4
As a bonus, I would add a couple of non-technical ideas about possible economic or social partial solutions for slowing down the AI race (which would allow more time for solving AI alignment):
https://m...
For people who become interested in the topic of side effects and whitelists, I would add links to a couple of additional articles from my own past work on related subjects that you might find interesting - for developing the ideas further, for discussion, or for cooperation:
The principles are based mainly on the idea of competence-based whitelisting and preserving reversibility (keeping the future options open) as the primary goal of AI, while all task-b...
A question: can one post multiple initial applications, each less than a page long? Is there a limit on the total volume?
Hey! I believe we were in the same IRC channel at that time, and I also read your story back then. I still remember some of it. What is the backstory? :)
Hello! Thanks for the prize announcement :)
Hope these observations and clarifying questions are of some help:
https://medium.com/threelaws/a-reply-to-aligned-iterated-distillation-and-amplification-problem-points-c8a3e1e31a30
Summary of potential problems spotted regarding the use of AlphaGoZero:
Hello!
I have significantly elaborated and extended my article on self-deception over the last couple of months (before that it was about two pages long).
"Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes"
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7
I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article with the applicability of other known AI problems.
Additio...
Why should one option exclude the other?
Having blinders on would not be so good either.
I propose that with proper labeling both options can be implemented, so that people can decide for themselves what to pay attention to and what to develop further.
Besides potential solutions that are oriented towards being robust to scale, I would like to emphasise that there are also failure modes that are robust to scale - that is, problems which do not go away when the resources are scaled up:
Fundamental limits to computation due to fundamental limits to attention-like processes:
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7
Hello Scott! You might be interested in my proposals for AI goal structures that are designed to be robust to scale:
Using homeostasis-based goal structures:
https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd
and
Permissions-then-goals based AI user “interfaces” + legal accountability:
https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe
Hello! My newest proposal:
https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd
I would like to propose a certain kind of AI goal structure as an alternative to utility-maximisation-based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at the strong-AI level and also far below it, so it scales well. The main idea is to replace utility maximisation with the concept of homeostasis.
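To illustrate the kind of contrast I have in mind (this is only a simplified illustration, not the exact formulation from the article): a maximising objective grows without bound in the variables it tracks, whereas a homeostatic objective rewards keeping each regulated variable x_i near its setpoint x_i*.

```latex
% Simplified illustration of the contrast; the weights w_i and the quadratic
% deviation penalty are illustrative choices, not a definitive formulation.
R_{\mathrm{max}}(x) = \sum_i w_i \, x_i
\qquad \text{vs.} \qquad
R_{\mathrm{homeo}}(x) = -\sum_i w_i \, \bigl(x_i - x_i^{*}\bigr)^{2}
```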
I agree, it sounds plausible that this could happen, just as we humans may build a strongly optimising agent because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.
That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything; therefore there is no competition between different required properties. Required properties are conjunctive, the... (read more)