All of Roland Pihlakas's Comments + Replies

I agree, it sounds plausible that this could happen. Likewise, we humans may build a strongly optimising agent simply because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.

That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything, so there is no competition between different required properties. Required properties are conjunctive, the... (read more)

Thank you for your question!

I agree that the simulations need to have sufficient complexity. Indeed, that was one of the main reasons I became interested in creating multi-objective benchmarks in the past. Various AI safety toy problems seemed to me so simplified that they lacked essential objectives and other decisive nuances. This is still very much one of my main driving motivations.

That being said, complexity also has downsides:
1) Complexity introduces confounding factors. When a model fails such a benchmark, it is not clear whether... (read more)

4Charlie Steiner
I agree it's a good point that you don't need the complexity of the whole world to test ideas. With an environment that is fairly small in terms of number of states, you can encode interesting things in a long sequence of states so long as the generating process is sufficiently interesting. And adding more states is itself no virtue if it doesn't help you understand what you're trying to test for. Some out-of-order thoughts:
* Testing for 'big' values, e.g. achievement, might require complex environments and evaluations. Not necessarily large state spaces, but the complexity of differentiating between subtle shades of value (which seems like a useful capability to be sure we're getting) has to go somewhere.
* Using more complicated environments that are human-legible might better leverage human feedback and/or make sense to human observers - maybe you could encode achievement in the actions of a square in a gridworld, but maybe humans would end up making a lot of mistakes when trying to judge the outcome. If you want to gather data from humans, to reflect a way that humans are complicated that you want to see if an AI can learn, a rich environment seems useful. On the other hand, if you just want to test general learning power, you could have a square in a gridworld have random complex decision procedures and see if they can be learned.
* There's a divide between contexts where humans are basically right, and so we just want an AI to do a good job of learning what we're already doing, and contexts where humans are inconsistent, or disagree with each other, where we want an AI to carefully resolve these inconsistencies/disagreements in a way that humans endorse (except also sometimes we're inconsistent or disagree about our standards for resolving inconsistencies and disagreements!). Building small benchmarks for the first kind of problem seems kind of trivial in the fully-observed setting where the AI can't wirehead. Even if you try to emulate the partial observability of t

I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt uses some extreme expressions, but you definitely use extreme expressions here too, while having even weaker arguments.

Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong?

I agree that Remmelt can improve the message. And I believe he will do that.

I may not agree that we are going to die with 99% probability. At the same time I find that his curr... (read more)

1Remmelt
BTW if anyone does want to get into the argument, Will Petillo’s Lenses of Control post is a good entry point.  It’s concise and correct – a difficult combination to achieve here. 
-2Remmelt
Right – this comes back to actually examining people's reasoning. Relying on the authority status of an insider (who dismissed the argument) or on your 'crank vibe' of the outsider (who made the argument) is not a reliable way of checking whether a particular argument is good.

IMO it's also fine to say "Hey, I don't have time to assess this argument, so for now I'm going to go with these priors that seemed to broadly kinda work in the past for filtering out poorly substantiated claims. But maybe someone else actually has a chance to go through the argument, I'll keep an eye open."

I'm putting these quotes together because I want to check whether you're tracking the epistemic process I'm proposing here.

Reasoning logically from premises is necessarily black-and-white thinking. Either the truth value is true or it is false. A way to check the reasoning is to first consider the premises (in how they are described using defined terms, do they correspond comprehensively enough with how the world works?). And then check whether the logic follows from the premises through to each next argument step until you reach the conclusion. Finally, when you reach the conclusion, and you could not find any soundness or validity issues, then that is the conclusion you have reasoned to.

If the conclusion is that it turns out impossible for some physical/informational system to meet several specified desiderata at the same time, this conclusion may sound extreme. But if you (and many other people in the field who are inclined to disagree with the conclusion) cannot find any problem with the reasoning, the rational thing would be to accept it, and then consider how it applies to the real world.

Apparently, computer scientists hotly contested CAP theorem for a while. They wanted to build distributed data stores that could send messages that consistently represented new data entries, while the data was also made continuously available throughout the network, while the network

The following is meant as a question to find out, not a statement of belief.

Nobody seems to have mentioned the possibility that initially they did not intend to fire Sam, but only to warn him or to give him a choice to restrain himself. Possibly he himself escalated the situation to firing, or chose firing over complying with the restraint. He might have done that precisely in order to bring about the consequences that have now taken place, giving him more power.

For example, people in positions of power may escalate disagreements, because that is territory they are more experienced with than their opponents are.

I propose that blacklists are less useful if they are about proxy measures, and much more useful if they are about ultimate objectives. Some ultimate objectives can also be represented in the form of blacklists. For example, listing the many ways to kill a person is less useful; saying that death or violence is to be avoided is more useful.

I imagine that the objectives which fulfill the human needs for Power (control over the AI), Self-Direction (autonomy, freedom from too much influence from the AI), and maybe others, would also partially work towards ensuring that the AI does not start moving towards wireheading. Wireheading would surely contradict these objectives.

If we consider wireheading as a process, not a black-and-white event, then there are steps along the way. These steps could potentially be detected or even foreseen before the process settles into a new equilibrium.

A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios? 

I imagine that an algorithmically simpler way to make the system both "honest" and "secured" is to make it so heavily secured that almost certainly nobody can access the diamond.

1Roland Pihlakas
The paper is now published with open access here:  https://link.springer.com/article/10.1007/s10458-022-09586-2 

You can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after the transformation. When the transformation is applied to the Q values, the aggregation takes place quite late in the process - as Ben said, during action selection.

Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but they have different philosophical interpretations and also lead to different experimental outcomes in agent behaviour. I think both approaches need more research.

For example, I wou... (read more)
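To make the difference between the two approaches concrete, here is a minimal sketch (my own illustration, not code from the paper) contrasting the two orderings in a multi-objective setting. The concave `transform` and the toy reward numbers are assumptions made purely for the example; the point is only the ordering of transformation versus accumulation.

```python
import numpy as np

def transform(x, k=1.0):
    # An assumed concave (diminishing-returns) nonlinearity; the concrete
    # choice of transformation is a free parameter of the framework.
    return 1.0 - np.exp(-k * np.maximum(x, 0.0))

# Undiscounted per-step rewards for two objectives over a short episode,
# shape (n_steps, n_objectives). Purely illustrative numbers.
episode_rewards = np.array([[1.0, 0.0],
                            [1.0, 0.0],
                            [0.0, 2.0]])

# Approach A: transform each reward first, then accumulate per objective.
q_transformed_early = transform(episode_rewards).sum(axis=0)

# Approach B: accumulate raw rewards into per-objective returns (Q values)
# first, and apply the transformation only at action-selection time.
q_transformed_late = transform(episode_rewards.sum(axis=0))

# In both approaches the aggregation across objectives happens only after
# the transformation, but the two orderings generally give different
# utilities, because the nonlinearity does not commute with summation.
print(q_transformed_early.sum())  # transform, then accumulate and aggregate
print(q_transformed_late.sum())   # accumulate, then transform and aggregate
```

The two printed numbers differ, which is the kind of behavioural divergence between the reward-transforming and Q-value-transforming variants mentioned above.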

Yes, maybe the minimum cost is 3 even without floor or ceiling? But the question is then how to find concrete solutions that can be proven with realistic effort. I interpret the challenge as a request for submission of concrete solutions, not just theoretical ones. Anyway, my finding is below; maybe it can be improved further. And could there be any way to emulate floor or ceiling using the functions permitted in the initial problem formulation?

By the way, for me the >! works reliably when entered right at the beginning of the message. After a newline it does not work reliably.

 ceil(3!! * sqrt(sqrt(5! / 2 + 2)))
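A quick way to check the value of the expression above (assuming 3!! here is meant as (3!)! = 720 rather than the double factorial 3·1 = 3):

```python
import math

# Evaluate ceil((3!)! * sqrt(sqrt(5! / 2 + 2)))
value = math.factorial(math.factorial(3)) * math.sqrt(math.sqrt(math.factorial(5) / 2 + 2))
print(value, math.ceil(value))  # approximately 2020.37, so the ceiling is 2021
```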

If you allowed the ceiling function, then I could give you a solution with score 60 for Puzzle 1. Ceiling and floor functions are cool because they add even more branches to the search, and they also enable involving irrational-number computations. :P Though you might want to restrict the number of ceiling or floor functions permitted per solution.

By the way, please share a hint about how to enter spoilers here?
 

4Scott Garrabrant
Typing >! at the start of a line makes a spoiler

Submitting my post for early feedback in order to improve it further:

Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility.

Abstract.

Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and overly general approach of naive maximisation strategies. For example, the 100-paperclip scenario is easily solved by the proposed framework... (read more)
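As a rough illustration of the direction (my own sketch, not the concrete formula from the linked draft): each objective contributes utility with exponentially diminishing returns, and the objectives are combined conjunctively, so driving any single objective to an extreme buys almost nothing once the other objectives lag behind. The function names and constants below are assumptions made for the example.

```python
import numpy as np

def diminishing(x, k=1.0):
    # Exponentially diminishing returns: bounded in [0, 1), nearly linear
    # for small x, saturating for large x.
    return 1.0 - np.exp(-k * np.maximum(x, 0.0))

def conjunctive_utility(objective_values, k=1.0):
    # Conjunctive aggregation: the product is high only when *every*
    # objective is reasonably satisfied, which discourages maximising one
    # objective (e.g. paperclips) at the expense of the others.
    return float(np.prod(diminishing(np.asarray(objective_values), k)))

print(conjunctive_utility([100.0, 1.0, 1.0]))  # one extreme objective: ~0.40
print(conjunctive_utility([2.0, 2.0, 2.0]))    # all moderately satisfied: ~0.65
```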

It looks like there is so much information on this page that trying to edit the question kills the browser.

An additional idea: in addition to supporting configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of shutdown instead of proceeding deterministically?

1TurnTrout
Oops! :) Can you expand?

I have a question about the shutdown button scenario.

Vika has already mentioned that interruptibility is ambivalent, and that information about the desirability of enabling interruptions needs to be provided externally.

I think the same observation applies to corrigibility - the agent should accept goal changes only from some external agents, and even then only in some situations, while rejecting them in other cases: If I break the vase intentionally (to create a kaleidoscope), it should keep this new state as the new desired state. But if I or a child breaks the vase acci... (read more)

1TurnTrout
We have a choice here: "solve complex, value-laden problem" or "undertake cheap preparations so that the agent doesn’t have to deal with these scenarios". Why not just run the agent from a secure server room where we look after it, shutting it down if it does bad things?
1Roland Pihlakas
It looks like there is so much information on this page that trying to edit the question kills the browser. An additional idea: in addition to supporting configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of shutdown instead of proceeding deterministically?

You might be interested in Prospect Theory:

https://en.wikipedia.org/wiki/Prospect_theory

Hello!

Here are my submissions for this time. They are all strategy related.

The first one is a project for popularising AI safety topics. The text itself is not technical in content, but the project itself is still technological.

https://medium.com/threelaws/proposal-for-executable-and-interactive-simulations-of-ai-safety-failure-scenarios-7acab7015be4

As a bonus, I would add a couple of non-technical ideas about possible economic or social partial solutions for slowing down the AI race (which would allow more time for solving AI alignment):

https://m... (read more)

For people who become interested in the topic of side effects and whitelists, I would add links to a couple of additional articles from my own past work on related subjects - for developing the ideas further, for discussion, or for cooperation:


https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe

The principles are based mainly on the idea of competence-based whitelisting and preserving reversibility (keeping the future options open) as the primary goal of AI, while all task-b... (read more)

A question: can one post multiple initial applications, each less than a page long? Is there a limit on the total volume?

Hey! I believe we were in the same IRC channel at that time, and I also read your story back then. I still remember some of it. What is the backstory? :)

Hello! Thanks for the prize announcement :)

Hope these observations and clarifying questions are of some help:

https://medium.com/threelaws/a-reply-to-aligned-iterated-distillation-and-amplification-problem-points-c8a3e1e31a30

Summary of potential problems spotted regarding the use of AlphaGoZero:

  • Complete visibility vs Incomplete visibility.
  • Almost complete experience (self-play) vs Once-only problems. Limits of attention.
  • Exploitation (a game match) vs Exploration (the real world).
  • Having one goal vs Having many conjunctive goals. Also, having utility maximisa
... (read more)

Hello!

I have significantly elaborated and extended my article on self-deception over the last couple of months (before that it was about two pages long).

"Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes"

https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7

I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article to the applicability of other known AI problems.

Additio... (read more)

Why should one option exclude the other?

Having blinders on would not be so good either.

I propose that with proper labeling both options can be implemented, so that people can decide for themselves what to pay attention to and what to develop further.

Besides potential solutions that are oriented towards being robust to scale, I would like to emphasise that there are also failure modes that are robust to scale - that is, problems which do not go away with scaling up the resources:
Fundamental limits to computation due to fundamental limits to attention-like processes:
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7

Hello Scott! You might be interested in my proposals for AI goal structures that are designed to be robust to scale:

Using homeostasis-based goal structures:

https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd

and

Permissions-then-goals based AI user “interfaces” + legal accountability:

https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe

Hello! My newest proposal:

https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd

I would like to propose a certain kind of AI goal structure as an alternative to utility-maximisation-based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at the strong-AI level and also far below it, so it scales well. The main idea is to replace utility maximisation with the concept of homeostasis.
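For intuition, here is a minimal sketch of the contrast (my own illustration, with an assumed quadratic penalty around a setpoint; the linked article motivates the idea in more depth): a maximiser always prefers more of its target quantity, while a homeostatic agent is penalised for overshooting the setpoint as well as for undershooting it.

```python
def maximiser_utility(x: float) -> float:
    # Unbounded maximisation: more is always better.
    return x

def homeostatic_utility(x: float, setpoint: float = 10.0) -> float:
    # Homeostasis: deviations from the setpoint in either direction are bad,
    # so there is no incentive to push x to an extreme.
    return -((x - setpoint) ** 2)

for x in (5.0, 10.0, 1000.0):
    print(x, maximiser_utility(x), homeostatic_utility(x))
```

Because the homeostatic utility is bounded above at the setpoint, seizing ever more resources to push the quantity further brings no additional reward.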

3cousin_it
Acknowledged, thank you!