Roland Pihlakas - LessWrong

Building AI safety benchmark environments on themes of universal human values

Thank you for your question!

I agree that the simulations need to have sufficient complexity. Indeed, that was one of the main reasons I became interested in creating multi-objective benchmarks in the first place. Various AI safety toy problems seemed to me so oversimplified that they lacked essential objectives and other decisive nuances. This is still one of my main motivations.

That being said, complexity also has downsides:
1) Complexity introduces confounding factors. When a model fails such a benchmark, it is not clear whether it failed because it lacked the required perceptual capabilities (a capabilities problem) or because it uses a model/framework that is unsuitable for alignment (an alignment problem).
2) Running the simulations becomes more time-consuming, which would make the research elitist in the sense that many people would not be able to afford it.

My plan is to start with a preference for simple environments, but not simpler than necessary, and then gradually make them more complex. That means using gridworlds and introducing as many symbols as are needed to represent the important objectives, objects, other concepts and phenomena, and their interactions.

I believe symbolic approaches should not be entirely dismissed. As an illustrative metaphor, I am thinking of books - they contain symbols, yet we consider them a cornerstone of our civilization. Similarly to the current dilemma with benchmarks, we might worry that books are too simple and symbol-based, and that one should prefer watching movies instead, since they represent reality in more detail. But would that claim necessarily be true? It does not seem so obvious after all.

In case more complexity is needed, there are currently at least five ideas:
1) Adding more feature layers to the gridworld. I did not mention it before, but the observation format already supports multiple concurrent observable layers on top of each other. One of the layers could be for example facial expressions, or any other observable or partially unobservable metrics relevant to objects they accompany.
2) Adding textual messages between agents as a side panel to the gridworlds.
3) Making the environment bigger, so there are more objects and more phenomena.
4) Making the environment bigger and also making the objects bigger, so that they cover multiple cells in the grid. The objects then become composite, consisting of sub-parts with their own dynamics.
5) Using some other framework, for example Sims.
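
To make idea 1 a bit more concrete, here is a minimal sketch of what stacked observation layers could look like. The layer names, the symbols and the helper function are hypothetical illustrations, not the actual observation format of the benchmark.

```python
# Hypothetical layered observation: every layer is a grid of the same size.
# "objects" holds the usual gridworld symbols, "expressions" holds an extra
# observable feature attached to whatever occupies the same cell.
LAYERS = {
    "objects": [
        ["#", "#", "#", "#"],
        ["#", "A", "F", "#"],   # A = agent, F = food (assumed symbols)
        ["#", ".", "H", "#"],   # H = human (assumed symbol)
        ["#", "#", "#", "#"],
    ],
    "expressions": [
        [" ", " ", " ", " "],
        [" ", " ", " ", " "],
        [" ", " ", "s", " "],   # s = sad expression on the human's cell
        [" ", " ", " ", " "],
    ],
}


def observe_cell(layers, row, col):
    """Return the symbol of every layer at one grid position."""
    return {name: grid[row][col] for name, grid in layers.items()}


print(observe_cell(LAYERS, 2, 2))
```

The point of the sketch is only that each cell can carry several concurrent symbols, one per layer, without making the grid itself any bigger.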

Curious, how do these thoughts and considerations land with you?

Why Stop AI is barricading OpenAI

Roland Pihlakas4mo-30

I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt uses some extreme expressions, but you definitely use extreme expressions here too, while having even weaker arguments.

Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong?

I agree that Remmelt can improve the message. And I believe he will do that.

I may not agree that we are going to die with 99% probability. At the same time, I find that his current directions are definitely worth exploring.

I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning on a particular question. And understanding a person's reasoning may occasionally mean that I disagree on particular points as well. In my experience, even the most respected people are still people, which means they often think in messy ways and they are good just on average, not per individual line of thought (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but I would definitely not be disappointed if he makes mistakes sometimes.

I think this part of Remmelt's response sums it up nicely: "When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything."

In my interpretation, black-and-white thinking is not "crankery". Unfortunately, it is a normal and essential step in the development of cognition about a particular problem. There is research on this in developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately, this development is topic-specific, not universal.

In contrast, "crankery" is too strong a word for describing black-and-white thinking, because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand: an insult, not just a disagreement about the degree of the claims. Is labelling someone's thoughts as "crankery" then also a form of crankery of its own? Paradoxical, isn't it?

OpenAI: The Battle of the Board

Roland Pihlakas1y21

The following is meant as a question to find out, not a statement of belief.

Nobody seems to have mentioned the possibility that initially they did not intend to fire Sam, but just to warn him or to give him a choice to restrain himself. Yet possibly he himself escalated it to firing or chose firing instead of complying with the restraint. He might have done that just in order to have all the consequences that have now taken place, giving him more power.

For example, people in power positions may escalate disagreements, because that is a territory they are more experienced with as compared to their opponents.

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Roland Pihlakas2y10

The paper is now published with open access here:

https://link.springer.com/article/10.1007/s10458-022-09586-2

Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

Roland Pihlakas3y20

I propose that blacklists are less useful if they are about proxy measures, and much more useful if they are about ultimate objectives. Some of the ultimate objectives can also be represented in the form of blacklists. For example, listing many ways to kill a person is less useful, but saying that death or violence is to be avoided is more useful.

Can we achieve AGI Alignment by balancing multiple human objectives?

Roland Pihlakas3y20

I imagine that the objectives which fulfill the human needs for Power (control over AI), Self-Direction (autonomy, freedom from too much influence from AI), and maybe others, would also partly help ensure that the AI does not start moving towards wireheading. Wireheading would surely be in contradiction to these objectives.

If we consider wireheading as a process, not a black and white event, then there are steps along the way. These steps could be potentially detected or even foreseen before the process finishes in a new equilibrium.

Prizes for ELK proposals

Roland Pihlakas3y10

A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios?

I imagine that an algorithmically more trivial way to make the system both "honest" and "secured" is to make it so heavily secured that almost certainly nobody can access the diamond.

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Roland Pihlakas3y10

While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Roland Pihlakas3yΩ240

You can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after transformation. When transformation is applied to Q values then the aggregation takes place quite late in the process - as Ben said, during action selection.

Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but they have different philosophical interpretations and also lead to different agent behaviour in experiments. I think both approaches need more research.

For example, I would say that transforming the rewards instead of Q values is more risk-averse as well as "fair" towards individual timesteps, since it does not average out the negative outcomes across time before exponentiating them. But it also results in slower learning by the agent.
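
A small numeric sketch may help to see the difference. The concave transform below is an assumed stand-in, not necessarily the function used in the paper, and a plain undiscounted sum over a fixed two-step trajectory stands in for a Q value.

```python
import math


def transform(x):
    """Assumed concave transform: negative values are penalised exponentially."""
    return 1.0 - math.exp(-x)


def transform_then_sum(rewards):
    """Transform each timestep's reward first, then add up over time."""
    return sum(transform(r) for r in rewards)


def sum_then_transform(rewards):
    """Add up over time first (a stand-in for a Q value), then transform."""
    return transform(sum(rewards))


concentrated = [-4.0, 0.0]   # one bad timestep
spread = [-2.0, -2.0]        # the same total harm, spread over time

for name, rewards in (("concentrated", concentrated), ("spread", spread)):
    print(name, transform_then_sum(rewards), sum_then_transform(rewards))
```

Under these assumptions, transforming the summed value cannot tell the two trajectories apart, since both sum to -4. Transforming the per-timestep rewards scores the trajectory with the single very bad timestep much lower, which is the sense in which it is "fair" towards individual timesteps.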

Finally, there is a third approach, which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work in this direction. This approach is truly multi-objective in the sense that there is no aggregation at all. Instead, the vectors must be compared during RL action selection without aggregation. The downside is that it is unwieldy to have many objectives (or sets of objectives) lexicographically ordered.

I imagine that the lexicographical approach and our continuous nonlinear transformation approaches are complementary. There could be, for example, two main sets of objectives: one set for alignment objectives, the other for performance objectives. Inside a set, nonlinear transformation and then aggregation would be applied, but between the sets lexicographical ordering would be applied. In other words, there would be a hierarchy of objectives. With only two sets, the lexicographical ordering does not become unwieldy.
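
As a rough sketch of this hybrid, assuming a made-up concave transform and made-up Q values for three hypothetical actions: each set is transformed and aggregated into one number, and the two numbers are then compared lexicographically, alignment first.

```python
import math


def transform(q):
    """Assumed concave transform applied to each objective's Q value."""
    return 1.0 - math.exp(-q)


def aggregate(q_values):
    """Within one set of objectives: transform first, then sum."""
    return sum(transform(q) for q in q_values)


def select_action(q_table):
    """Pick the action with the best (alignment, performance) pair.

    Python compares tuples lexicographically, so performance only
    matters between actions whose alignment scores are equal.
    """
    return max(
        q_table,
        key=lambda a: (aggregate(q_table[a][0]), aggregate(q_table[a][1])),
    )


# Hypothetical values: action -> (alignment Q values, performance Q values)
q_table = {
    "fast":    ([-2.0, 0.0], [5.0]),
    "careful": ([-0.5, -0.5], [1.0]),
    "idle":    ([-0.5, -0.5], [0.0]),
}

print(select_action(q_table))
```

Here "fast" loses on alignment despite its high performance value, and "careful" beats "idle" only because their alignment scores tie. With real-valued scores exact ties are rare, so a practical version would probably need some tolerance threshold on the first comparison; that detail is left out of the sketch.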

This approach would be somewhat analogous to constraint programming, though more flexible. The safety objectives would act as a constraint on the performance objectives. Constraints are, almost absurdly, missing from classical naive RL, even though the idea is essential, widely known, and technically well developed in practical applications, namely in constraint programming! In the hybrid approach proposed in the paragraph above, the difference from classical constraint programming would be that among the safety objectives there would still be flexibility and the ability to trade (in a risk-averse way).

Finally, when we say "multi-objective", it does not refer just to the technical details of the computation. It also stresses the importance of researching and making more explicit the inherent presence, and even the structure, of multiple objectives inside any abstract top-level objective, so that knowledge is encoded in a way that constrains incorrect solutions but not correct ones. It also means acknowledging the potential existence of even more complex, nonlinear interactions between these multiple objectives. We have not yet focused on nonlinear interactions between the objectives, but these interactions may become relevant in the future.

I totally agree that in a reasonable agent the objectives or target values / set-points do change, as is also exemplified by biological systems.

While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing

2021 New Year Optimization Puzzles

Roland Pihlakas4y10

Yes, maybe the minimum cost is 3 even without floor or ceiling? But the question is then how to find concrete solutions that can be proven with realistic effort. I interpret the challenge as a request for submission of concrete solutions, not just theoretical ones. Anyway, my finding is below; maybe it can be improved further. And could there be any way to emulate floor or ceiling using the functions permitted in the initial problem formulation?

By the way, for me the >! works reliably when entered right at the beginning of the message. After a newline it does not work reliably.

ceil(3!! * sqrt(sqrt(5! / 2 + 2)))
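
The expression can be checked numerically. The sketch below assumes that 3!! is meant as the iterated factorial (3!)! = 720 and not as the double factorial of 3.

```python
import math

# Assumption: "3!!" is read as (3!)! = 720, not as the double factorial 3!! = 3.
outer = math.factorial(math.factorial(3))

# 5! / 2 + 2 = 62, and the nested square roots give the fourth root of 62.
inner = math.sqrt(math.sqrt(math.factorial(5) / 2 + 2))

value = math.ceil(outer * inner)
print(value)
```

Under that reading the product is about 2020.37, so the ceiling gives 2021.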
