
Assessors that are hard to seduce

5 Stuart_Armstrong 09 March 2015 02:19PM

A putative new idea for AI control; index here.

This is a stub of an idea, looking at "human in the loop"-type restrictions.

An assessor is a particular type of restriction. It involves a (generally) human agent, who is tasked with observing something (the behaviour of the agent, or maybe the expected outcome of its plans), and then deciding on whether to allow the AI to proceed or not.

Many of the methods suitable for preventing the hacking of restrictions can also be used to protect assessors. Suppose the AI has three options: "hold a gun to the assessor's head", "seduce the assessor" and "do what the assessor wants". The first two options (the first more clearly than the second) are behaviours that a) would be stupid if there were no assessor, and b) are very informative as to whether the assessor exists.

However, there are a few more steps we can take with assessors. First of all, the AI need not know which particular human will be the assessor, forcing it to take general human behaviour into account. This is more effective if the assessing process does not allow any real interaction between AI and assessor.

Indeed, the assessor need not be a human: it could be a superintelligent agent itself (a hypothetical one, if need be) that has its own motivation.

Lastly, since the assessors are agents and can therefore reach their own predictions, we could require that the assessor's impressions be accurate predictions of what the AI intends to do (again, this could be done by having a hypothetical superintelligence assess the accuracy of the assessors' expected predictions). We'll look at ideas about modelling and model accuracy in a subsequent post.

Assessing Kurzweil: the gory details

14 Stuart_Armstrong 15 January 2013 02:29PM

This post goes along with this one, which was merely summarising the results of the volunteer assessment. Here we present the further details of the methodology and results.

Kurzweil's predictions were decomposed into 172 separate statements, taken from the book "The Age of Spiritual Machines" (published in 1999). Volunteers were requested on Less Wrong and on reddit.com/r/futurology. 18 people initially volunteered to do varying amounts of assessment of Kurzweil's predictions; 9 ultimately did so.

Each volunteer was given a separate randomised list of the numbers 1 to 172, with instructions to go through the statements in the order given by the list and give their assessment of the correctness of the prediction (the exact instructions are at the end of this post). They were to assess the predictions on the following five point scale:

  • 1=True, 2=Weakly True, 3=Cannot decide, 4=Weakly False, 5=False

Each volunteer assessed a varying number of predictions, giving 531 assessments in total, for an average of 59 assessments per volunteer (the maximum attempted was all 172 predictions, the minimum was 10). They generally followed the randomised order correctly: there were three out-of-order assessments (assessing prediction 36 instead of 38, 162 instead of 172, and missing out 75). Since the number of errors was very low, and they seemed accidental, I decided that this would not affect the randomisation and kept those answers in.
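As a rough illustration of the setup described above, here is a minimal Python sketch of how a separate randomised ordering of the numbers 1 to 172 could be produced for each volunteer, together with a check of the average quoted. The post does not say how the lists were actually generated, so the function names, the per-volunteer seed and the use of Python's `random` module are all assumptions made for illustration only.

```python
import random

NUM_PREDICTIONS = 172  # number of statements, as given in the post

# The five-point scale given in the post.
SCALE = {
    1: "True",
    2: "Weakly True",
    3: "Cannot decide",
    4: "Weakly False",
    5: "False",
}


def randomised_order(volunteer_seed, n=NUM_PREDICTIONS):
    """Return a random permutation of 1..n for one volunteer.

    `volunteer_seed` is a hypothetical per-volunteer seed, used here
    only so that each list is reproducible; it is not from the post.
    """
    rng = random.Random(volunteer_seed)
    order = list(range(1, n + 1))
    rng.shuffle(order)
    return order


def mean_assessments(total_assessments, num_volunteers):
    """Average number of assessments per volunteer."""
    return total_assessments / num_volunteers


if __name__ == "__main__":
    order = randomised_order(volunteer_seed=1)
    print(order[:10])                # first ten statements for this volunteer
    print(mean_assessments(531, 9))  # figures taken from the post
```

Because every volunteer works through their own permutation, a volunteer who stops early has still assessed a random subset of the statements rather than just the first few in the book's order.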

The assessments (anonymised) can be found here.
