Patrik Bartak

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Nov 8, 2023•49

39 karma

Member for 3 years

Patrik Bartak — LessWrong

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak, Sam F. Brown

This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and present evaluation results for state-of-the-art models. The work presented in this post was done by the AISHL group of Felix Hofstätter, Harriet Wood, Louis Thomson, Oliver Jaffe, and Patrik Bartak under the supervision of Francis Rhys Ward and with help from Sam Brown (not an AISHL participant).

1 Introduction

For a long time, the alignment community has discussed the possibility that advanced AI systems may...
