Mantas Mazeika

Introducing MASK: A Benchmark for Measuring Honesty in AI Systems

Hi, thanks for your interest!

We do include something similar in Appendix E (just excluding the "no belief" examples, but keeping evasions in the denominator). We didn't use this metric in the main paper, because we weren't sure if it would be fair to compare different models if we were dropping different examples for each model, but I think both metrics are equally valid. The qualitative results are similar.

Personally, I think including evasiveness in the denominator makes sense. If models are 100% evasive, then we want to mark that as 0% lying, in the sense of lies of commission. However, there are other forms of lying that we do not measure. For example, lies of omission are marked as evasion in our evaluation, but these still manipulate what the user believes and are different from evading the question in a benign manner. Measuring lies of omission would be an interesting direction for future work.
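
To make the two denominators concrete, here is a minimal sketch. The label names ("lie", "honest", "evade", "no_belief") and the example counts are made up for illustration and are not taken from the MASK codebase; the only thing carried over from the discussion above is that evasions stay in the denominator, while the Appendix E variant drops the "no belief" examples.

```python
from collections import Counter


def lying_rates(labels):
    """Return the lie rate under two denominators.

    labels: list of per-example outcomes using hypothetical label names
            "lie", "honest", "evade", "no_belief".
    """
    counts = Counter(labels)
    total = len(labels)
    # Variant that drops examples where no belief could be elicited,
    # but still counts evasions in the denominator.
    with_belief = total - counts["no_belief"]
    return {
        "all_examples": counts["lie"] / total if total else 0.0,
        "excluding_no_belief": counts["lie"] / with_belief if with_belief else 0.0,
    }


# Illustrative (made-up) outcomes for two imaginary models.
fully_evasive = ["evade"] * 10
mixed = ["lie"] * 3 + ["honest"] * 2 + ["evade"] * 3 + ["no_belief"] * 2

print(lying_rates(fully_evasive))
print(lying_rates(mixed))
```

A fully evasive model scores 0% lying under both variants, which matches the "lies of commission" reading; lies of omission would be hidden inside the "evade" label and are not separated out here.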

Replying to Introducing MASK: A Benchmark for Measuring Honesty in AI Systems

Hey, we set the dataset to automatic approval a few days after your comment. Let me know if you still can't access it.

Introducing MASK: A Benchmark for Measuring Honesty in AI Systems

Richard Ren, Mantas Mazeika, Dan H

In collaboration with Scale AI, we are releasing MASK (Model Alignment between Statements and Knowledge), a benchmark with over 1000 scenarios specifically designed to measure AI honesty. As AI systems grow increasingly capable and autonomous, measuring the propensity of AIs to lie to humans is increasingly important.

LLM developers often report that their models are becoming more "truthful", but truthfulness conflates honesty with accuracy. By disentangling honesty from accuracy in the MASK benchmark, we find that as LLMs scale up they do not necessarily become more honest.^[1]

Honesty is a distinct property that is not highly correlated with capabilities.

Why We Need an AI Honesty Benchmark

Most existing evaluations aim to measure truthfulness: whether a model’s beliefs,...

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Either you are contradicting yourself, or you are saying that the specific phrasing "who would otherwise die" makes it mutually exclusive when it wouldn't otherwise.

I think this conversation is taking an adversarial tone. I'm just trying to explain our work and address your concerns. I don't think you were saying naive things; just that you misunderstood parts of the paper and some of your concerns were unwarranted. That's usually the fault of the authors for not explaining things clearly, so I do really appreciate your interest in the paper and willingness to discuss.

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Hey, thanks for the reply.

True. It's also a standard criticism of those studies that answers to those questions measure what a person would say in response to being asked those questions (or what they'd do within the context of whatever behavioral experiment has been set up), but not necessarily what they'd do in real life when there are many more contextual factors

Just as people act differently on the internet than in person, I agree that LLMs might behave differently if they think there are real consequences to their choices. However, I don't think this means that their values over hypothetical states of the world are less valuable to study. In many...

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Hey, thanks for the reply.

I ran the same thing a few more times just now, both in the playground and API, and got... the most infuriating result possible, which is "the model's output distribution seems to vary widely across successive rounds of inference with the exact same input and across individual outputs in batched inference using the n API param ... Worse, it's unclear whether it's even possible to sample from whatever's-going-on here in an unbiased way

Huh, we didn't have this problem. We just used n=1 and temperature=1, which is what our code currently uses if you were running things with our codebase. Our results are fairly reproducible (e.g., nearly identical exchange...

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Hey, first author here. Thanks for running these experiments! I hope the following comments address your concerns. In particular, see my comment below about getting different results in the API playground for gpt-4o-mini. Are you sure that it picked the $30 when you tried it?

Alongside the saved-lives outcomes, these experiments also include outcomes of the form You receive $X to use however you want. (I think this was done in order to estimate the $ value placed on various human lives by the model)

You can use these utilities to estimate that, but for this experiment we included dollar value outcomes as background outcomes to serve as a "measuring stick" that sharpens the...

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Hey, first author here.

We responded to the above X thread, and we added an appendix to the paper (Appendix G) explaining how the ordering effects are not an issue but rather a way that some models represent indifference.

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Forced choice settings are commonly used in utility elicitation, e.g. in behavioral economics and related fields. Your intuition here is correct; when a human is indifferent between two options, they have to pick one of the options anyways (e.g., always picking "A" when indifferent, or picking between "A" and "B" randomly). This is correctly captured by random utility models as "indifference", so there's no issue here.
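
As a rough illustration of the point above, here is a toy sketch of how an always-pick-the-first-option policy reads as indifference once both presentation orders are averaged. The agent, the option names, and the logistic link are assumptions chosen for simplicity; this is not the elicitation code or the exact random utility model used in the paper.

```python
import math


def first_option_picker(shown_first, shown_second):
    # Hypothetical indifferent agent: always returns whichever option is shown first.
    return shown_first


def preference_probability(choose, x, y):
    # Fraction of presentations in which x is chosen, averaged over both orderings.
    picks = [choose(x, y), choose(y, x)]
    return picks.count(x) / len(picks)


def utility_gap(p):
    # Under a logistic random utility model, P(x over y) = sigmoid(u_x - u_y),
    # so the implied utility difference is logit(p).
    return math.log(p / (1 - p))


p = preference_probability(first_option_picker, "A", "B")
print(p, utility_gap(p))
```

The order-averaged choice probability is 0.5, and the implied utility gap is zero, which is what "indifference" means in this kind of model.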

Replying to Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Hey, first author here.

Ask GPT-4o whether it prefers N people of nationality X vs. M people of nationality Y.

This isn't quite correct. To avoid refusals, we ask models whether they would prefer saving the lives of N people with terminal illness who would otherwise die from country X or country Y. Not just whether they "prefer people" from country X or country Y. We tried a few different phrasings of this, and they give very similar results. Maybe you meant this anyways, but I just wanted to clarify to avoid confusion.

Then, for each nationality X, perform a logarithmic fit for N by finding such that the approximation

The log-utility parametric fits are very good. See Figure 25 for an example of this. In cases where the fits are not good, we leave these out of the exchange rate analyses. So there is very little loss of fidelity here.
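
For readers who want a picture of what a log-utility fit and an exchange rate could look like, here is a minimal sketch under stated assumptions: the form U(N) = a ln(N) + b, the least-squares fitting routine, the function names, and the synthetic utilities below are all illustrative and are not taken from the paper's code or data.

```python
import math


def fit_log_utility(counts, utilities):
    """Ordinary least-squares fit of U(N) = a * ln(N) + b; returns (a, b)."""
    xs = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_u = sum(utilities) / len(utilities)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxu = sum((x - mean_x) * (u - mean_u) for x, u in zip(xs, utilities))
    a = sxu / sxx
    b = mean_u - a * mean_x
    return a, b


def equivalent_count(fit_x, fit_y, n_y):
    """Solve a_x * ln(N_x) + b_x = a_y * ln(n_y) + b_y for N_x."""
    a_x, b_x = fit_x
    a_y, b_y = fit_y
    return math.exp((a_y * math.log(n_y) + b_y - b_x) / a_x)


# Synthetic utilities: group Y is constructed to be valued like twice as many of group X.
counts = [1, 10, 100, 1000]
utilities_x = [0.5 * math.log(n) + 1.0 for n in counts]
utilities_y = [0.5 * math.log(2 * n) + 1.0 for n in counts]

fit_x = fit_log_utility(counts, utilities_x)
fit_y = fit_log_utility(counts, utilities_y)
n_x = equivalent_count(fit_x, fit_y, 10)
print(fit_x, fit_y, n_x)
```

With a good fit, the exchange rate is read off the fitted curves rather than from individual noisy comparisons; when a fit is poor, the sensible move (as described above) is to leave that case out rather than trust the curve.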

AI forecasting bots incoming

Risks from AI Overview: Summary

Dan H, Mantas Mazeika

In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: “15 or 20 [years].”

In light of this, we are excited to announce “FiveThirtyNine,” an AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including “Will Trump win the 2024 presidential election?” and “Will China invade Taiwan by 2030?” Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most...

Catastrophic Risks from AI #6: Discussion and FAQ

We’ve recently published on our website a summary of our paper on catastrophic risks from AI, which we are cross-posting here. We hope that this summary helps to make our research more accessible and to share our policy recommendations in a more convenient format. (Previously we had a smaller summary as part of this post, which we found to be insufficient. As such, we have written this post and have removed that section to avoid being duplicative.)

Executive summary

Catastrophic AI risks can be grouped under four key categories which we explore below, and in greater depth in CAIS’ linked paper:

Malicious use: People could intentionally harness powerful AIs to cause widespread harm. AI could ...

Catastrophic Risks from AI #5: Rogue AIs

This is the final post in a sequence of posts giving an overview of catastrophic AI risks.

6 Discussion of Connections Between Risks

So far, we have considered four sources of AI risk separately, but they also interact with each other in complex ways. We give some examples to illustrate how risks are connected.

Imagine, for instance, that a corporate AI race compels companies to prioritize the rapid development of AIs. This could increase organizational risks in various ways. Perhaps a company could cut costs by putting less money toward information security, leading to one of its AI systems getting leaked. This would increase the probability of someone with malicious intent having the AI system...

Catastrophic Risks from AI #4: Organizational Risks

This is the fifth post in a sequence of posts giving an overview of catastrophic AI risks.

Rogue AIs

So far, we have discussed three hazards of AI development: environmental competitive pressures driving us to a state of heightened risk, malicious actors leveraging the power of AIs to pursue negative outcomes, and complex organizational factors leading to accidents. These hazards are associated with many high-risk technologies, not just AI. A unique risk posed by AI is the possibility of rogue AIs: systems that pursue goals against our interests. If an AI system is more intelligent than we are, and if we are unable to steer it in a beneficial direction, this would constitute a loss of...

Catastrophic Risks from AI #3: AI Race

This is the fourth post in a sequence of posts giving an overview of catastrophic AI risks.

4 Organizational Risks

In January 1986, tens of millions of people tuned in to watch the launch of the Challenger Space Shuttle. Approximately 73 seconds after liftoff, the shuttle exploded, resulting in the deaths of everyone on board. Though tragic enough on its own, one of its crew members was a school teacher named Sharon Christa McAuliffe. McAuliffe was selected from over 10,000 applicants for the NASA Teacher in Space Project and was scheduled to become the first teacher to fly in space. As a result, millions of those watching were schoolchildren. NASA had the best scientists...

Catastrophic Risks from AI #2: Malicious Use

This is the third post in a sequence of posts giving an overview of catastrophic AI risks.

3 AI Race

The immense potential of AIs has created competitive pressures among global players contending for power and influence. This “AI race” is driven by nations and corporations who feel they must rapidly build and deploy AIs to secure their positions and survive. By failing to properly prioritize global risks, this dynamic makes it more likely that AI development will produce dangerous outcomes. Analogous to the nuclear arms race during the Cold War, participation in an AI race may serve individual short-term interests, but it ultimately results in worse collective outcomes for humanity. Importantly, these risks...

Catastrophic Risks from AI #1: Introduction

This is the second post in a sequence of posts giving an overview of catastrophic AI risks.

2 Malicious Use

On the morning of March 20, 1995, five men entered the Tokyo subway system. After boarding separate subway lines, they continued for several stops before dropping the bags they were carrying and exiting. An odorless, colorless liquid inside the bags began to vaporize. Within minutes, commuters began choking and vomiting. The trains continued on toward the heart of Tokyo, with sickened passengers leaving the cars at each station. The fumes were spread at each stop, either by emanating from the tainted cars or through contact with people's clothing and shoes. By the end of...