tchauvin — LessWrong


Replying to Predicting AI Releases Through Side Channels

Predicting AI Releases Through Side Channels

Nice attempt. This reminds me of the Pizza Meter and Gay Bar Index related to Pentagon crisis situations. I found it hard to find reliable information on this when I looked (I can't even find a good link to share), but the mechanism seems plausible.

tchauvin 1y

In general, the hacking capabilities of state actors and the likely involvement of national security when we get closer to AGI feel like significant blind spots of LessWrong discourse.

(The Hacker and The State by Ben Buchanan is a great book to learn about the former)


End-to-end hacking with language models

tchauvin

Cross-posted from https://tchauvin.com/end-to-end-hacking-with-language-models

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort.

Thanks to JS Denain and Léo Grinsztajn for valuable feedback on drafts of this post.

[Image: representation of this post in hacker aesthetic]

How close are we to autonomous hacking agents, i.e. AI agents that can surpass humans in cyber-offensive capabilities?

I studied this in the summer of 2023 at MATS (mentored by Jeffrey Ladish). I wrote scaffolding to connect GPT-4 to a Kali Linux VM via a terminal interface, and had GPT-4 (acting as an agent) attempt to solve Hack The Box challenges.

As I've moved on to other work, this writeup comes 7 months late. This is an informal post where I share my takeaways from...

tchauvin 2y

If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored

I think the first part of the sentence is true, but "not defense favored" isn't a clear conclusion to me. I think that backdoors work well in closed-source code, but are really hard to pull off in widely used open-source code: just look at the amount of effort that went into the recent xz / liblzma backdoor, and the fact that we don't know of any other backdoor in widely used OSS.

The main effect...

An Overview of AI risks - the Flyer


Charbel-Raphaël, Jonathan Claybrough, tchauvin

EffiSciences recently published a document, aimed at a technical audience, outlining various types of AI risks.

We shared this flyer as part of a hackathon we organized with Entrepreneur First at MetaAI Paris.

Feel free to copy and modify this flyer for your own usage.

Special thanks to Jonathan Claybrough and Timothée Chauvin for their significant contributions. Thanks to Esben Kran for helping with the technical mentoring during the hackathon.

Note that the distribution of this flyer alone was insufficient to communicate all the essential ideas; the one-on-one discussions proved to be extremely important for updates. However, the flyer served its purpose in catalyzing productive conversations.

Replying to Navigating AI Risks (NAIR) #1: Slowing Down AI

tchauvin 3y

Navigating AI Risks (NAIR) #1: Slowing Down AI

The "this is a linkpost for" link is not the correct one.

Replying to Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?

tchauvin 3y

Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?

Here are the same two GIFs but with a consistent speed (30ms/frame) and an infinite loop, in case anyone else wants them, e.g. for presentations:

[GIF: CoinRun training (in-distribution)] [GIF: CoinRun test (out-of-distribution)]

Is there an analysis of the common consideration that splitting an AI lab into two (e.g. the founding of Anthropic) speeds up the development of TAI and therefore increases AI x-risk?

tchauvin

I'm asking because I can think of arguments going both ways.

Note: this post is focused on the generic question "what to expect from an AI lab splitting into two" more than on the specifics of the OpenAI vs Anthropic case.

Here's the basic argument: after splitting, the two groups of individuals are now competing instead of cooperating, with two consequences:

- they will rush faster toward TAI, speeding up TAI timelines (while outside safety work isn't correspondingly sped up);
- by doing so, they will differentially neglect their own safety work compared to their capabilities work.

However, there are some possible considerations against this frame:

- the orgs are competing on safety as well as capabilities, in anticipation of future...

Replying to Are funds (such as the Long-Term Future Fund) willing to give extra money to AI safety researchers to balance for the opportunity cost of taking an "industry" job?

tchauvin Mar 16, 2023

Are funds (such as the Long-Term Future Fund) willing to give extra money to AI safety researchers to balance for the opportunity cost of taking an "industry" job?

I think you can guess a reasonable answer even as a complete outsider (like me), considering the purpose of these funds, which is to maximize the amount of expected good they cause by the allocation of their money. A few things that must come into consideration:

is it productive to pay very good researchers the bare minimum they need to survive? No:
- it consequently makes the path of independent research unattractive to most;
- it produces some level of adverse selection in future applicants, i.e. you'll move toward getting more applications from people who couldn't really be hired in the industry, which isn't always bad but could be thought to correlate with not being an excellent

tchauvin Mar 11, 2023

Hoarding Gmail-accounts in a post-CAPTCHA world?

Google can decide to verify existing accounts. I think it's likely that in its efforts to avoid platforming spam, cybercrime and astroturfing, Google may decide to leave existing "reputable" accounts alone, but will ask for e.g. a phone number or other KYC for existing accounts that were basically never used, as an easy fix to the issue you're describing here.

Replying to By Default, GPTs Think In Plain Sight

tchauvin 3y

By Default, GPTs Think In Plain Sight

Another mitigation strategy against the "LLM learns steganography from another LLM's output" problem is to add some noise to the training data (by applying a number of meaning-preserving transformations) prior to training, to break the embedded steganography schemes. I feel like this should be relatively easy and effective?

Note that the loss would obviously be worse on the outputs of the previous LLM.

Which makes me think that comparing losses could be a way to detect that some text contains steganography, even if we can't pinpoint how exactly (but our transformations break it).

EDIT: in case of disagreement, I can try being more specific about my claim that this should be relatively easy and effective, using concrete examples. Basically I believe that after scrambling, there can only remain a small number of bits of steganographic information per token, even for a steganographic encoding that originally used the null space of natural language optimally.
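To make the scrambling idea concrete, here is a toy sketch in POSIX shell. Everything in it is an assumption made for illustration: the synonym-pair encoding, the word list, and the function names `encode`, `decode` and `scramble` are hypothetical and not taken from the post being discussed. It also swaps random noise for a deterministic canonicalization (always picking the first synonym), so that the result is reproducible; a real pipeline would more plausibly use a paraphrasing model.

```sh
#!/bin/sh
# Hypothetical toy scheme: each hidden bit is the choice between two synonyms.
# In every pair, the left word encodes bit 0 and the right word encodes bit 1.
PAIRS="big:large quick:fast start:begin help:assist"

# encode BIT...: emit one word per bit, taking pairs in order.
encode() {
  out=""
  for pair in $PAIRS; do
    [ "$#" -gt 0 ] || break
    if [ "$1" = 1 ]; then word=${pair#*:}; else word=${pair%%:*}; fi
    out="${out:+$out }$word"
    shift
  done
  printf '%s\n' "$out"
}

# decode TEXT: recover the bit string from the synonym choices.
decode() {
  bits=""
  for word in $1; do
    for pair in $PAIRS; do
      if [ "$word" = "${pair%%:*}" ]; then bits="${bits}0"; fi
      if [ "$word" = "${pair#*:}" ]; then bits="${bits}1"; fi
    done
  done
  printf '%s\n' "$bits"
}

# scramble TEXT: meaning-preserving rewrite that maps every synonym
# to the canonical (left) word of its pair, erasing the hidden choice.
scramble() {
  out=""
  for word in $1; do
    for pair in $PAIRS; do
      if [ "$word" = "${pair#*:}" ]; then word=${pair%%:*}; fi
    done
    out="${out:+$out }$word"
  done
  printf '%s\n' "$out"
}

stego=$(encode 1 0 1 1)
echo "$stego"                       # large quick begin assist
decode "$stego"                     # 1011
decode "$(scramble "$stego")"       # 0000
```

After scrambling, the decoder reads the same bits whatever the hidden message was, so this particular channel carries zero bits per token. An encoding that survives such rewrites would have to live in the meaning itself, which leaves far less room per token.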

Replying to[$10k bounty] Read and compile Robin Hanson’s best posts

tchauvin 4y*

[$10k bounty] Read and compile Robin Hanson’s best posts

EDIT

Thanks for the replies and sorry for the inaccuracies. I initially reported 4,331 blog posts and 890k words; the real results are that Robin wrote 3,302 blog posts (thanks DominikPeters for pointing this out, and for finding these better URLs) and 1.5M words.

(The 4,331 figure corresponds to posts by all authors on overcomingbias. The 890k-word figure doesn't represent anything, because the posts were truncated when accessed from the monthly archive URLs.)

# Get the real number of words from Robin
$ n_current_pages=331
$ echo "https://www.overcomingbias.com/author/robin-hanson" > /tmp/page_urls
$ for i in $(seq 2 "$n_current_pages"); do echo "https://www.overcomingbias.com/author/robin-hanson/page/$i" >> /tmp/page_urls; done
# Count the words in the main content of one page (-s silences curl's progress meter)
$ getwords() { curl -s "$1" | pup '#content' | html2text --ignore-links | wc -w; }
# Export the function so that child shells can call it
$ export -f getwords
$  parallel

tchauvin 5y

How You Can Gain Self Control Without "Self-Control"

Interesting... Can you say more about what your self-control training looked like? Like when in the day, how long, how hard, what tasks, etc? Was the most productive period in your life during or after this training? Why did you stop?

To carry on with the strength training comparison, we're usually trying to achieve a maximum deployed strength over our lifetime. Perhaps we're already deploying as much strength as we can every day for useful tasks, so that adding strength training on pointless tasks would remove strength from the other tasks?

Using spaced repetition to make the most out of blog posts and books

tchauvin

Cross-posted from timot.cool

Spaced repetition (SR) is still an early field of collective experimentation. People have been coming up with many ideas on what to use SR for: trivia like the capitals of the world, foreign vocabulary, their domain of expertise... What I almost never see discussed is the use of SR for content like blog posts and non-fiction books. We're reading them to induce long-term change in our behavior or thinking capabilities, yet these sources of knowledge seemingly don't trigger the SR reflex as much or at all.

Why? Because blog posts and books are mostly not about raw facts, which are the easiest way to get started with SR. Yet I've personally...