elspood — LessWrong

Security Mindset - Fire Alarms and Trigger Signatures

Series Overview and Goals

This is the second in a series of articles about applying traditional security mindset to the problems of alignment and AI research in general. As much as possible, we should try to mine the lessons from the history of security and apply them to the alignment problem. To the extent that security is orthogonal to alignment research, good security practices and leveraging existing security capabilities should still help extend timelines, hopefully long enough to allow time for alignment research to make sufficient progress. At the very least, it would be undignified for existential risk to be realized via a basic, preventable security failure.

Fire Alarms

There may be no fire alarm... (read 952 more words →)

Replying to Conjecture: Internal Infohazard Policy

elspood 3y

This is a great draft and you have collated many core ideas. Thank you for doing this!

As a matter of practical implementation, I think it's a good idea to always have a draft of official, approved statements of capabilities that can be rehearsed by any individual who may find themselves in a situation where they need to discuss them. These statements can be thoroughly vetted ahead of time for second- and higher-order information leakage, so that nobody has to evaluate in real time what a statement might reveal. It can be counterproductive in many circumstances to only be able to say "I can't talk about that". It also gives people a framework to practice... (read more)

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

I'm glad you found it useful, even in this form. If the thing you're working on is something you could share, I'd be happy to offer further assistance, if you like.

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

Obviously this can't be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:

Israeli airport security focuses on behavioral cues, unpredictable questions, and profiling. That reflects a somewhat extreme threat model, with very different base rates to account for (but also much lower traffic volume).
Reinforced cockpit doors address the hijackers-with-guns-and-knives scenario, but they are also a fully general, no-brainer control.
Good police work and better coordination among law enforcement agencies are commonly cited, e.g. in the context of the 9/11 hijackings, as ways to stop a plot before anyone even gets to an airport.

In general, if the airlines had responsibility for security you would see... (read more)

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.

I would like to hear from others about this topic, though; I'm very wary of being at fault for accelerating the doom of humanity.

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

My project seems to have expired from the OWASP site, but here is an interactive version that should have most of the data:

https://periodictable.github.io/

You'll need to mouse over the elements to see the details, so not really mobile friendly, sorry.

I agree that linters are a weak form of automatic verification that is actually quite valuable. You can get a lot of mileage out of simply blacklisting unsafe APIs and a little out of clever pattern matching.
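To make the "blacklisting unsafe APIs" idea concrete, here is a minimal sketch of such a check using Python's standard `ast` module. The denylist contents and the helper names (`BANNED_CALLS`, `dotted_name`, `find_banned_calls`) are assumptions chosen for illustration, not taken from any particular linter.

```python
import ast

# Hypothetical denylist for illustration; a real project would choose its own.
BANNED_CALLS = {"eval", "exec", "os.system", "pickle.loads"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_banned_calls(source):
    """Return sorted (line, name) pairs for each call to a denylisted API."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in BANNED_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


sample = "import os\nos.system('ls')\nvalue = eval(user_input)\nprint('ok')\n"
for line, name in find_banned_calls(sample):
    print(f"line {line}: call to banned API {name}")
```

This also shows why the verification is weak: the check matches names syntactically, so an aliased import such as `from os import system` would slip past it unless more pattern matching is added.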

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

I would say that some formal proofs are actually impossible, but would agree that software with many (or even all) of the security properties we want could actually have formal-proof guarantees. I could even see a path to many of these proofs today.

While the intent of my post was to draw parallel lessons from software security, I actually think alignment is an oblique or orthogonal problem in many ways. I could imagine timelines in which alignment gets 'solved' before software security. In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don't relate to alignment to actually switch their focus to alignment at this point.

Software security is important, but I don't think it's on the critical path to survival unless somehow it is a key defense against takeoff. Certainly many imagined takeoff scenarios are made easier if an AI can exploit available computing, but I think the ability to exploit physics would grant more than enough escape potential.

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure".

It would be nice to be able to have this important impossible thing. :)

I think we are trying to say the same thing, though. Do you agree with this more concise assertion?

"It's not possible to make a high confidence checker system that can analyze an arbitrary specification, but it is probably possible (although very hard) to design systems that can be programmatically checked for the important qualities of alignment that we want, if such qualities can also be formally defined."

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

I would agree that some people figured this out faster than others, but the analogy is also instructive here: if even a small community like the infosec world has a hard time percolating information about failure modes and how to address them, we should expect the average ML engineer to be doing very unsafe things for a very long time by default.

To dive deeper into the XSS example, I think even among those that understood the output encoding and canonicalization solutions early, it still took a while to formalize the definition of an encoding context concisely enough to be able to have confidence that all such edge cases could be covered.
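As a rough illustration of what an "encoding context" means, the sketch below encodes one untrusted value differently depending on where it will land in the page. It uses only the Python standard library; the function names are hypothetical, and it covers only four contexts, so it should be read as a sketch of the idea and not as a complete or vetted encoder.

```python
import html
import json
from urllib.parse import quote


def encode_for_html_text(value):
    """Context: between tags, e.g. <p>HERE</p>. Escapes &, < and >."""
    return html.escape(value, quote=False)


def encode_for_html_attribute(value):
    """Context: inside a quoted attribute, e.g. <a title="HERE">."""
    return html.escape(value, quote=True)


def encode_for_url_component(value):
    """Context: a single query-string value, e.g. ?q=HERE."""
    return quote(value, safe="")


def encode_for_js_string(value):
    """Context: a string literal inside an inline <script> block."""
    literal = json.dumps(value)
    # Also escape characters that could close the surrounding script tag.
    for char, escaped in (("<", "\\u003c"), (">", "\\u003e"), ("&", "\\u0026")):
        literal = literal.replace(char, escaped)
    return literal


untrusted = '"><script>alert(1)</script>'
print(encode_for_html_text(untrusted))
print(encode_for_html_attribute(untrusted))
print(encode_for_url_component(untrusted))
print(encode_for_js_string(untrusted))
```

The same input needs four different outputs, and an encoder that is safe in one context can be unsafe in another, which is part of why the edge cases were hard to enumerate.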

It might... (read more)

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:

Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Others are less so, like Blackhat/DEFCON in the early days, criminal enterprises, etc.

Perhaps here is another opportunity to learn lessons from the security community about what makes a... (read more)

Replying to Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood 4y

Many! Thanks for sharing. This could easily turn into its own post.

In general, I think this is a great idea. I'm somewhat skeptical that this format would generate deep insights; in my experience successful Capture the Flag / wargames / tabletop exercises work best in the form where each group spends a lot of time preparing for their particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I'm generally supportive of trying as many different approaches as possible to see what works best.

Prior to reading your post, my general thoughts about how these kinds of adversarial exercises... (read more)

Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspood

Background

I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I've learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.

Alignment Won't Happen By Accident

I used to use the phrase when teaching security mindset to software developers that "security doesn't happen by accident." A system that isn't explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn't designed to be robust against a certain... (read 1819 more words →)