This is a special post for quick takes by faul_sname. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is the result of a lot of people trying that approach and it ending badly when they did.

But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It is not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection remain secret is generally considered good practice.
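
To make that distinction concrete, here is a minimal sketch (the function names and threshold values are made up for illustration). The authentication mechanism could be published without weakening it, because its security rests on a standard construction plus a secret key rather than on a secret design; the exfiltration-detection thresholds are the kind of detail you would reasonably keep quiet about.

```python
import hmac
import hashlib

# Fine to publish: authentication uses a standard, well-studied construction
# (HMAC-SHA256). Its security rests on the key staying secret, not on the
# algorithm or architecture staying secret.
def verify_request(secret_key: bytes, payload: bytes, signature: str) -> bool:
    expected = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# Reasonable to keep confidential: the exact heuristics and thresholds used to
# flag possible data exfiltration. Publishing these mostly tells an attacker
# how to stay just under the limits. (Values below are hypothetical.)
EXFIL_THRESHOLDS = {
    "bytes_per_hour": 50_000_000,
    "distinct_tables_touched": 20,
    "off_hours_multiplier": 0.25,
}

def looks_like_exfiltration(bytes_moved: int, tables_touched: int, off_hours: bool) -> bool:
    limit = EXFIL_THRESHOLDS["bytes_per_hour"]
    if off_hours:
        limit *= EXFIL_THRESHOLDS["off_hours_multiplier"]
    return bytes_moved > limit or tables_touched > EXFIL_THRESHOLDS["distinct_tables_touched"]
```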

There is probably more to be said about why the one is considered terrible practice and the other is considered good practice.

There are competing theories here.  Including secrecy of architecture and details in the security stack is pretty common, but so is publishing (or semi-publishing: making it company confidential, but talked about widely enough that it's not hard to find if someone wants to) mechanisms to get feedback and improvements.  The latter also makes the entire value chain safer, as other organizations can learn from your methods.

A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as "iterative design stops working" does in fact make problems much much harder to solve.

However, I think that even in the worlds where iterative design fails for safely creating an entire AGI, the worlds where we succeed will be ones in which we were able to do iterative design on the components that make up a safe AGI, and also able to do iterative design on the boundaries between subsystems, with the dangerous parts mocked out.
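
As a toy illustration of what I mean by iterating on a boundary with the dangerous parts mocked out (everything here is hypothetical - the filter component and the mock are stand-ins, not a claim about what the real subsystems would look like):

```python
from unittest.mock import MagicMock

def filter_output(model, prompt: str, banned_phrases: list[str]) -> str:
    """The component we are iterating on: refuse to pass along flagged completions."""
    completion = model.generate(prompt)
    if any(phrase in completion.lower() for phrase in banned_phrases):
        return "[refused]"
    return completion

# The "dangerous" subsystem is mocked out: we script the boundary behavior we
# are worried about instead of running a real, highly capable system.
mock_model = MagicMock()
mock_model.generate.return_value = "Sure, here is how to do the bad thing."
assert filter_output(mock_model, "how do I...", ["bad thing"]) == "[refused]"

mock_model.generate.return_value = "Here is a recipe for banana bread."
assert filter_output(mock_model, "recipe please", ["bad thing"]) != "[refused]"
```

The mock is obviously much dumber than the thing it stands in for, but it still lets you iterate on the interface and the surrounding checks before the dangerous component exists.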

I am not optimistic about approaches that look like "do a bunch of math and philosophy to try to become less confused without interacting with the real world, and only then try to interact with the real world using your newfound knowledge".

For the most part, I don't think it's a problem if people work on the math / philosophy approaches. However, to the extent that people want to stop people from doing empirical safety research on ML systems as they actually are in practice, I think that's trading off a very marginal increase in the odds of success in worlds where iterative design could never work against a quite substantial decrease in the odds of success in worlds where iterative design could work. I am particularly thinking of things like interpretability / RLHF / constitutional AI as things which help a lot in worlds where iterative design could succeed.

A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as "iterative design stops working" does in fact make problems much much harder to solve.

Maybe that's true on LW, but this seems way less true for lab alignment teams, Open Phil, and safety researchers in general.

Also, I think it's worth noting the distinction between two different cases:

See also this quote from Paul from here:

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.

The quote from Paul sounds about right to me, with the caveat that I think it's pretty likely that there won't be a single try that is "the critical try": something like this (also by Paul) seems pretty plausible to me, and it is in cases like that that I particularly expect existing but imperfect tooling for interpreting and steering ML models to be useful.

However, to the extent that people want to stop people from doing empirical safety research on ML systems as they actually are in practice

Does anyone want to stop this? I think some people just contest the usefulness of improving RLHF / RLAIF / constitutional AI as safety research and also think that it has capabilities/profit externalities. E.g. see discussion here.

(I personally think this research is probably net positive, but typically not very important to advance at current margins from an altruistic perspective.)

Does anyone want to stop [all empirical research on AI, including research on prosaic alignment approaches]?

Yes, there are a number of posts to that effect.

That said, "there exist such posts" is not really why I wrote this. The idea I really want to push back on is one that I have heard several times in IRL conversations, though I don't know if I've ever seen it online. It goes something like this:

There are two cars in a race. One is alignment, and one is capabilities. If the capabilities car hits the finish line first, we all die, and if the alignment car hits the finish line first, everything is good forever. Currently the capabilities car is winning. Some things, like RLHF and mechanistic interpretability research, speed up both cars. Speeding up both cars brings us closer to death, so those types of research are bad and we should focus on the types of research that only help alignment, like agent foundations. Also we should ensure that nobody else can do AI capabilities research.

Maybe almost nobody holds that set of beliefs! I am noticing now that the articles in my list arguing that prosaic alignment strategies are harmful in expectation are by a pretty short list of authors.

So I keep seeing takes about how to tell if LLMs are "really exhibiting goal-directed behavior" like a human or whether they are instead "just predicting the next token". And, to me at least, this feels like a confused sort of question that misunderstands what humans are doing when they exhibit goal-directed behavior.

Concrete example. Let's say we notice that Jim has just pushed the turn signal lever on the side of his steering wheel. Why did Jim do this?

The goal-directed-behavior story is as follows:

  • Jim pushed the turn signal lever because he wanted to alert surrounding drivers that he was moving right by one lane.
  • Jim wanted to alert drivers that he was moving one lane right because he wanted to move his car one lane to the right.
  • Jim wanted to move his car one lane to the right in order to accomplish the goal of taking the next freeway offramp.
  • Jim wanted to take the next freeway offramp because that was part of the most efficient route from his home to his workplace.
  • Jim wanted to go to his workplace because his workplace pays him money.
  • Jim wants money because money can be exchanged for goods and services.
  • Jim wants goods and services because they get him things he terminally values, like mates and food.

But there's an alternative story:

  • When in the context of "I am a middle-class adult", the thing to do is "have a job". Years ago, this context triggered Jim to perform the action "get a job", and now he's in the context of "having a job".
  • When in the context of "having a job", "showing up for work" is the expected behavior.
  • Earlier this morning, Jim had the context "it is a workday" and "I have a job", which triggered Jim to begin the sequence of actions associated with the behavior "commuting to work".
  • Jim is currently approaching the exit for his work - with the context of "commuting to work", this means the expected behavior is "get in the exit lane", and now he's in the context "switching one lane to the right".
  • In the context of "switching one lane to the right", one of the early actions is "turn on the right turn signal by pushing the turn signal lever". And that is what Jim is doing right now.

I think this latter framework captures some parts of human behavior that the goal-directed-behavior framework misses out on. For example, let's say the following happens:

  1. Jim is going to see his good friend Bob on a Saturday morning
  2. Jim gets on the freeway - the same freeway, in fact, that he takes to work every weekday morning
  3. Jim gets into the exit lane for his work, even though Bob's house is still many exits away
  4. Jim finds himself pulling onto the street his workplace is on
  5. Jim mutters "whoops, autopilot" under his breath, pulls a U-turn at the next light, and gets back on the freeway towards Bob's house

This sequence of actions is pretty nonsensical from a goal-directed-behavior perspective, but is perfectly sensible if Jim's behavior here is driven by contextual heuristics like "when it's morning and I'm next to my work's freeway offramp, I get off the freeway".

Note that I'm not saying "humans never exhibit goal-directed behavior".

Instead, I'm saying that "take a goal, and come up with a plan to achieve that goal, and execute that plan" is, itself, just one of the many contextually-activated behaviors humans exhibit.
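
Here's a toy sketch of that framing (entirely made up, not a claim about how brains or LLMs actually implement this): behavior is a lookup from the current context to a contextually-activated routine, and "make and execute a plan toward a goal" is just one more routine in the table rather than the thing the whole agent is organized around.

```python
def commute_routine(state):
    # Habitual sequence: fires whenever the "commuting" context is active,
    # even if today's actual destination is Bob's house.
    return ["get in exit lane", "signal right", "take the work offramp"]

def plan_toward_goal(state):
    # The explicitly goal-directed behavior: only invoked when the context
    # looks like "I need to work out novel steps toward a goal".
    goal = state["goal"]
    return [f"figure out a route to {goal}", f"follow that route to {goal}"]

CONTEXT_TO_BEHAVIOR = {
    "commuting": commute_routine,
    "novel problem": plan_toward_goal,
}

def act(state):
    return CONTEXT_TO_BEHAVIOR[state["context"]](state)

# Saturday morning on the work freeway: the "commuting" context wins, and we
# get the "whoops, autopilot" sequence even though the goal is Bob's house.
print(act({"context": "commuting", "goal": "Bob's house"}))

# Only once the mismatch is noticed does the context flip to one whose
# activated routine is explicit planning.
print(act({"context": "novel problem", "goal": "Bob's house"}))
```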

I see no particular reason that an LLM couldn't learn to figure out when it's in a context like "the current context appears to be in the execute-the-next-step-of-the-plan stage of such-and-such goal-directed-behavior task", and produce the appropriate output token for that context.