RobertM

LessWrong dev & admin as of July 5th, 2022.

Comments

RobertM · 40

In general, Intercom is the best place to send us feedback like this, though we're moderately likely to notice a top-level shortform comment.  Will look into it; sounds like it could very well be a bug.  Thanks for flagging it.

RobertM · 60

If you include Facebook & Google (i.e. the entire orgs) as "frontier AI companies", then 6 figures.  If you only include Deepmind and FAIR (and OpenAI and Anthropic), maybe on the order of 10-15k, though who knows what turnover's been like.  Rough current headcount estimates:

Deepmind: 2600 (as of May 2024, includes post-Brain-merge employees)

Meta AI (formerly FAIR): ~1200 (unreliable sources; seems plausible, but is probably an implicit undercount, since they almost certainly rely a lot on various internal infrastructure used by all of Facebook's engineering departments that they'd otherwise need to build/manage themselves.)

OpenAI: >1700

Anthropic: >500 (as of May 2024)

So that's a floor of ~6k current employees.
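(Back-of-the-envelope, using the same numbers as above and treating each as a lower bound:)

```python
# Rough floor on current headcount, summing the lower-bound estimates quoted above.
headcounts = {
    "Deepmind": 2600,        # as of May 2024, includes post-Brain-merge employees
    "Meta AI (FAIR)": 1200,  # unreliable sources
    "OpenAI": 1700,          # ">1700", treated here as a lower bound
    "Anthropic": 500,        # ">500" as of May 2024
}
print(sum(headcounts.values()))  # 6000, i.e. a floor of ~6k
```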

At some point in the last couple of months I was tinkering with a feature that'd try to show you a preview of the section of each linked post that'd be most contextually relevant given where it was linked from, but it was technically fiddly and the LLM reliability wasn't great.  But there might be something there.
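For the curious, a very rough sketch of the shape of the thing (hypothetical names throughout, and `call_llm` is a stand-in for whatever completion API you'd actually wire up; this isn't the implementation I was playing with):

```python
# Hypothetical sketch: pick the section of a linked post most relevant to the
# context the link appears in, and use it as the hover-preview text.

def split_into_sections(post_body: str) -> list[str]:
    # Naive: split on blank lines; a real version would split on headings.
    return [s.strip() for s in post_body.split("\n\n") if s.strip()]

def most_relevant_section(linking_context: str, post_body: str, call_llm) -> str:
    sections = split_into_sections(post_body)
    numbered = "\n\n".join(f"[{i}] {s}" for i, s in enumerate(sections))
    prompt = (
        "A reader followed a link from this passage:\n\n"
        f"{linking_context}\n\n"
        "Below are numbered sections of the linked post. Reply with only the "
        "number of the single most relevant section.\n\n"
        f"{numbered}"
    )
    reply = call_llm(prompt)  # stand-in for a real completion API call
    try:
        return sections[int(reply.strip())]
    except (ValueError, IndexError):
        # This is where the reliability problem bites: the model doesn't always
        # return a clean index, so fall back to the opening section.
        return sections[0]
```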

Yeah, I meant terser compared to typical RLHF'd output from e.g. 4o.  (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).

RobertM · 124

o1's reasoning traces being much terser (sometimes to the point of incomprehensibility) seems predicted by doing gradient updates based only on the quality of the final output, without letting raters see the reasoning traces: the optimization pressure exerted on the cognition behind those traces is almost entirely in the direction of performance, rather than human-readability.
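To make that concrete, here's a toy sketch of what "outcome-only" reward looks like (my own illustration, not anything from OpenAI): the trace only matters insofar as it helps produce the right final answer; nothing scores its readability.

```python
# Toy outcome-only reward: the reasoning trace gets no direct signal.
def outcome_reward(trace: str, final_answer: str, reference_answer: str) -> float:
    # `trace` is deliberately unused: any optimization pressure on it is
    # indirect (whatever helps the final answer), not pressure toward
    # human-readability.
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0
```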

In the short term this might be good news for the "faithfulness" of those traces, but what they're faithful to is the model's ontology (hence less human-readable); see e.g. here and here.

In the long term, if you keep doing pretraining on model-generated traces, you might rapidly find yourself in steganography-land, as the pretraining bakes the previously-externalized cognition into capabilities the model can deploy in a single forward pass, and anything it then externalizes as part of its own chain of thought will be much more compressed (and more alien) than what we see now.

RobertM · 50

I'm just saying it's harder to optimize in the world than to learn human values

Learning what human values are is of course a subset of learning about reality, but it doesn't really have anything to do with alignment (understood as an agent's tendency to optimize for states of the world that humans would find good).

RobertM · 40

alignment generalizes further than capabilities

But this is untrue in practice (observe that models do not suddenly become useless after they're jailbroken) and unlikely in principle, since capabilities come by default when you learn to predict reality, but alignment does not: why would predicting reality lead to having preferences that are human-friendly?  (And the post-training "alignment" that AI labs are performing seems like it'd be quite unfriendly to me, if it did somehow generalize to superhuman capabilities.)  Also, whether or not it's true, it is not something I've heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he'd endorse it or not).

both because verification is way, way easier than generation, plus combined with the fact that we can afford to explore less in the space of values, combined with in practice reward models for humans being easier than capabilities strongly points to alignment generalizing further than capabilities

This is not what "generalizes further" means.  "Generalizes further" means "you get more of it for less work".

RobertM · 6 · -5

A LLM that is to bioengineering as Karpathy is to CS or Three Blue One Brown is to Math makes explanations. Students everywhere praise it. In a few years there's a huge crop of startups populated by people who used it. But one person uses it's stuff to help him make a weapon, though, and manages to kill some people. Laws like 1047 have been passed, though, so the maker turns out to be liable for this.

This still requires that an ordinary person wouldn't have been able to access the relevant information without the covered model (including with the help of non-covered models, which are accessible to ordinary people).  In other words, I think this is wrong:

So, you can be held liable for critical harms even when you supply information that was publicly accessible, if it wasn't information an "ordinary person" wouldn't know.

The bill's text does not constrain the exclusion to information not "known" by an ordinary person, but to information not "publicly accessible" to an ordinary person.  That's a much higher bar given the existence of already quite powerful[1] non-covered models, which make nearly all the information that's out there available to ordinary people.  It looks almost as if it requires the covered model to be doing novel intellectual labor, which is load-bearing for the harm that was caused.

Your analogy fails for another reason: an LLM is not a youtuber.  If that youtuber were doing personalized 1:1 instruction with many people, one of whom went on to make a novel bioweapon that caused hundreds of millions of dollars of damage, it would be reasonable to check that the youtuber was not actually a co-conspirator, or even using some random schmuck as a patsy.  Maybe it turns out the random schmuck was in fact the driving force behind everything, but we find chat logs like this:

  • Schmuck: "Hey, youtuber, help me design [extremely dangerous bioweapon]!"
  • Youtuber: "Haha, sure thing!  Here are step-by-step instructions."
  • Schmuck: "Great!  Now help me design a release plan."
  • Youtuber: "Of course!  Here's what you need to do for maximum coverage."

We would correctly throw the book at the youtuber.  (Heck, we'd probably do that for providing critical assistance with either step, never mind both.)  What does throwing the book at an LLM look like?


Also, I observe that we do not live in a world where random laypeople frequently watch youtube videos (or consume other static content) and then go on to commit large-scale CBRN attacks.  In fact, I'm not sure there's ever been a case of a layperson carrying out such an attack without the active assistance of domain experts for the "hard parts".  This might have been less true of cyber attacks a few decades ago; some early computer viruses were probably written by relative amateurs and caused a lot of damage.  Software security just really sucked.  I would be pretty surprised if it were still possible for a layperson to do something similar today, without doing enough upskilling that they no longer meaningfully counted as a layperson by the time they're done.

And so if a few years from now a layperson does a lot of damage by one of these mechanisms, that will be a departure from the current status quo, where the laypeople who are at all motivated to cause that kind of damage are empirically unable to do so without professional assistance.  Maybe the departure will turn out to be a dramatic increase in the number of laypeople so motivated, or maybe it turns out we live in the unhappy world where it's very easy to cause that kind of damage (and we've just been unreasonably lucky so far).  But I'd bet against those.

ETA: I agree there's a fundamental asymmetry between "costs" and "benefits" here, but this is in fact analogous to how we treat human actions.  We do not generally let people cause mass casualty events because their other work has benefits, even if those benefits are arguably "larger" than the harms.

  1. ^

    In terms of summarizing, distilling, and explaining humanity's existing knowledge.

RobertM · 40

Oh, that's true, I sort of lost track of the broader context of the thread.  Though then the company needs to very clearly define who's responsible for doing the risk evals, and making go/no-go/etc calls based on their results... and how much input do they accept from other employees?
