All of RobertM's Comments + Replies

I know I'm late to the party, but I'm pretty confused by https://www.astralcodexten.com/p/its-still-easier-to-imagine-the-end (I haven't read the post it's responding to, but I can extrapolate).  Surely the "we have a friendly singleton that isn't Just Following Orders from Your Local Democratically Elected Government or Your Local AGI Lab" is a scenario that deserves some analysis...?  Conditional on "not dying" that one seems like the most likely stable end state, in fact.

Lots of interesting questions in that situation!  Like, money still ... (read more)

4cousin_it
For cognitive enhancement, maybe we could have a system like "the smarter you are, the more aligned you must be to those less smart than you"? So enhancement would be available, but would make you less free in some ways.

I was thinking the same thing. This post badly, badly clashes with the vibe of Less Wrong. I think you should delete it, and repost to a site in which catty takedowns are part of the vibe. Less Wrong is not the place for it.

I think this is a misread of LessWrong's "vibes" and would discourage other people from thinking of LessWrong as a place where such discussions should be avoided by default.

With the exception of the title, I think the post does a decent job at avoiding making it personal.

3Holly_Elmore
Yeah actually the employees of Lightcone have led the charge in trying to tear down Kat. It's you who has the better standards, Maxwell, not this site.

Well, that's unfortunate.  That feature isn't super polished and isn't currently in the active development path, but I'll try to see if it's something obvious.  (In the meantime, I'd recommend subscribing to fewer people, or seeing if the issue persists in Chrome.  Other people on the team are subscribed to 100-200 people without obvious issues.)

FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)

Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: "scheming".  I've replaced "scheming" with "scheming behavior".

 

It's somewhat sensitive to the exact objec

... (read more)
2ryan_greenblatt
Someone could have objections to validity or the assumptions of our paper. On validity, something like priming could be relevant. On the assumptions, they could e.g. think scheming is very unlikely due to thinking that future AIs will be intentionally trained to be highly myopic and corrigible while also thinking that other possible sources of goal conflict are very unlikely. (I'd disagree with this view, but I don't think this view is totally crazy and it isn't refuted by our paper.) I think our work doesn't very clearly refute this post, though I also just think the post is missing multiple important considerations and is overall pretty wrong and confused in its arguments.

I'd like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).

Does anyone have links to such public updates?

(Edit log: replaced "scheming" with "scheming behavior".)

FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)

It's somewhat sensitive to the exact objection the person came in with.

My guess is that most reasonable perspectives should update toward thinking scheming has at least a tiny chance of occurring (>2%), but I wouldn't say a view of <<2% was decisively refuted.

5ryan_greenblatt
Quoting Zvi's post: I don't know of any other clear cut cases. The reviews might also be interesting to look at. I'm not sure if Jacob Andreas and Jasjeet Sekhon have publicly stated prior views on the topic. Yoshua Bengio and Rohin Shah were broadly sympathetic to scheming concerns or similar before.

One reason to be pessimistic about the "goals" and/or "values" that future ASIs will have is that "we" have a very poor understanding of "goals" and "values" right now.  Like, there is not even widespread agreement that "goals" are even a meaningful abstraction to use.  Let's put aside the object-level question of whether this would even buy us anything in terms of safety, if it were true.  The mere fact of such intractable disagreements about core philosophical questions, on which hinge substantial parts of various cases for and against doo... (read more)

2Seth Herd
We do have a poor understanding of human values. That's one more reason we shouldn't and probably won't try to build them into AGI. You're expressing a common view among the alignment community. I think we should update from that view to the more likely scenario in which we don't even try to align AGI to human values.

What we're actually doing is training LLMs to answer questions as they were intended, and to follow instructions as they were intended. The AI needs to understand human values to some degree to do that, but training is really focused on those things. There's an interesting bit in this interview with Tan Zhi Xuan on this distinction between theory and practice of training LLMs, and to a lesser degree in their paper.

Not only is that what we are doing for current AI, I think it's both what we should do for future AGI, and what we probably will do. Instruction-following AGI is easier and more likely than value-aligned AGI. It's counterintuitive to think about a highly intelligent agent that wants to do what someone else tells it. But it's not logically incoherent. And when the first human decides what goal to put in the system prompt of the first agent they think might ultimately surpass human competence and intelligence, there's little doubt what they'll put there: "follow my instructions, favoring the most recent". Everything else is a subgoal of that non-consequentialist central goal.

This approach leaves humans in charge, and that's a problem. Ultimately I think that sort of instruction-following intent alignment can be a stepping-stone to value alignment, once we've got a superintelligent instruction-following system to help us with that very difficult problem. But there's neither a need nor an incentive to aim directly at that with our first AGIs. So alignment will succeed or fail on other issues.

Separately, I fully agree that most people who don't believe in AGI x-risk aren't making a true rejection. They usually really don't believe w

I agree that in spherical cow world where we know nothing about the historical arguments around corrigibility, and who these particular researchers are, we wouldn't be able to make a particularly strong claim here.  In practice I am quite comfortable taking Ryan at his word that a negative result would've been reported, especially given the track record of other researchers at Redwood.

at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values d

... (read more)

I mean, yes, but I'm addressing a confusion that's already (mostly) conditioning on building on it.

The /allPosts page shows all quick takes/shortforms posted, though somewhat de-emphasized.

1Knight Lee
Thank you for the help :) By the way, how did you find this message? I thought I already edited the post to use spoiler blocks, and I hid this message by clicking "remove from Frontpage" and "retract comment" (after someone else informed me using a PM). EDIT: dang it I still see this comment despite removing it from the Frontpage. It's confusing.

This doesn't seem like it'd do much unless you ensured that there were training examples during RLAIF which you'd expect to cause that kind of behavior enough of the time that there'd be something to update against.  (Which doesn't seem like it'd be that hard, though I think separately that approach seems kind of doomed - it's falling into a brittle whack-a-mole regime.)

9Daniel Kokotajlo
Indeed, we should get everyone to make predictions about whether or not this change would be sufficient, and if it isn't, what changes would be sufficient. My prediction would be that this change would not be sufficient but that it would help somewhat.

LessWrong doesn't have a centralized repository of site rules, but here are some posts that might be helpful:

https://www.lesswrong.com/posts/bGpRGnhparqXm5GL7/models-of-moderation

https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy

We do currently require content to be posted in English.

"It would make sense to pay that cost if necessary" makes more sense than "we should expect to pay that cost", thanks.

it sounds like you view it as a bad plan?

Basically, yes.  I have a draft post outlining some of my objections to that sort of plan; hopefully it won't sit in my drafts as long as the last similar post did.

(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect no

... (read more)
2Noosphere89
I agree that conditional on that happening, this is plausible, but it's also likely that some of the answers from such a philosophically competent being will be unsatisfying to us. One example is that such a philosophically competent AI might tell you that CEV either doesn't exist, or if it does is so path-dependent that it cannot resolve moral disagreements, which is actually pretty plausible under my model of moral philosophy.

What do people mean when they talk about a "long reflection"?  The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don't make sense to me:

  • A world where we've reliably "solved" for x-risks well enough to survive thousands of years without also having meaningfully solved "moral philosophy" is probably physically realizable, but this seems like a pretty fine needl
... (read more)
2Noosphere89
To answer these questions: one possible answer is that something like CEV does not exist, and yet alignment is still solvable anyway for almost arbitrarily capable AI, which could well happen, and for me personally this is honestly the most likely outcome of what happens by default.

There are arguments against the idea that CEV even exists or is well defined that are important to note, and we shouldn't assume that technological progress equates with progress towards your preferred philosophy:

https://www.lesswrong.com/posts/Y7gtFMi6TwFq5uFHe/some-biases-and-selection-effects-in-ai-risk-discourse#hkoGD6Gwi9YKKZ6S2

https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse

https://joecarlsmith.com/2021/06/21/on-the-limits-of-idealized-values

And there might not be any real justifiable way to resolve disagreements between the philosophies/moralities, either, if there isn't a way to converge to a single morality.
5Vladimir_Nesov
Long reflection is a concrete baseline for indirect normativity. It's straightforwardly meaningful, even if it's unlikely to be possible or a good idea to run in base reality. From there, you iterate to do better.

Path dependence of long reflection could be addressed by considering many possible long reflection traces jointly, aggregating their own judgement about each other to define which traces are more legitimate (as a fixpoint of some voting/preference setup), or how to influence the course of such traces to make them more legitimate. For example, a misaligned AI takeover within a long reflection trace makes it illegitimate, and preventing such is an intervention that improves a trace.

"Locking in" preferences seems like something that should be avoided as much as possible, but creating new people or influencing existing ones is probably morally irreversible, and that applies to what happens inside long reflection as well.

I'm not sure that "nonperson" modeling of long reflection is possible, that sufficiently good prediction of long traces of thinking doesn't require modeling people well enough to qualify as morally relevant to a similar extent as concrete people performing that thinking in base reality. But here too considering many possible traces somewhat helps, making all possibilities real (morally valent) according to how much attention is paid to their details, which should follow their collectively self-defined legitimacy. In this frame, the more legitimate possible traces of long reflection become the utopia itself, rather than a nonperson computation planning it. Nonperson predictions of reflection's judgement might steer it a bit in advance of legitimacy or influence decisions, but possibly not much, lest they attain moral valence and start coloring the utopia through their content and not only consequences.
5_will_
On your second point, I think that MacAskill and Ord were more saying “It would be worth it to spend thousands of years figuring out moral philosophy / figuring out what to do with the cosmos, if that’s how long it takes to be ~sure we’ve reached the ‘correct’ answer before locking things in, on account of the astronomical waste argument” than “I literally predict it will take today-humans thousands of years to figure out moral philosophy, even if we make a serious and coordinated effort to do so.” Somewhat relatedly, quoting from the ‘Long Reflection Reading List’ I wrote earlier this year (fn. 4):

On your first point, I continue to be curious about your perspective. I basically agree with the following (written by Zach Stein-Perlman), but, based on what you said in your parentheses, it sounds like you view it as a bad plan? (I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect not solving AI philosophical competence (while having solved intent alignment) to lead to catastrophe (thus putting us outside the worlds in which x-risks are reliably ‘solved’ for), perhaps in the way Wei Dai has talked about?)

1. ^ We don't need these human-obsoleting AIs to be able to implement CEV. We want to be able to defer to them on tricky wisdom-loaded questions like what should we do about the overall AI situation? They can ask us questions as needed.

2. ^ To avoid being rushed by your own AI project, you also have to ensure that your AI can't be stolen and can't escape, so you have to implement excellent security and control.

I tried to make a similar argument here, and I'm not sure it landed.  I think the argument has since demonstrated even more predictive validity with e.g. the various attempts to build and restart nuclear power plants, directly motivated by nearby datacenter buildouts, on top of the obvious effects on chip production.

3yams
I've just read this post and the comments. Thank you for writing that; some elements of the decomposition feel really good, and I don't know that they've been done elsewhere.

I think discourse around this is somewhat confused, because you actually have to do some calculation on the margin, and need a concrete proposal to do that with any confidence. The straw-Pause rhetoric is something like "Just stop until safety catches up!" The overhang argument is usually deployed (as it is in those comments) to the effect of 'there is no stopping.' And yeah, in this calculation, there are in fact marginal negative externalities to the implementation of some subset of actions one might call a pause. The straw-Pause advocate really doesn't want to look at that, because it's messy to entertain counter-evidence to your position, especially if you don't have a concrete enough proposal on the table to assign weights in the right places.

Because it's so successful against straw-Pausers, the anti-pause people bring in the overhang argument like an absolute knockdown, when it's actually just a footnote to double check the numbers and make sure your pause proposal avoids slipping into some arcane failure mode that 'arms' overhang scenarios. That it's received as a knockdown is reinforced by the gearsiness of actually having numbers (and most of these conversations about pauses are happening in the abstract, in the absence of, i.e., draft policy). But... just because your interlocutor doesn't have the numbers at hand, doesn't mean you can't have a real conversation about the situations in which compute overhang takes on sufficient weight to upend the viability of a given pause proposal.

You said all of this much more elegantly here:

...which feels to me like the most important part. The burden is on folks introducing an argument from overhang risk to prove its relevance within a specific conversation, rather than just introducing the adversely-gearsy concept to justify safety-coded

Good catch, looks like that's from this revision, which looks like it was copied over from Arbital - some LaTeX didn't make it through.  I'll see if it's trivial to fix.

2RobertM
Should be fixed now.

The page isn't dead, Arbital pages just don't load sometimes (or take 15+ seconds).

RobertMΩ83116

I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds.  The evidence you present in each case is outputs generated by LLMs.

The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14.  I'm unsure about 5.

Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much?  Those seem like very... (read more)

The evidence you present in each case is outputs generated by LLMs.

The total evidence I have (and that everyone has) is more than behavioral. It includes

a) the transformer architecture, in particular the attention module,

b) the training corpus of human writing,

c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs),

d) as you say, the model's behavior, and

e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.

When I think about how... (read more)
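For readers unfamiliar with items (a) and (c) in the list above, here is a minimal toy sketch of what the attention module and "recursive calling upon its own outputs and history of QKV vector representations" mean mechanically. Everything here (the dimensions, the random weight matrices, the single head, no MLP or layer norm) is an illustrative assumption, not any particular model's internals:

```python
# Toy sketch: single-head scaled dot-product attention with a key/value cache,
# run autoregressively (each step's output is fed back in as the next input).
import numpy as np

d = 8                                     # per-head embedding dimension (toy size)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []                 # cache of past key/value vectors ("history")

def decode_step(x):
    """One decoding step: the new token's query attends over all cached positions."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # scaled dot-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over past positions
    return weights @ V                    # attention output for this position

x = rng.standard_normal(d)
for _ in range(5):                        # "recursive calling upon its own outputs"
    x = decode_step(x)
print(x.shape)                            # (8,)
```

The point is just that each step's output is computed by attending over cached representations of everything produced so far, and is then fed back in as the next input.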

My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences.

I recommend reading the Youtube comments on his recorded podcasts, rather than e.g. Twitter commentary from people with a pre-existing adversarial stance to him (or AI risk questions writ large).

6Seth Herd
Good suggestion, thanks and I'll do that. I'm not commenting on those who are obviously just grinding an axe; I'm commenting on the stance toward "doomers" from otherwise reasonable people. From my limited survey the brand of x-risk concern isn't looking good, and that isn't mostly a result of the amazing rhetorical skills of the e/acc community ;)

On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollster were messing up, especially given that he won't share the surveys.  On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria, where an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive to accuracy (skin in the game), and as soon as someo... (read more)

Norvid on Twitter made the apt point that we will need to see the actual private data before we can really judge. Not unusual for lucky people to backrationalize their luck as a sure win.

I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.

Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar.  i.e. for technical research, many people that have gone through MATS and then been hired at or founded their own safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds.  Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research.  Policy might seem a b... (read more)

(We switched back to shipping Calibri above Gill Sans Nova pending a fix for the horrible rendering on Windows, so if Ubuntu has Calibri, it'll have reverted back to the previous font.)

2DanielFilan
I believe I'm seeing Gill Sans? But when I google "Calibri" I see text that looks like it's in Calibri, so that's confusing.

Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.

This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).

Re: predefined evaluations, the old policy nei... (read more)

Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.

That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained sufficiently much wigg... (read more)

But that's a communication issue....not a truth issue.

Yes, and Logan is claiming that arguments which cannot be communicated to him in no more than two sentences suffer from a conjunctive complexity burden that renders them "weak".

That's not trivial. There's no proof that there is such a coherent entity as "human values", there is no proof that AIs will be value-driven agents, etc, etc. You skipped over 99% of the Platonic argument there.

Many possible objections here, but of course spelling everything out would violate Logan's request for a short argument.... (read more)

-3TAG
@Logan Zoellner being wrong doesn't make anyone else right. If the actual argument is conjunctive and complex, then all the component claims need to be high probability. That is not the case. So Logan is right for not quite the right reasons -- it's not length alone. And it wouldn't help anyway.

I have read the Sequences, and there is nothing resembling a proof, or even strong argument, for the claim about coherent human values. Ditto the standard claims about utility functions, agency, etc. Reading the sequence would allow him to understand the LessWrong collective, but should not persuade him. Whereas the same amount of time could, more reasonably, be spent learning how AI actually works. Tracking reality is a thing you have to put effort into, not something you get for free, by labelling yourself a rationalist.

The original Sequences did not track reality, because they are not evidence-based -- they are not derived from academic study or industry experience. Yudkowsky is proud that they are "derived from the empty string" -- his way of saying that they are armchair guesswork. His armchair guesses are based on Bayes, von Neumann rationality, utility maximisation, brute force search, etc., which isn't the only way to think about AI, or particularly relevant to real world AI. But it does explain many doom arguments, since they are based on the same model -- the kinds of argument that immediately start talking about values and agency.

But of course that's a problem in itself. The short doomer arguments use concepts from the Bayes/von Neumann era in a "sleepwalking" way, out of sheer habit, given that the basis is doubtful. Current examples of AIs aren't agents, and it's doubtful whether they have values. It's not irrational to base your thinking on real world examples, rather than speculation. In addition, they haven't been updated in the light of new developments, something else you have to do to track reality. Tracking reality has a cost -- you have to

A strong good argument has the following properties:

  • it is logically simple (can be stated in a sentence or two)
    • This is important, because the longer your argument, the more details that have to be true, and the more likely that you have made a mistake.  Outside the realm of pure-mathematics, it is rare for an argument that chains together multiple "therefore"s to not get swamped by the fact that

No, this is obviously wrong.

  1. Argument length is substantially a function of shared premises.  I would need many more sentences to convey a novel argument a
... (read more)
4TAG
A stated argument could have a short length if it's communicated between two individuals who have common knowledge of each other's premises, as opposed to the "Platonic" form, where every load-bearing component is made explicit and there is nothing extraneous. But that's a communication issue....not a truth issue. A conjunctive argument doesn't become likelier because you don't state some of the premises. The length of the stated argument has little to do with its likelihood. How true an argument is, how easily it persuades another person, and how easy it is to understand have little to do with each other. The likelihood of an ideal argument depends on the likelihood of its load-bearing premises... both how many there are, and their individual likelihoods.

Public communication, where you have no foreknowledge of shared premises, needs to keep the actual form closer to the Platonic form. Public communication is obviously the most important kind when it comes to avoiding AI doom.

Correct. The fact that you don't have to explicitly communicate every step of an argument to a known recipient doesn't stop the overall probability of a conjunctive argument from depending on the number, and individual likelihood, of the steps of the Platonic version, where everything necessary is stated and nothing unnecessary is stated.

Correct. Stated arguments can contain elements that are explanatory, or otherwise redundant for an ideal recipient. Nonetheless, there is a Platonic form, that does not contain redundant elements or unstated, load-bearing steps.

That's not trivial. There's no proof that there is such a coherent entity as "human values", there is no proof that AIs will be value-driven agents, etc, etc. You skipped over 99% of the Platonic argument there.

This is a classic example of failing to communicate with people outside the bubble. Your assumptions about values and agency just aren't shared by the general public or political leaders.

PS. @Logan Zoellner That's se
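As a toy illustration of the quantitative point being made here (assuming, purely for illustration, that the load-bearing premises are independent):

\[
P(\text{argument}) \;=\; \prod_{i=1}^{n} P(\text{premise}_i),
\qquad \text{e.g. } n = 10,\ P(\text{premise}_i) = 0.95 \;\Rightarrow\; 0.95^{10} \approx 0.60 .
\]

Leaving premises unstated shrinks the written argument but leaves this product unchanged; it only changes how much of the Platonic form the reader sees.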
3Logan Zoellner
A fact cannot be self-evidently true if many people disagree with it.

Credit where credit is due: this is much better in terms of sharing one's models than one could say of Sam Altman, in recent days. 

As noted above the footnotes, many people at Anthropic reviewed the essay.  I'm surprised that Dario would hire so many people he thinks need to "touch grass" (because they think the scenario he describes in the essay sounds tame), as I'm pretty sure that describes a very large percentage of Anthropic's first ~150 employees (certainly over 20%, maybe 50%).

My top hypothesis is that this is a snipe meant to signal Dario... (read more)

(I work at Anthropic.) My read of the "touch grass" comment is informed a lot by the very next sentences in the essay:

But more importantly, tame is good from a societal perspective. I think there's only so much change people can handle at once, and the pace I'm describing is probably close to the limits of what society can absorb without extreme turbulence.

which I read as saying something like "It's plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn't want thing... (read more)

Credit where credit is due: this is much better in terms of sharing one's models than one could say of Sam Altman, in recent days. 

I mean I guess this is literally true, but to be clear I think it's broadly not much less deceptive (edit: or at least, 'filtered').

I remind you of this Thiel quote:

I think the pro-AI people in Silicon Valley are doing a pretty bad job on, let’s say, convincing people that it’s going to be good for them, that it’s going to be good for the average person, that it’s going to be good for our society. And if it all ends up bei

... (read more)
RobertMΩ143414

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?  Most of my models for how we might go extinct in next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather th... (read more)

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?

a) If we go extinct from a loss of control event, I count that as extinction from a loss of control event, accounting for the 35% probability mentioned in the post.

b) If we don't have a loss of control event but still go extinct from industrial dehumanization, I count that as extinction caused by industrial dehumanization caused by successionism, accounting for the additional 50% probabilit... (read more)

Yeah, the essay (I think correctly) notes that the most significant breakthroughs in biotech come from the small number of "broad measurement tools or techniques that allow precise but generalized or programmable intervention", which "are so powerful precisely because they cut through intrinsic complexity and data limitations, directly increasing our understanding and control".

Why then only such systems limited to the biological domain?  Even if it does end up being true that scientific and technological progress is substantially bottlenecked on real-... (read more)

My answer to this question of why Dario thought this:

Yeah, the essay (I think correctly) notes that the most significant breakthroughs in biotech come from the small number of "broad measurement tools or techniques that allow precise but generalized or programmable intervention", which "are so powerful precisely because they cut through intrinsic complexity and data limitations, directly increasing our understanding and control".

Why then only such systems limited to the biological domain?

Is because this is the area that Dario has most experience in being a... (read more)

Not Mitchell, but at a guess:

  • LLMs really like lists
  • Some parts of this do sound a lot like LLM output:
    • "Complex Intervention Development and Evaluation Framework: A Blueprint for Ethical and Responsible AI Development and Evaluation"
    • "Addressing Uncertainties"
  • Many people who post LLM-generated content on LessWrong often wrote it themselves in their native language and had an LLM translate it, so it's not a crazy prior, though I don't see any additional reason to have guessed that here.

Having read more of the post now, I do believe it was at least mostly human... (read more)

I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments.  My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").

4cubefox
He also argued that digital intelligence is superior to analog human intelligence because, he said, many identical copies can be trained in parallel on different data, and then they can exchange their changed weights. He also said biological brains are worse because they probably use a learning algorithm that is less efficient than backpropagation.

I don't really see how this is responding to my comment.  I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something "an alignment technique" with no further detail is not helping uninformed readers understand what "RLHF" is better (but rather worse).

Again, please model an uninformed reader: how does the claim "RLHF is an alignment technique" constrain their expectations?  If the thing you want to say is that some of the people who invented RLHF saw it as an ... (read more)

2Noosphere89
Yes, this is what I wanted to say here:

This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions.  There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.

Two reasons:

First, the change made the sentence much worse to read.  It might not have been strictly ungrammatical, but it was bad English.

Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description.  What does "is an alignment technique" mean?  Despite being in the same sentence as "is a machine learning technique", it is not serving anything like the same role, in terms of the implicit claims it makes.  Intersubjective agreement ... (read more)

2cubefox
I think it is highly uncontroversial and even trivial to call RLHF an alignment technique, given that it is literally used to nudge the model away from "bad" responses and toward "good" responses. It seems the label "alignment technique" could only be considered inappropriate here for someone who has a nebulous science fiction idea of alignment as a technology that doesn't currently exist at all, like it was seen when Eliezer originally wrote the sequences. I think it's obvious that this view is outdated now.
0Noosphere89
I admit I was not particularly optimizing for much detail here. I use the term "alignment technique" essentially to mean a technique that was invented to make AIs aligned to our values and that attempts to reduce existential risk. Note that it doesn't mean that it will succeed, or that it's a very good technique, or one we should solely rely on, because I make no claim on whether it does succeed or not, just that it's often discussed in the context of alignment of AIs. I consider a lot of the disagreement over RLHF being an alignment technique to be essentially a disagreement about whether it actually works at all, not about whether it's an actual alignment technique being used in labs.
2RobertM
This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions.  There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.

Almost no specific (interesting) output is information that's already been generated by any model, in the strictest sense.

2ChristianKl
If I tell a model to write me a book summary, that book summary can be specific interesting output without containing any new information. If I want to know how to build a bomb, there are already plenty of sources out there on how to build a bomb. The information is already accessible from those sources. When an LLM synthesizes the existing information in its training data to help someone build a bomb, it's not inventing new information. Deep fakes aren't about simply repeating information that's already in the training data.

So the argument would be that the lawmaker chose to say "accessible" because they want to allow LLMs to synthesize the existing information in their training data and repeat it back to the user, but that does not mean that the lawmaker had an intention to allow the LLMs to produce new information that gets used to create harm, even if there are other ways to create that information.

reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative

Seems like it'd pretty obviously cover information generated by non-covered models that are routinely used by many ordinary people (as open source image models currently are).

As a sidenote, I think the law is unfortunately one of those pretty cursed domains where it's hard to be very confident of anything as a layman without doing a lot of your own research, and you can't even look at experts speaking publicly on the subject since they're... (read more)

2ChristianKl
Deep fake porn of a particular person is not information that's generated by non-covered models that are routinely used by many ordinary people even if the models could generate the porn if instructed to do so. 
Answer by RobertM145

Notwithstanding the tendentious assumption in the other comment thread that courts are maximally adversarial processes bent on misreading legislation to achieve their perverted ends, I would bet that the relevant courts would not in fact rule that a bunch of deepfaked child porn counted as "Other grave harms to public safety and security that are of comparable severity to the harms described in subparagraphs (A) to (C), inclusive", where those other things are "CBRN > mass casualties", "cyberattack on critical infra", and "autonomous action > mass... (read more)

4ChristianKl
Child porn is frequently used to justify all sorts of highly invasive privacy interventions. ChatGPT* seems to think it would be a public safety threat under Californian law.

Existing models can do pictures but not video. A complex multimodal model might be able to do video porn. Better models might produce deep fake audio with less data and at nearer to how the person actually speaks.

There's also the question of whether deep fake porn or faked audio is "accessible information" in the sense of that paragraph (2) (A). That paragraph clearly absolves a model if you can read how to build a bomb in a textbook that's already existing. ChatGPT* does seem to think that pictures and audio would fall under information, but it's less clear to me when it comes to the word "accessible".

* I think ChatGPT has a much better understanding of Californian law than me; at the same time it might also be wrong, and I'm happy to hear from someone with actual legal experience if ChatGPT interprets words wrong.
4cfoster0
I’m not sure if you intended the allusion to “the tendentious assumption in the other comment thread that courts are maximally adversarial processes bent on on misreading legislation to achieve their perverted ends”, but if it was aimed at the thread I commented on… what? IMO it is fair game to call out as false the claim that even if deepfake harms wouldn’t fall under this condition. Local validity matters. I agree with you that deepfake harms are unlikely to be direct triggers for the bill’s provisions, for similar reasons as you mentioned.

It does not actually make any sense to me that Mira wanted to prevent leaks, and therefore didn't even tell Sam that she was leaving ahead of time.  What would she be afraid of, that Sam would leak the fact that she was planning to leave... for what benefit?

Possibilities:

  • She was being squeezed out, or otherwise knew her time was up, and didn't feel inclined to make it a maximally comfortable parting for OpenAI.  She was willing to eat the cost of her own equity potentially losing a bunch of value if this derailed the ongoing investment round, as
... (read more)

Of course it doesn't make sense. It doesn't have to. It just has to be a face-saving excuse for why she pragmatically told him at the last possible minute. (Also, it's not obvious that the equity round hasn't basically closed.)

We recently had a security incident where an attacker used an old AWS access key to generate millions of tokens from various Claude models via AWS Bedrock. While we don't have any specific reason to think that any user data was accessed (and some reasons[1] to think it wasn't), most possible methods by which this key could have been found by an attacker would also have exposed our database credentials to the attacker. We don't know yet how the key was leaked, but we have taken steps to reduce the potential surface area in the future and rotated releva... (read more)
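For concreteness, here is a hedged sketch (not our actual tooling; the 90-day threshold and permissions setup are arbitrary assumptions for the example) of the kind of check that reduces the "old access key" surface area, using boto3's IAM API:

```python
# Illustrative sketch only: flag long-lived active IAM access keys.
# Assumes AWS credentials with iam:ListUsers / iam:ListAccessKeys permissions.
from datetime import datetime, timezone, timedelta

import boto3

MAX_AGE = timedelta(days=90)  # arbitrary threshold for this example

iam = boto3.client("iam")
now = datetime.now(timezone.utc)

for user in iam.list_users()["Users"]:
    keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
    for key in keys:
        age = now - key["CreateDate"]
        if key["Status"] == "Active" and age > MAX_AGE:
            print(f"{user['UserName']}: key {key['AccessKeyId']} is {age.days} days old")
```

(A real version would handle pagination and feed into rotation/alerting; the general point is just that stale long-lived credentials are the thing to hunt down.)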

The reason for these hacks seems pretty interesting: https://krebsonsecurity.com/2024/10/a-single-cloud-compromise-can-feed-an-army-of-ai-sex-bots/ https://permiso.io/blog/exploiting-hosted-models

Apparently this isn't a simple theft of service as I had assumed, but it is caused by the partial success of LLM jailbreaks: hackers are now incentivized to hack any API-enabled account they can in order to use it not on generic LLM uses, but specifically on NSFW & child porn chat services, to both drain & burn accounts.

I had been a little puzzled why anyo... (read more)

This might be worth pinning as a top-level post.

I agree with your top-level comment but don't agree with this.  I think the swipes at midwits are bad (particularly on LessWrong) but think it can be very valuable to reframe basic arguments in different ways, pedagogically.  If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good (if spiky, with easily trimmed downside).

And I do think "attempting to impart a basic intuition that might let people avoid certain classes of errors" is an appropriate shape of post for LessWrong, to the extent that it's validly argued.

If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good

This seems reasonable in isolation, but it gets frustrating when the former is all Eliezer seems to do these days, with seemingly no attempt at the latter. When all you do is retread these dunks on "midwits" and show apathy/contempt for engaging with newer arguments, it makes it look like you don't actually have an i... (read more)

as applied to current foundation models it appears to do so

I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it.  (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)

Of course, if you assume that AIs will be able to do whatever they want without any resistance whatsoever from us, then you can of course conclude that they will be able to achieve any goals they want without needing to compromise with us. If killing humans doesn't cost anything, then yes, the benefits of killing humans, however small, will be higher, and thus it will be rational for AIs to kill humans. I am doubting the claim that the cost of killing humans will be literally zero.

See Ben's comment for why the level of nanotech we're talking about implies ... (read more)

I would also add: having more energy in the immediate future means more probes sent out faster to more distant parts of the galaxy, which may be measured in "additional star systems colonized before they disappear outside the lightcone via universe expansion". So the benefits are not trivial either.

I think maybe I derailed the conversation by saying "disassemble", when really "kill" is all that's required for the argument to go through.  I don't know what sort of fight you are imagining humans having with nanotech that imposes substantial additional costs on the ASI beyond the part where it needs to build & deploy the nanotech that actually does the "killing" part, but in this world I do not expect there to be a fight.  I don't think it requires being able to immediately achieve all of your goals at zero cost in order for it to be cheap for the ASI to do that, conditional on it having developed that technology.

4Matthew Barnett
The additional costs of human resistance don't need to be high in an absolute sense. These costs only need to be higher than the benefit of killing humans for your argument to fail.

It is likewise very easy for the United States to invade and occupy Costa Rica—but that does not imply that it is rational for the United States to do so, because the benefits of invading Costa Rica are presumably even smaller than the costs of taking such an action, even without much unified resistance from Costa Rica.

What matters for the purpose of this argument is the relative magnitude of costs vs. benefits, not the absolute magnitude of the costs. It is insufficient to argue that the costs of killing humans are small. That fact alone does not imply that it is rational to kill humans, from the perspective of an AI. You need to further argue that the benefits of killing humans are even larger to establish the claim that a misaligned AI should rationally kill us.

To the extent your statement that "I don't expect there to be a fight" means that you don't think humans can realistically resist in any way that imposes costs on AIs, that's essentially what I meant to respond to when I talked about the idea of AIs being able to achieve their goals at "zero costs".

Of course, if you assume that AIs will be able to do whatever they want without any resistance whatsoever from us, then you can of course conclude that they will be able to achieve any goals they want without needing to compromise with us. If killing humans doesn't cost anything, then yes I agree, the benefits of killing humans, however small, will be higher, and thus it will be rational for AIs to kill humans. I am doubting the claim that the cost of killing humans will be literally zero.

Even if this cost is small, it merely needs to be larger than the benefits of killing humans, for AIs to rationally avoid killing humans.

Edit: a substantial part of my objection is to this:

If it is possible to trivially fill in the rest of his argument, then I think it is better for him to post that, instead of posting something that needs to be filled-in, and which doesn't actually back up the thesis that people are interpreting him as arguing for.

It is not always worth doing a three-month research project to fill in many details that you have already written up elsewhere in order to locally refute a bad argument that does not depend on those details.  (The current post does loc... (read more)

There does not yet exist a single ten-million-word treatise which provides an end-to-end argument of the level of detail you're looking for.

To be clear, I am not objecting to the length of his essay. It's OK to be brief. 

I am objecting to the vagueness of the argument. It follows a fairly typical pattern of certain MIRI essays by heavily relying on analogies, debunking straw characters, using metaphors rather than using clear and explicit English, and using stories as arguments, instead of concisely stating the exact premises and implications. I am ob... (read more)

Ok, but you can trivially fill in the rest of it, which is that Eliezer expects ASI to develop technology which makes it cheaper to ignore and/or disassemble humans than to trade with them (nanotech), and that there will not be other AIs around at the time which 1) would be valuable trade partners for the AI that develops that technology (which gives it that decisive strategic advantage over everyone else) and 2) care about humans at all.  I don't think discussion of when and why nation-states go to war with each other is particularly illuminating given the threat model.

If it is possible to trivially fill in the rest of his argument, then I think it is better for him to post that, instead of posting something that needs to be filled-in, and which doesn't actually back up the thesis that people are interpreting him as arguing for. Precision is a virtue, and I've seen very few essays that actually provide this point about trade explicitly, as opposed to essays that perhaps vaguely allude to the points you have given, as this one apparently does too.

In my opinion, your filled-in argument seems to be a great example of why pr... (read more)

Pascal's wager is Pascal's wager, no matter what box you put it in.  You could try to rescue it by directly making the argument that we should expect a greater measure of "entities with resources that they are willing to acausally trade for things like humanity continuing to exist" compared to entities with the opposite preferences, and though I haven't seen a rigorous case for that, it seems possible, but that's not sufficient; you need the expected measure of entities that have that preference to be large enough that dealing with the transaction costs/uncertainty of acausally trading at all makes sense.  And that seems like a much harder case to make.

In general, Intercom is the best place to send us feedback like this, though we're moderately likely to notice a top-level shortform comment.  Will look into it; sounds like it could very well be a bug.  Thanks for flagging it.

1Ozyrus
Thanks for the reply. I didn't find Intercom on mobile - maybe a bug as well?

If you include Facebook & Google (i.e. the entire orgs) as "frontier AI companies", then 6-figures.  If you only include Deepmind and FAIR (and OpenAI and Anthropic), maybe order of 10-15k, though who knows what turnover's been like.  Rough current headcount estimates:

Deepmind: 2600 (as of May 2024, includes post-Brain-merge employees)

Meta AI (formerly FAIR): ~1200 (unreliable sources; seems plausible, but is probably an implicit undercount since they almost certainly rely a lot on various internal infrastructure used by all of Facebook's eng... (read more)
