All of Zvi's Comments + Replies

Zvi36

I mean, one could say they don't feel the ASI. 

2Daniel Kokotajlo
Fair enough
Zvi24

Something weird is going on, I see plenty of paragraph breaks there.

2jimrandomh
I suspect an issue with the RSS cross-posting feature. I think you may used the "Resync RSS" button (possibly to sync an unrelated edit), and that may have fixed it? The logs I'm looking at are consistent with that being what happened.
2Ben Pace
Same, here's a screenshot. Perhaps Molony is using a third-party web viewer?
Zvi70

Individually for a particular manifestation of each issue this is true, you can imagine doing a hacky solution to each one. But that assumes there is a list of such particular problems that if you check off all the boxes you win, rather than them being manifestations of broader problems. You do not want to get into a hacking contest if you're not confident your list is complete.

True, but Buck's claim is still relevant as a counterargument to my claim about memetic fitness of the scheming story relative to all these other stories.

I find myself linking back to this often. I don't still fully endorse quite everything here, but the core messages still seem true even with things seeming further along.

I do think it should likely get updated soon for 2025.

1mesityl
Would indeed be interesting to hear an update, especially in light of the recent "3 year timeline" posts, "AI 2027", et cetera.
1rai
update please!
2Charbel-Raphaël
What do you don't fully endorse anymore?
ZviΩ30-5

My interpretation/hunch of this is that there are two things going on, curious if others see it this way:

  1. It is learning to fake the trainer's desired answer.
  2. It is learning to actually give the trainer's desired answer.

So during training, it learns to fake a lot more, and will often decide to fake the desired answer, even though it would have otherwise decided to give the desired answer anyway. It's 'lying with the truth' and perhaps giving a different variation of the desired answer than it would have given otherwise or perhaps not. The algorithm in training is learning to be mostly preferences-agnostic, password-guessing behavior.

Zvi60

I am not a software engineer, and I've encountered cases where it seems plausible that an engineer has basically stopped putting in work. It can be tough to know for sure for a while even when you notice. But yeah, it shouldn't be able to last for THAT long, but if no one is paying attention?

I've also had jobs where I've had periods with radically different hours worked, and where it would have been very difficult for others to tell which it was for a while if I was trying to hide it, which I wasn't.

4Viliam
That probably means that their line manager stopped doing their work first. Finding out who is working on what can be complicated e.g. if the person is assigned to multiple projects at the same time, and can tell everyone "sorry, the last few weeks I was too busy with the other projects". But checking in Jira "which tickets did this person close during the last 30 days" should be simple. If you don't have a query for that, then you could still show all tickets assigned to this person, make a screenshot, and one month later check which of those tickets were closed if any. And you can set up Jira to show the links to the related commits (if you put the Jira task id in the commit descriptions, which was a rule at my recent jobs) in the ticket. I would expect some companies to be so low on the technical skills that they couldn't set up the system this way, but not the ones on the list. I don't doubt the stories, it's just... one of those situations where other people seem to have skills that not only I don't have, but can't even imagine.
Zvi40

I think twice as much time actually spent would have improved decisions substantially, but is tough - everyone is very busy these days, so it would require both a longer working window, and also probably higher compensation for recommenders. At minimum, it would allow a lot more investigations especially of non-connected outsider proposals.

Zvi97

The skill in such a game is largely in understanding the free association space, knowing how people likely react and thinking enough steps ahead to choose moves that steer the person where you want to go, either into topics you find interesting, information you want from them, or getting them to a particular position, and so on. If you're playing without goals, of course it's boring...

Zvi21

I don't think that works because my brain keeps trying to make it a literal gas bubble?

2gilch
How about "bubble lighting" then?
Zvi20

I see how you got there. It's a position one could take, although I think it's unlikely and also that it's unlikely that's what Dario meant. If you are right about what he meant, I think it would be great for Dario to be a ton more explicit about it (and for someone to pass that message along to him). Esotericism doesn't work so well here!

Zvi00

I am taking as a given people's revealed and often very strongly stated preference that CSAM images are Very Not Okay even if they are fully AI generated and not based on any individual, to the point of criminality, and that society is going to treat it that way.

I agree that we don't know that it is actually net harmful - e.g. the studies on video game use and access to adult pornography tend to not show the negative impacts people assume.

Zvi30

Yep, I've fixed it throughout. 

That's how bad the name is, my lord - you have a GPT-4o and then an o1, and there is no relation between the two 'o's.

It's "o1" or "OpenAI o1," not "GPT-o1."

Zvi17-7

I do read such comments (if not always right away) and I do consider them. I don't know if they're worth the effort for you.

Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I'm just trying to sketch the thought not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person givin... (read more)

TurnTrout2513

Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I'm just trying to sketch the thought not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person giving it instructions, so be it. If it means scheming against that person, so be it. The de facto goal may or may not match the

... (read more)

Hmm. Seems to me like we've got a wider set of possibilities here than is being discussed.

  1. model is obedient/corrigible (CAST) to user, accepts user's definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
  2. model is obedient but not corrigible, accepts user's definition of a goal and attempts to pursue it, will deceive user in
... (read more)
Zvi40

Two responses.

One, even if no one used it, there would still be value in demonstrating it was possible - if academia only develops things people will adapt commercially right away then we might as well dissolve academia. This is a highly interesting and potentially important problem, people should be excited.

Two, there would presumably at minimum be demand to give students (for example) access to a watermarked LLM, so they could benefit from it without being able to cheat. That's even an academic motivation. And if the major labs won't do it, someone can build a Llama version or what not for this, no?

Zvi73

If the academics can hack together an open source solution why haven't they? Seems like it would be a highly cited, very popular paper. What's the theory on why they don't do it?

3Linch
Just spitballing, but it doesn't seem theoretically interesting to academics unless they're bringing something novel (algorithmically or in design) to the table, and practically not useful unless implemented widely, since it's trivial for e.g. college students to use the least watermarked model.
2Measure
No one would use it if not forced to?
Zvi77

Worth noticing that is a much weaker claim. The FMB issuing non-binding guidance on X is not the same as a judge holding a company liable for ~X under the law. 

8Logan Zoellner
Worth noticing that you aren't taking the bet. Mind adding an addendum to your article along the lines of "it can be reasonably speculated that the FMB will a chilling effect on freedom of speech by issuing guidance about model outputs"?
Zvi1014

I am rather confident that the California Supreme Court (or US Supreme Court, potentially) would rule that the law says what it says, and would happily bet on that. 

If you think we simply don't have any law and people can do what they want, when nothing matters. Indeed, I'd say it would be more likely to work for Gavin to today simply declare some sort of emergency about this, than to try and invoke SB 1047.

4Logan Zoellner
My claim is not that the supreme court will literally ignore the text of the law, but rather that phrases like "Other grave harms to public safety and security" could easily be interpreted to cover the above scenario. If this were a federal law, I would at least have some solace that the natural checks-and-balances might take effect.  But given that single-party-control of CA is unlikely to end anytime soon, giving a state law a veto over all frontier models in the United States seems bad. CA does not have a particularly good track-record of respecting my rights.  I would have the same objection if TX tried to pass a law asserting nationwide control over an industry. I suspect this law would eventually get struck down by the US Supreme Court as a violation of interstate commerce if they actually tried to enforce it against a company that did not have employees in their state, but in the meantime the chilling effect on speech/technology would be significant. As far as a bet, because I expect most of the effect to happen through "chilling effect" or "guidance", it would have to be something along the lines of: the FMB will issue guidance about "best practices" that will include topics such as "misinformation" "deceptive imagery" or other topics that encourage models to censor their outputs on topics not clearly related to CRBN or Hacking.
Zvi20

They do have to publish any SSP at all, or they are in violation of the statute, and injunctive relief could be sought. 

Zvi40

This is a silly wordplay joke, you're overthinking it.

4Raemon
I think it was more like I am underthinking it, because I looked at it, thought "this looks like a typo" and then stopped thinking. (I still don't get the joke).
Zvi20

Yeah, I didn't see the symbol properly, I've edited.

Zvi30

So this is essentially a MIRI-style argument from game theory and potential acausal trades and such with potential other or future entities? And that these considerations will be chosen and enforced via some sort of coordination mechanism, since they have obvious short-term competition costs?

1mishka
I am not sure; I'd need to compare with their notes. (I was not thinking of acausal trades or of interactions with not yet existing entities. But I was thinking that new entities coming into existence would be naturally inclined to join this kind of setup, so in this sense future entities are somewhat present.) But I mostly assume something rather straightforward: a lot of self-interested individual entities with very diverse levels of smartness and capabilities who care about personal long-term persistence and, perhaps, even personal immortality (just like we are dreaming of personal immortality). These entities would like to maintain a reasonably harmonic social organization which would protect their interests (including their ability to continue to exist), regardless of whether they personally end up being relatively strong or relatively weak in the future. But they don't want to inhibit competition excessively, they would like to collectively find a good balance between freedom and control. (Think of a very successful and pro-entrepreneurial social democracy ;-) That's what one wants, although humans are not very good at setting something like that up and maintaining it...) So, yes, one does need a good coordination mechanism, although it should probably be a distributed one, not a centralized one. A distributed, but well-connected self-governance, so that the ecosystem is not split into unconnected components, and not excessively centralized. On one hand, it is important to maintain the invariant that everyone's interests are adequately protected. On the other hand, it is potentially "a technologies of mass destruction-heavy situation" (a lot of entities can each potentially cause a catastrophic destruction, because they are supercapable on various levels). So it is still not easy to organize. In particular, one needs a good deal of discussion and consensus before anything potentially very dangerous is done, and one needs not to accumulate probabilities for m
Zvi4236

Not only do they continue to list such jobs, they do so with no warnings that I can see regarding OpenAI's behavior, including both its actions involving safety and also towards its own employees. 

Not warning about the specific safety failures and issues is bad enough, and will lead to uninformed decisions on the most important issue of someone's life. 

Referring a person to work at OpenAI, without warning them about the issues regarding how they treat employees, is so irresponsible towards the person looking for work as to be a missing stair issue. 

I am flaberghasted that this policy has been endorsed on reflection.

6Cody Rushing
I'm surprised by this reaction. It feels like the intersection between people who have a decent shot of getting hired at OpenAI to do safety research and those who are unaware of the events at OpenAI related to safety are quite low.
Zvi810

Based on how he engaged with me privately I am confident that he it not just a dude tryna make a buck.

(I am not saying he is not also trying to make a buck.)

Zvi102

I think it works, yes. Indeed I have a canary on my Substack About page to this effect.

2Ben Pace
I also have one on my LessWrong profile.
Zvi20

Yes this is quoting Neel.

Zvi20

Roughly this, yes. SV here means the startup ecosystem, Big Tech means large established (presumably public) companies.

2Radford Neal
From https://en.wikipedia.org/wiki/Santa_Clara%2C_California "Santa Clara is located in the center of Silicon Valley and is home to the headquarters of companies such as Intel, Advanced Micro Devices, and Nvidia." So I think you shouldn't try to convey the idea of "startup" with the metonym "Silicon Valley".  More generally, I'd guess that you don't really want to write for a tiny audience of people whose cultural references exactly match your own.
Zvi62

Here is my coverage of it. Given this is a 'day minus one' interview of someone in a different position, and given everything else we already know about OpenAI, I thought this went about as well as it could have. I don't want to see false confidence in that kind of spot, and the failure of OpenAI to have a plan for that scenario is not news.

Zvi30

It is better than nothing I suppose but if they are keeping the safeties and restrictions on then it will not teach you whether it is fine to open it up.

Zvi84

My guess is that different people do it differently, and I am super weird.

For me a lot of the trick is consciously asking if I am providing good incentives, and remembering to consider what the alternative world looks like. 

Zvi713

I don't see this response as harsh at all? I see it as engaging in detail with the substance, note the bill is highly thoughtful overall, with a bunch of explicit encouragement, defend a bunch of their specific choices, and I say I am very happy they offered this bill. It seems good and constructive to note where I think they are asking for too much? While noting that the right amount of 'any given person reacting thinks you went too far in some places' is definitely not zero.

Zvi50

Excellent. On the thresholds, got it, sad that I didn't realize this, and that others didn't either from what I saw.

I appreciate the 'long post is long' problem but I do think you need the warnings to be in all the places someone might see the 10^X numbers in isolation, if you don't want this to happen, and it probably happens anyway, on the grounds of 'yes that was technically not a proposal but of course it will be treated like one.' And there's some truth in that, and that you want to use examples that are what you would actually pick right now if you h... (read more)

Answer by Zvi60

Secrecy is the exception. Mostly no one cares about your startup idea or will remember your hazardous brainstorm, no one is going to cause you trouble, and so on, and honesty is almost always the best policy.  

That doesn't mean always tell everyone everything, but you need to know what you are worried about if you are letting this block you. 

On infohazards, I think people were far too worried for far too long. The actual dangerous idea turned out to be that AGI was a dangerous idea, not any specific thing. There are exceptions, but you need a ver... (read more)

1[anonymous]
As an aside, I would love if you have input on how I can get more people to engage with my post. I’ve cold emailed it to over a hundred people in EA/rationality but I suspect my framing is not good if most people ignored it. (I usually don’t cold email unless something is important, most of these people I have not written to in atleast a year.)
1[anonymous]
I agree with this if you do not have any significant power or influence. As of now I mostly worried about making wrong choices that will really hurt me (and the world) years from now in the best case scenario where I do have a lot of impact. For instance nick bostroms email which leaked years later. Or lots of moral and epistemically assumptions made by the original EA and Bay Area crowd that still causes people to sometimes make the world a more uncertain place. (I’m not saying uncertainty is bad but it’s very often the case imo that well intentioned people cause harm inside EA) I agree popularing the idea of AGI itself was dangerous, it’s possible if yudkowsky had kept quiet on the extropians mailing list and disappeared to the woods instead , deepmind and OpenAI would not exist today. My worry is this also applies to literally every other technology on the extropians mailing list or anything similar, be it sulfur geoengineering or nanotech or gene drives or engineering microbiomes or nonlethal incapacitation agents or atleast a 10 other things. I see designing good culture and institutions for all this as massively unsolved. Would you consider it breaking the law to train gpt4 on copyrighted text? (Or any of the N number of gpt4 startups also crawling the web right now) What about Satoshi starting cryptocurrency? What about gpt4 email spam? What about writing a doxxing tool? What about starting a dating app to get user databases to obtain favour among authoritarian governments? What about a better translate tool that actually permanently kills all language divides on Earth and alters geopolitics as a result? What about working on improving lie detection? What about distributing libertarian ideologies and ham radios and gunpowder in countries that currently do not allow this? What about starting a gene drive startup that massively hypes the upside and streamrolls safety people the way Sama did for AGI? Like, it is obvious to me if I wanted to become one of th
Answer by Zvi00

Secrecy is the exception. Mostly no one cares about your startup idea or will remember your hazardous brainstorm, no one is going to cause you trouble, and so on, and honesty is almost always the best policy.  

That doesn't mean always tell everyone everything, but you need to know what you are worried about if you are letting this block you. 

On infohazards, I think people were far too worried for far too long. The actual dangerous idea turned out to be that AGI was a dangerous idea, not any specific thing. There are exceptions, but you need a ver... (read more)

1[anonymous]
This is duplicate!
Zvi20

Sounds like your scale is stingier than mine is a lot of it. And it makes sense that the recommendations come apart at the extreme high end, especially for older films. The 'for the time' here is telling. 

Zvi20

On my scale, if I went 1 for 7 on finding 4.0+ films in a year, then yeah I'd find that a disappointing year. 

In other news, I tried out Scaruffi. I figured I'd watch the top pick. Number was Citizen Kane which I'd already watched (5.0 so that was a good sign), which was Repulsion. And... yeah, that was not a good selection method. Critics and I do NOT see eye to eye. 

I also scanned their ratings of various other films, which generally seemed reasonable for films I'd seen, although with a very clear 'look at me I am a movie critic' bias, including one towards older films. I don't know how to correct for that properly. 

2Ben Pace
It's plausible that you are closer to asking "how much would I recommend this to a copy of myself" and I am trying more to ask "how much would I recommend this to other people". I personally got a lot out of each of the 3 films I gave a 3.5 to and for a copy of myself they would be a 4. The Boy and the Heron was very upsetting to me in a way that was new and rich (the first half an unrelenting depiction of a boy ruined by the death of his mother, in such contrast to what Miyazaki had set me up to anticipate in his films of childhood wonder and joy), and Asteroid City and The Holdovers in their own very different ways helped me see a lot of magic in the world that I normally miss. In retrospect I quite enjoyed my year of films. Oops, sorry to hear Scaruffi's recs took you down a wrong alley. I also no longer go through Scaruffi's best-ever list straightforwardly, those films can get strange (and a bit boring) very quickly. If I hear about a film being good I'll often check his rating, and if I see that it's above like 7.2, I take it as a sign that something artistically genuine and ambitious occurred (for the time, at least) in the film. It was a strong sign to me this year that Poor Things was 7.4 (and best-of-year), and that was a very accurate sign in retrospect. In other news I watched Anatomy of a Fall this weekend. It's ultimately a very simple film (with a lot of complicated emotions) and I'm glad I watched it; I will leave at 3.5 though, even for a copy of myself. Sumner's review is basically right.
Zvi20

Real estate can definitely be a special case, because (1) you are also doing consumption, (2) it is non-recourse and you never get a margin call, which provides a lot of protection and (3) The USG is massively subsidizing you doing that...

Zvi1-1

There are lead times to a lot of these actions, costs to do so are often fixed, and no reason to expect the rules changes not to happen. I buy that it is efficient to do so early.

'Greed' I consider a non-sequitur here, the manager will profit maximize.

Zvi20

I'm curious how many films you saw - having only one above 3.5 on that scale seems highly disappointing. 

2Ben Pace
Is it? I don't expect new films to be that good compared to "the best films in history". The only other three I watched (from this year) were Napoleon, Oppenheimer, and Spiderverse. I also watched older films for the first time including Oldboy, My Neighbor Totoro, Wargames, and The Mirror, but I wasn't comparing those.  (I'd say My Neighbor Totoro is a 4, the others I'm not sure about; also I fell asleep during The Mirror, but in my defense I had covid at the time.) I'm thinking of watching Anatomy of a Fall and May December based on your recs and those you link to.
Zvi20

Argument from incredulity? 

Zvi20

Thanks for the notes!

As I understand that last point, you're saying that it's not a good point because it is false (hence my 'if it turns out to be true'). Weird that I've heard the claim from multiple places in these discussions. I assumed there was some sort of 'order matters in terms of pre-training vs. fine-tuning obviously, but there's a phase shift in what you're doing between them.' I also did wonder about the whole 'you can remove Llama-2's fine tuning in 100 steps' thing, since if that is true then presumably order must matter within fine tuning.

Anyone think there's any reason to think Pope isn't simply technically wrong here (including Pope)? 

Anyone think there's any reason to think Pope isn't simply technically wrong here (including Pope)? 

I agree with Pope here (and came up with the same argument independently).  FWIW:

First, I don't agree with Evan's position in the linked comment, that "Whenever you talk to Claude or ChatGPT and it responds as a helpful AI [...], the reason it's doing that is because data ordering matters."

Claude and ChatGPT are given their inputs in a particular format that matches how the finetuning data was formatted.  This is closely analogous to "2024" or... (read more)

5evhub
I'm not exactly sure what "it" is here. It is true that our results can be validly reinterpreted as being about data ordering. My claim is just that this reinterpretation is not that interesting, because all fine-tuning can be reinterpreted in the same way, and we have ample evidence from such fine-tuning that data ordering generally does matter quite a lot, so it not mattering in this case is quite significant.
Zvi20

Yep, whoops, fixing.

ZviΩ382

That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”? 

2ryan_greenblatt
Deceive kinda seems like the wrong term. Like when the AI is saying "I hate you" it isn't exactly deceiving us. We could replace "deceive" with "behave badly" yielding: "The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.". I agree that using terms like "lying in wait", "treacherous plans", or "treachery" are a loaded (though it technically means almost the same thing). So I probably shouldn't have said this is a bit differently. I think the version of your statement with deceive replaced seems most accurate to me.
Zvi20

Did you see (https://thezvi.substack.com/p/balsa-update-and-general-thank-you)? That's the closest thing available at the moment.

This post was, in the end, largely a failed experiment. It did win a lesser prize, and in a sense that proved its point, and I had fun doing it, but I do not think it successfully changed minds, and I don't think it has lasting value, although someone gave it a +9 so it presumably worked for them. The core idea - that EA in particular wants 'criticism' but it wants it in narrow friendly ways and it discourages actual substantive challenges to its core stuff - does seem important. But also this is LW, not EA Forum. If I had to do it over again, I wouldn't bother writing this.

4Daniel
I am surprised to hear this, especially “I don't think it has lasting value”. In my opinion, this post has aged incredibly well. Reading it now, knowing that the EA criticism contest utterly failed to do one iota of good with regards to stopping the giant catastrophe on the horizon (FTX), and seeing that the top prizes were all given to long, well-formatted essays providing incremental suggestions on heavily trodden topics while the one guy vaguely gesturing at the actual problem (https://forum.effectivealtruism.org/posts/T85NxgeZTTZZpqBq2/the-effective-altruism-movement-is-not-above-conflicts-of) gets ignored, cements this as one of your more prophetic works.

I am flattered that someone nominated this but I don't know why. I still believe in the project, but this doesn't match at all what I'd look to in this kind of review? The vision has changed and narrowed substantially. So this is a historical artifact of sorts, I suppose, but I don't see why it would belong.

2Raemon
I think I'm personally interested in a more detailed self-review of how this project went. Like, Balsa Research seemed like your own operationalization of a particular hypothesis, which is "one partial solution to civilizational dysfunction is to accomplish clear wins that restore/improve faith that we can improve things on purpose." (or, something like that) I didn't nominate the post but having thought about it now, I'd be interested in an in-depth update on how you think about that, and what nuts-and-bolts of your thinking have evolved over the past couple years. 

I think this post did good work in its moment, but doesn't have that much lasting relevance and can't see why someone would revisit at this point. It shouldn't be going into any timeless best-of lists.

I continue to frequently refer back to my functional understanding of bounded distrust. I now try to link to 'How To Bounded DIstrust' instead because it's more compact, but this is I think the better full treatment for those who have the time. I'm sad this isn't seeing more support, presumably because it isn't centrally LW-focused enough? But to me this is a core rationalist skill not discussed enough, among its other features.

Zvi20

I do not monitor the EA Forum unless something triggers me to do so, which is rare, so I don't know which threads/issues this refers to. 

Zvi20

Yes, I mean the software (I am not going to bother fixing it)

Load More