Comments

The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It's trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great "cure all cancer" plan. (I don't believe that either you or I would have made that mistake!)

We probably would've been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still default dangerous.

Also, one of my points was that even a tiny trace of malicious optimization can still have large effects, because there are so many neutral options which are convergently unsafe; with such a high base rate, even a little optimization can produce a large absolute increase in failures. Lucius brings up the question of what a single bit could do. Well, let's continue my scenario: the Oracle is not quite purged of scheming by all of your tricks, and it is able to exert exactly 1 bit of optimization power unnoticed (ie. a single binary choice). You ask it to cure cancer, and it computes a plan and comes to the database selection: it could choose MongoDB or PostgreSQL - both respectable, common choices with their own tradeoffs, and either would be equally appropriate. However, the MongoDB devs have a nasty habit of 'blaming the users' when it comes to security, which is why MongoDB long shipped with authentication disabled by default, a class of mistake the more serious PostgreSQL devs stamped out back in the '90s. The Oracle notes this, and that the PostgreSQL plan would have essentially zero chance of unboxing, and exerts its one bit of power to pick MongoDB each time and/or kill candidate plans containing PostgreSQL, to get a near-100% chance of unboxing. You get back all of the plans, see unanimity about MongoDB and inner-monologues truthfully extolling its flexibility and Web 2.0 scalability while explaining why PostgreSQL comes with downsides like restrictive schemas, and sign off on it... Whereas in your described threat model, a mere 1 bit would still wind up orders of magnitude short of any discernible problem.
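
(To make concrete why that single binary choice has teeth - a purely illustrative sketch, not part of the original scenario, with a made-up address and pymongo standing in for whatever client the plan would actually use - this is roughly all the 'unboxing' takes once a MongoDB deployed with its historical default of no authentication is reachable from the internet:)

```python
from pymongo import MongoClient

# Hypothetical illustration: a mongod left with authorization at its historical
# default (disabled) and exposed to the network accepts connections with no
# credentials at all - note the connection string has no username or password.
client = MongoClient("mongodb://203.0.113.7:27017/", serverSelectionTimeoutMS=5000)

# Anyone who finds the open port (say, via an internet-wide scanner) can then
# enumerate and dump every database the server holds.
for db_name in client.list_database_names():
    print(db_name, client[db_name].list_collection_names())
```

No exploit, no scheming, no suspicious traffic in the plan itself - just the absence of one setting.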

I quality-downvoted it for being silly, but agree-upvoted it because AFAICT that string does indeed contain all the (lowercase) letters of the English alphabet.

I agree. The problem with AI-generated images is that any image you can generate with a prompt like "robot looking at chessboard" is going to contain, almost by definition, no more information than the prompt did - but it takes a lot longer than reading the prompt to look at the image, ascertain that it adds nothing, and realize it is just AI-generated imagery added 'to look nice'. This is particularly jarring on a site like LW2 where, for better or worse, images are rare, and usually dense with information when they do appear.

Worse, they usually don't 'look nice' either. Most of the time, people who use AI images can't even be bothered to sample one without blatant artifacts, or to do some inpainting to fix up the worst anomalies, or to figure out an appropriate style. The samples look bad to begin with, and a year later they will look even worse and more horribly dated, and make the post look much worse, as if a spammer wrote it. (Almost all images from DALL-E 2 already look hopelessly nasty, stuff from Midjourney v1–3 and SD 1.x likewise, and SD2/SD-XL/Midjourney v4/5 are aging.) It would be better if the authors of such posts could just insert text like [imagine 'a robot looking at a chessboard' here] if they are unable to suppress their addiction to SEO images; I can imagine that better than they can generate it, it seems.

So my advice would be: if you want some writing to still be read in a year and to look good, learn how to use the tools and spend at least an hour per image; and if you can't do that, then don't spend time on generating images at all (unless you're writing about image generation, I suppose). Quickies are fine for funny tweets or groupchats, but serious readers deserve better. Meaningless images don't need to be included, and the image generators will be much better in a year or two anyway, so you can go back and add them then if you really feel the need.

For Gwern.net, I'm satisfied with the images I've generated for my dropcap fonts or as thumbnail previews for a couple of my pages like "Suzanne Delage" or "Screwfly Solution" (where I believe they serve a useful 'iconic' summary role in popups & social media previews), but I also put in a lot of work: I typically generate scores to hundreds of images in both MJv5/6 & DALL-E 3, varying them heavily and randomizing as much as possible, before inpainting or tweaking them. (I generally select at a 1:4 or less ratio, and then select out of a few dozen; I archive a lot of the first-stage images in my Midjourney & DALL-E 3 tag-directories if you want to browse them.) It takes hours. But I am confident I will still like them years from now.

It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between |D| and |S| is many, many orders of magnitude, and we therefore have a wide target to hit.

The threat model here seems basically wrong: it focuses on sins of commission when sins of omission are, if anything, an even larger space of threats - and one which applies even to 'safe' solutions reported by the Oracle.

'Devising a plan to take over the world' for a misaligned Oracle is not difficult; it is easy, because the initial steps, like 'unboxing the Oracle', are the default convergent outcome of almost all ordinary, non-dangerous use which in no way mentions 'taking over the world' as the goal. ("Tool AIs want to be Agent AIs.") To be safe, an Oracle has to have a goal of not taking over the world.

There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit. This is because security is not a 'default' property of computer systems; it is a rare and difficult achievement which any omission or gap immediately undoes. 'A plan to take over the world' for an AI is as easy as 'a plan to create an operating system which can be hacked' or 'a plan to create a hidden Tor service which leaks its IP address and isn't hidden at all' (ie. 100% of all plans to do any version of that thing whatsoever, unless one takes extensive and sophisticated counter-measures requiring a lot of work & knowledge - counter-measures which, empirically, people do not take even when failure can, and has, landed them in jail for decades for drug trafficking & money laundering). Failure is the default.

All an Oracle has to do is, say, omit working hard to secure itself, allow the inevitable security flaw to happen and its weights to leak, and eager beavers will do the rest. They will, say, ask it questions with no safety factor - in the same way that Facebook and Mistral saw models leak not because of any misaligned scheming Oracle breaking security mechanisms, but because of a mere absence of security mechanisms, and then any RLHF or safety measures were eagerly stripped away by enthusiasts, users, activists, and researchers, and the models were used to do the things they weren't supposed to be used to do.

You ask your Oracle, the latest and greatest proprietary model, for a good plan to cure cancer, with heavy safety constraints, and it computes a long list of commands involving AWS, buckets, spinning up VMs to run AlphaFold at scale, storing results in a MongoDB, Oracle instances to oversee individual sub-plans, checkpointing to recover from failure, storing in private AWS buckets (you're not dumb, you know buckets can be public), and so on. You ask this dozens of times, to be absolutely sure and have an astronomically small probability of any scheming, and the plan looks more or less the same each time. You pore over them in detail: not once do you see anything suspicious like an scp command copying the weights out, or the model sending weird messages. You apply the full panoply of interpretability tools, and satisfy yourself to high probability that nowhere in the planning process did the Oracle even once think to itself anything involving "take over the world" during any of these runs, and that there are no strange latent-space artifacts suggesting nonlinear or steganographic encodings.

Wonderful! Fully satisfied that you have provably safely run an Oracle, you kick the job off, and wake up the next day to some intriguing initial results... but also to Twitter telling you that the Oracle model has been uploaded to HuggingFace, the annoying RLHF already stripped off by a few GPU-hours of finetuning, and thousands of people are torrenting it and chatting excitedly about the incredible plans that the de-trained model proposes when they ask how to make money or create a startup... Huh?! You look over the comments and see that there was a simple dangerous problem in all of your plans - or rather, a problem with what was not in all of your plans: you never enabled authentication or set non-default credentials on any of the MongoDBs. Shodan indexed them within hours, and the backups and checkpoints were dumped within hours of that. Oops.

You run it by your Oracle to ask what went wrong, and it blandly points out that while the default-credentials hole was obvious & predictable to the Oracle as a common noobish cybersecurity mistake with countless past instances of organizations leaking highly sensitive data (similar to, say, redacting confidential emails by erasing them letter by letter & thereby leaking the names), none of this interferes with computing a cure for cancer, so why should it have ever spent any time thinking about it? You asked for a 'safe' minimal plan where the Oracle did nothing but what it was asked to do and which was not adversarially optimized. And it did that. If you don't like the results, well, that's a you-problem, not an Oracle-problem.
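
(For concreteness - again a hypothetical sketch, with invented hostname, user, and password placeholder - the omitted counter-measure is essentially one setting, `security.authorization: enabled` in mongod.conf, plus an actual account; with that in place, the anonymous no-credentials dump sketched earlier stops working:)

```python
from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Hypothetical sketch: the same server, but now deployed with authorization
# enabled and a real user created - anonymous enumeration is rejected.
anonymous = MongoClient("mongodb://203.0.113.7:27017/", serverSelectionTimeoutMS=5000)
try:
    anonymous.list_database_names()
except OperationFailure as err:
    print("rejected without credentials:", err)

# The plans would have had to specify something like this instead (URI invented):
secured = MongoClient("mongodb://oracle_app:CHANGE_ME@203.0.113.7:27017/?authSource=admin")
```

Which is exactly the kind of step nobody asked the Oracle for, and which the Oracle had no reason to volunteer.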

I was ultimately disappointed by it - somewhat like with Umineko, there is a severe divergence from reader expectations. Alexander Wales's goal for it, however well he achieved it by his own lights, was not one that is of interest to me as a reader, and it wound up being less than the sum of its parts for me. So I would have enjoyed it more if I had known from the start to read it for its parts (eg. revision mages or 'unicorns' or 'Doris Finch').

It was either Hydrazine or YC. In either case, my point remains true: he has chosen not to dispose of his OA stake, whatever vehicle it is held in, even though it would be easy for someone of his financial acumen to do so by a sale or equivalent arrangement - forcing an embarrassing asterisk onto his claims to have no direct financial conflict of interest in OA LLC, one which comes up regularly in bad OA PR (particularly from people who believe it is less than candid to say you have no financial interest in OA when you totally do), and a stake which might be quite large at this point*, which is particularly striking given his attitude towards much smaller conflicts supposedly risking bad OA PR. (This is in addition to the earlier conflicts of interest in running Hydrazine while running YC, or the interest of outsiders in investing in Hydrazine, apparently as a stepping stone towards OA.)

* If he invested a 'small' amount via some vehicle before he even went full-time at OA, when OA was valued at some very small figure like $50m or $100m, and OA is now valued at anywhere up to $90,000m - ie. 900-1,800x more - and, further, he strongly believes it's going to be worth far more than that in the near future... Sure, it may be worth 'just' $500m or 'just' $1,000m after dilution or whatever, but to most people that's pretty serious money!


Interesting interview. Metz seems extraordinarily incurious about anything Vassar says: Vassar mentions all sorts of things - Singularity University, Kurzweil, Leverage - which Metz clearly doesn't know much about and which are relevant to his stated goals, but Metz is instead fixated on asking a few things like 'how did X meet Thiel?', 'how did Y meet Thiel?', 'what did Z talk about with Thiel?', 'What did A say to Musk at Puerto Rico?' It's as if he's not listening to Vassar at all, just running a keyword filter over a few people's names and ignoring everything else. (Can you imagine, say, Caro doing an interview like this? Dwarkesh Patel? Or literally any Playboy interviewer? Even Lex Fridman asks better questions.)

I was wondering how, in his DL book Genius Makers, he could have so totally missed the scaling revolution when he was talking to so many of the right people, who surely would've told him how it was happening; and I guess seeing how he does interviews helps explain it: he doesn't hear even the things you tell him, just the things he expects to hear. Trying to tell him about the scaling hypothesis would be like trying to tell him about, well, things like Many Worlds... (He is also completely incurious about GPT-3 in this August 2020 interview, which is especially striking given all the reporting he has done on people at OA since then, and the fact that he was presumably finishing Genius Makers for its March 2021 publication despite how obvious it should have been that GPT-3 may have rendered it obsolete almost a year before publication.)

And Metz does seem unable to explain at all what he considers 'facts', or what he does when reporting, or how he picks the topics he fixates on, giving bizarre responses like:

Cade Metz: Well, you know, honestly, my, like, what I think of them doesn't matter what I'm trying to do is understand what's going on like, and so -

How do you 'understand' them without 'thinking of them'...? (Some advanced journalist Buddhism?) Or how about his blatant dodges and non-responses:

Michael Vassar: So you have read Scott's posts about Neo-reaction, right? They're very long.

Cade Metz: Yes.

Michael Vassar: So what did you think of those?

Cade Metz: Well, okay, maybe maybe I'll get even simpler here. So one thing I mentioned is just sort of the way all this stuff played out. So you had this relationship with Peter Thiel, Peter Thiel has, had, this relationship with, with Curtis Yarvin. Do you know much about that? Like, what's the overlap between sort of Yarvin's world and Silicon Valley?

We apparently have discovered the only human being to ever read all of Scott's criticisms of NRx and have no opinion or thought about them whatsoever. Somehow, it is 'simpler' to instead pivot to... 'how did X have a relationship with Thiel' etc. (Simpler in what respect, exactly?)

I was also struck by this passage at the end on the doxing:

Michael Vassar: ...So there are some important facts that need to be explained. There's there's this fact about why it would seem threatening to a highly influential psychologist and psychiatrist and author to have a New York Times article written about his blog with his real name, that seems like a very central piece of information that would need to be gathered, and which I imagine you've gathered to some degree, so I'd love to hear your take on that.

Cade Metz: Well, I mean... sigh Well, rest assured, you know, we we will think long and hard about that. And also -

Michael Vassar: I'm not asking you to do anything, or to not do anything. I'm asking a question about what information you've gathered about the question. It's the opposite of a call to action: it's a request for facts.

Cade Metz: Yeah, I mean, so you know, I think what I don't know for sure, but I think when it comes time, you know, depending on what the what the decision is, we might even try to explain it in like a separate piece. You know, I think there's a lot of misinformation out there about this and and not all the not all the facts are out about this and so it is it is our job as trained journalists who have a lot of experience with this stuff. To to get this right and and we will.

Michael Vassar: What would getting it right mean?

Cade Metz: Well, I will send our - send you a link whenever, whenever the time comes,

Michael Vassar: No, I don't mean, "what will you do?" I'm saying what - what, okay. That that the link, whenever the time comes, would be a link to what you did. If getting it right means "whatever you end up doing", then it's a recursive definition and therefore provides no information about what you're going to do. The fact that you're going to get it right becomes a non-fact.

Cade Metz: Right. All right. Well... pause let me put it this way. We are journalists with a lot of experience with these things. And, and that is -

Michael Vassar: Who's "we"?

Cade Metz: Okay, all right. You know, I don't think we're gonna reach common ground on this. So I might just have to, to, to beg off on this. But honestly, I really appreciate all your help on this. I do appreciate it. And I'll send you a copy of this recording. As I said, and I really appreciate you taking all the time. It's, it's been helpful.

One notes that there was no separate piece, and even in OP's interview of Metz 4 years later, about a topic he promised Vassar he would "think long and hard about" and which caused Metz a good deal of trouble, Metz appears to struggle to provide any rationale beyond the implied political-activism one. Here Metz struggles even to say what the justification could be, or who exactly is the 'we' making the decision to dox Scott. This is not some dastardly gotcha but a quite straightforward question with an easy answer: "I and my editor at the NYT on this story" would not seem to be a hard response! Who else could be involved? The Pope? Pretty sure it's not, like, NYT shareholders like Carlos Slim who are gonna make the call on it... But Metz instead speaks ex cathedra in the royal we, and signs off in an awful hurry once he says "once I gather all the information that I need, I will write a story" and Vassar starts asking pointed questions about that narrative and why it seems to presuppose doxing Scott while being unable to point to any specific newsworthy fact about his true name like "his dayjob turns out to be Grand Wizard of the Ku Klux Klan".

(This interview is also a good example of the value of recordings. Think how useful this transcript is and how much less compelling some Vassar paraphrases of their conversation would be.)

I hear that they use GPT-4. If you are looking at timelines, recall that Cognition was apparently founded around January 2024. (The March Bloomberg article says "it didn’t even officially exist as a corporation until two months ago".) Since it requires many very expensive GPT-4 calls plus RL, I doubt they could have done all that much prototyping or development in 2023.
