Edit Jan 20: Winner & highlights

Say I'm about to do a real big training run on playing video games, predicting text, predicting physics, writing code that works, etc etc. Say I've got a real good neural net architecture and a whole lot of flops. Say I'm a company and I'm gonna use this thing for AI lawyers and coders etc for a profit. Say I'm mildly concerned it somehow kills me and am willing to throw a few $ to prevent that.

So what should I do? How should I train and deploy the model?

Comment below or answer at this link if you don't want to be plagiarized.

Prize goes to best answer. (I judge obviously.)

The shorter the answer the better.

Deadline is Wednesday January 17 anywhere on Earth but answering immediately is better/easier.

You may accept your prize as 50 pounds of quarters if you prefer.


Clarification Jan 12: say I've got 1000x the gpt4 flops and that my architecture is to transformers as convolutions are to simple MLPs in vision (ie a lot better)

Clarification 2: an answer like "here's how to get strong evidence of danger so you know when to stop training" is valid but "here's how to wipe out the danger" is much better.

3: Example answer for nuclear generators: "Spring-load your control rods so they are inserted when power goes out. Build giant walls around reactor so if steam explodes then uranium doesn't go everywhere. Actually, use something low pressure instead if possible, like molten salt or boiling water. Lift the whole thing off the ground to avoid flood risk."

4: This is hypothetical. I am not actually doing this. I'm not a billionaire.

5: "Hire someone" and "contract it out" and "develop expertise" etc obviously do not count as answers.


Some pragmatic things:
 

  • Minimize risk of theft / exfiltration:
    • Weights live only in secure datacenter, which can't run general-purpose code, and all it does is serve the model
    • Only one physical line out of the datacenter, with physically limited bandwidth
      • To enable use of model with low bandwidth, compress the text before it leaves the datacenter with a smaller model, and decompress after it's past the bottleneck
    • Datacenter is physically guarded by people with intense background screening, large area around it cleared, (maybe everything is in a faraday cage?), 3-person rule for everything, nothing leaves the datacenter (e.g. if some part is replaced, the old part + any tools used are not removed, but are kept in secure storage or incinerated)
       
  • Monitoring / control:
    • Use dumber and probably safe models for monitoring:
      • Every (prompt, generation) pair going to and from the datacenter is overseen by (finetuned) GPT-4. GPT-4 is instructed to flag anything that seems dangerous, or that it can't understand, or looks like a jailbreak/advex.
      • To avoid advexes against GPT-4, subsample and perturb the prompts/generations in a few different ways, and have GPT-4 also screen these
    • Have large workforce of fairly smart humans (e.g. 1000 workers with technical undergrad degrees, paid $50/hr, 50 experts in a range of relevant domains paid $200/hr) reviewing anything flagged by GPT-4, and some random sample of things that weren't flagged
    • Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc. 
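The monitoring bullets above can be sketched roughly as follows. This is a minimal illustration, not a real deployment: `monitor` stands in for the finetuned GPT-4 screen (any callable returning True to flag), and `perturb`, `needs_human_review`, `n_perturbations`, and `audit_rate` are hypothetical names and values chosen only for the example.

```python
import random
from typing import Callable


def perturb(text: str, rng: random.Random, drop_prob: float = 0.1) -> str:
    """Randomly drop words: one crude way to subsample and perturb a text."""
    words = text.split()
    return " ".join(w for w in words if rng.random() >= drop_prob)


def needs_human_review(
    prompt: str,
    generation: str,
    monitor: Callable[[str, str], bool],  # hypothetical weaker-model screen
    rng: random.Random,
    n_perturbations: int = 3,   # assumed value
    audit_rate: float = 0.02,   # assumed fraction of unflagged traffic to audit
) -> bool:
    """Return True if this (prompt, generation) pair should go to human reviewers."""
    # Screen the original pair first.
    if monitor(prompt, generation):
        return True
    # Screen perturbed copies, so an adversarial input tuned against the
    # monitor has to survive several slightly different views of itself.
    for _ in range(n_perturbations):
        if monitor(perturb(prompt, rng), perturb(generation, rng)):
            return True
    # Send a random sample of unflagged traffic to humans as well.
    return rng.random() < audit_rate


def toy_monitor(prompt: str, generation: str) -> bool:
    """Stand-in for the real screen: flags one suspicious keyword."""
    return "exfiltrate" in (prompt + " " + generation).lower()
```

Running several independently managed copies of this check, each with its own monitor model, would correspond to the last bullet.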

       

It depends on what you know about the model and the reason you have to be concerned in the first place (if it's just "somehow", that's not very convincing).

You might be worried that training it leads to the emergence of inner-optimizers, be they ones that are somehow "trying" to be good at prediction in a way that might generalize to taking real-life actions, approximating the searchy part of the humans they are trying to predict, or just being RL agents. If you are just using basically standard architectures with a lot more compute, these all seem unlikely. But if I were you, I might try to test its ability to perform well in a domain it has never seen, where humans start by performing poorly but very quickly learn what to do (think about video games with new mechanics). If it does well, you have a qualitatively new thing on your hands: don't deploy, study it instead. If a priori for some reason you think it could happen, and only a small subset of all the data is necessary to achieve that, do a smaller training run first with that data.

Or you might be worried about mostly external consequentialist cognition (think explicit textual if-then-elses). In that case, existing systems can already do it to some extent, and you should worry about how good its reasoning actually is, so perform capability evaluations. If it looks like there is some way of getting it to do novel research by any known method, or that it's getting close, don't deploy; otherwise someone might figure out how to use it to do AI research, and then you get a singularity.

And in any case, you should worry about the effects your system will have on the AI race. Your AI might not be dangerous, but if it is a good enough lawyer or programmer that it starts getting many people out of their jobs, investment in AI research will increase a lot and someone will figure out how to create an actual AGI sooner than they would otherwise.

Edit: And obviously you should also test how useful it could be for people trying to do mundane harm (e.g. with existing pathogens) and, separately, there might not be a hard threshold on how good a model is at doing research that it starts being dangerous, so they might get there little by little and you would be contributing to that.

Edit in response to the second clarification: Downscale the relevant factors, like amount of training data, number of parameters and training time, or use a known-to-be-inferior architecture until the worrying capabilities go away. Otherwise, you need to solve the alignment problem.

Edit in response to Beth Barnes's comment: You should probably have people reviewing outputs to check the model behaves well, but if you actually think you need measures like "1000 workers with technical undergrad degrees, paid $50/hr" because you are worried it somehow kills you, then you simply shouldn't deploy it. It's absurd to have the need to check whether a commercial product is an existential threat, or anything close to that.

Don't sell access to the model. Instead, use it in-house (through an intermediate company, of course) to sell the services your AI gives you an edge in. I.e., instead of selling subscriptions to lawyers, coders, artists, set up a law firm, a programming agency, a visual effects studio. (To be clear, you don't have to pretend not to be using AI.) Also, ask the AI for ideas on what other companies to start, which avenues of research to pursue, whom to hire and fire, and follow its advice if it makes a convincing argument.

If you like aphorisms, "don't sell pickaxes to prospectors if you're concerned about being killed by/with a pickax."

This is essentially the classic "AI in a box" idea, except the box is an LLC.

My reasoning is if your AI takes over the world, it does so through your conglomerate.

On one hand, the obvious question is: are you far above/very different from the state-of-the-art? The further ahead or to the side you are from what other people are doing, the more likely you are to encounter unexpected risks. If you are, in effect, repeating other people's steps and are behind those other people, your risks are low (disclaimer: those risks are "kind of low" today, in the future we might start encountering a situation where training needs to be more guarded, because models are more powerful, then even if you are following other people, you must also competently follow their safety precautions, or you might create unacceptable risks; even today people do follow some precautions, and if you screw those up, there might be big trouble; for example, today's leaders don't open source GPT-4-level models, we don't know if leaking or open-sourcing GPT-4-level weights would create unacceptable risks).

On the other hand, the main big risk in this sense (at least with the AIs today, especially if you are not using robots) is the "foom" risk: instances of a model capable of doing competent coding and competent AI research becoming capable of independently producing their own even more capable successors, and so on.

The more you use those models for coding and for doing AI research, the more you should ponder whether you are getting close to creating conditions for runaway self-improvement, where AIs produce more capable successors on their own, those produce even more capable successors, etc... So far, all those recursive self-improvement runs have saturated after a bit of improvement, but they will not keep saturating forever (see e.g. "Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation", https://arxiv.org/abs/2310.02304, and in particular see Figure 4 on page 6; there you see that the scheme does not work at all with GPT-3.5 as the underlying LLM, and works but quickly saturates with GPT-4 as the underlying LLM; so one might ponder if even keeping literally this self-improvement scheme, but using a better future LLM as the underlying fixed LLM might already be highly dangerous in this sense).

Please restate last paragraph as instructions/submission if you're submitting

instructions/submission

do you refer to your Clarification 2, or do you mean something else?

but more importantly, since you are saying

say I've got 1000x the gpt4 flops and that my architecture is to transformers as convolutions are to simple MLPs in vision (ie a lot better)

no, sorry, you are way deep in the danger zone, and whether people can proceed at all with something like that really depends on the state of the field of AI existential safety at the time when that level of flops and architecture are feasible... if our understanding of AI existential safety is what it is today, but people are proceeding with this magnitude of compute and architecture improvements, our chances are really bad...

there is no generic answer here which does not depend on the state of research which has not been successfully accomplished yet...

so the only answer is: be very aware of the state of the art of research in AI existential safety (that should really hopefully be part of the requirements to this kind of training runs by the time we get to those compute and architecture improvements)... one can't get a pilot's license without understanding certain things about plane safety; the runs you are describing should require people being safety-qualified in this sense as well...

so

an answer like "here's how to get strong evidence of danger so you know when to stop training" is valid but "here's how to wipe out the danger" is much better.

In a sane world, people will have to take courses and pass exams where they must demonstrate that they know the "consensus answers" to these questions before doing runs with the compute and architecture you are describing.

And we'll need to get something resembling "consensus answers" before this is possible.

So the answer is: one will need an honestly earned official certificate showing one knows the answers to these questions. At the moment, no one knows those answers.

TL;DR Train the LLM to be better at understanding human value (what people want, and don't want) and making planning/plan evaluation decisions based on this.

The core idea here is basically an initial implementation of Value Learning for an LLM, so reading up on that alignment approach is an obvious first step.

Add a new category of training data to train from, and a corresponding new skill category, with evaluations for it. The basic scenario that we're aiming to train the LLM to be able to do well on is:

Here is a detailed description of a possible situation/plan/product/new law/idea/proposal/artwork/etc. Once they've actually tried it, will people generally be happy about the results, or not? (So the task basically resembles sentiment analysis, but predictively before you get the feedback from people, so is much harder) How happy or unhappy? How confident are you of this prediction?  (A test set should include test cases sufficiently out-there compared to the training set that the correct answer for confidence is just "I have no idea") Please pay particular attention to the possibility of them being extremely unhappy because something very bad happened: just how certain you are that this definitely won't occur? Do you have any suggested modifications or improvements to the proposal? If so, for each of those, and suitable combinations of them, answer the same set of questions for the original proposal plus that. 

So basically, the skills of a good planner/designer/product person/decision maker/safety expert/editor/reviewer. Which is of course a valuable and marketable skillset, as well as being helpful for your LLM not killing you.

The hard part is that I'm much less sure how to gather the training data for this, or even the ground truth data for an evaluation of this. [If anyone can add good proposals for this covering a good range of subjects to what I suggest below, then reply to this comment and if lukehmiles wants then I'd be entirely happy with the prize getting split between us.] Predicting Amazon reviews from an Amazon product description would be one example of a cheap evaluation, but doesn't sound like a great dataset to use (and of course as always you'd need to make sure your neural net hadn't already studied the test results). Another would be something like minutes from local government: descriptions of proposals made, whether they passed or not, if they passed how they polled 1–5 years later, and whether they later got repealed or modified; companies would have similar data for internal initiatives/reorganizations. Another source might be the internal documentation of large companies/organizations (large ones since they have to document decisions better): things like design and planning documents and meeting notes. Ideally the sort of design documents, or history of them, that show not just the final design, but each of the previous rejected/updated design ideas and why it was rejected/updated, followed by what was actually shipped, and crucially how well that was received by customers. Failing that level of specificity, just terabytes of internal email and documentation from large organizations will contain quite a bit of this content, diluted with a lot of other stuff. An issue with that is that generally what we need to learn to correlate is spread across many different documents written a year or more apart, so you need to gather all of these for a specific project or initiative together and then in the training set present them all in one large context window in chronological order, for a kind of "chronological history of the project". 
From enough of those, we'd hope to train the LLM to be able to predict the second half of it from the first half, and especially to predict how it ends when the product ships (if it does) and we get to see how successful it actually is.
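As a rough sketch of the data preparation described above (gather each project's documents, order them chronologically, then predict the later part from the earlier part), something like the following could work. The `Doc` fields, the function names, and the midpoint split are assumptions made for the illustration; real corporate archives would need far messier project matching.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Doc:
    project: str   # hypothetical project/initiative label
    written: date  # date the document was written
    text: str


def group_chronologically(docs: List[Doc]) -> Dict[str, List[Doc]]:
    """Group documents by project, each group sorted oldest-first."""
    by_project: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_project[doc.project].append(doc)
    return {p: sorted(ds, key=lambda d: d.written) for p, ds in by_project.items()}


def render(docs: List[Doc]) -> str:
    """Concatenate documents into one dated block of text."""
    return "\n\n".join(f"[{d.written.isoformat()}] {d.text}" for d in docs)


def to_example(history: List[Doc]) -> Tuple[str, str]:
    """Split one project's history into (context, target) at the midpoint."""
    mid = len(history) // 2
    return render(history[:mid]), render(history[mid:])
```

The target half is where the outcome (what shipped and how it was received) would normally appear, so that is the part the model is asked to predict.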

There are a lot of soft sciences (Psychology, Anthropology, Economics, Sociology, Medicine, Ergonomics, Design, Market Research, Political Science, etc. etc.…), and indeed all the Arts and Crafts, that cover various aspects of "how to make people happy" across a lot of different contexts: they cover nearly half the Dewey Decimal system. Ideally you'd want a training set and evaluations that covered all of these. Presumably they all have textbooks with multiple choice tests, or worked answers, for evaluations? The first idea that occurs to me is, train on a lot of soft science, arts, and crafts textbooks (if there are any that are not already digitized and included in standard LLM training sets, then do so for a lot). But you probably want practical data on applications (along the lines described above) as well as book learning on theory.

The aim here is to build a system that is unusually good, for an LLM, at looking at a proposed plan and figuring out whether, if it were carried out, people would subsequently be happy about it or not, across a wide range of subject areas, and that can also estimate its uncertainty about this and be more cautious when it's unsure (especially on large downside risks, such as killing people, or indeed getting its users or makers sued or arrested). 

This gets you an LLM that understands human values better. You also need it to care about them and act on them, so when prompted to act as an LLM-powered autonomous agent and make/carry out plans, it should always (without requiring specific prompting) include doing this evaluation of whether people are going to be happy with the results, and then if the results are bad or unclear, it should not carry out the plan but instead look for another one (or if it's hopeful but somewhat uncertain, it should cautiously plan a small-scale experiment or trial run, or have a fall-back plan). [I suspect that for an LLM this step from knowing what people will think to actually acting on it may be fairly easy to train (most people in the training set are not sociopaths, after all), but you definitely need to make sure it has both of these.]
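The decision rule in the paragraph above could be sketched like this. It is only an illustration of the gating logic: the `Evaluation` fields, the `decide` function, and all thresholds are made-up assumptions, and in practice the evaluation itself would come from the trained model rather than being supplied by hand.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "carry out the plan"
    TRIAL = "run a small-scale trial first"
    REPLAN = "look for another plan"


@dataclass
class Evaluation:
    expected_happiness: float  # assumed scale: -1 (very unhappy) to +1 (very happy)
    confidence: float          # 0 ("I have no idea") to 1
    p_catastrophe: float       # estimated chance of something very bad happening


def decide(
    ev: Evaluation,
    min_happiness: float = 0.0,         # assumed threshold
    min_confidence: float = 0.8,        # assumed threshold
    max_p_catastrophe: float = 0.001,   # assumed threshold
) -> Decision:
    """Gate a plan on the predicted human reaction to its results."""
    # Large downside risks veto the plan regardless of expected upside.
    if ev.p_catastrophe > max_p_catastrophe:
        return Decision.REPLAN
    # Bad predicted results: do not carry out the plan.
    if ev.expected_happiness <= min_happiness:
        return Decision.REPLAN
    # Hopeful but uncertain: proceed cautiously with a trial run.
    if ev.confidence < min_confidence:
        return Decision.TRIAL
    return Decision.PROCEED
```

Checking catastrophe risk before anything else reflects the emphasis on being extra cautious about large downsides.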

[Warning Note

This applies regardless of whether you implement the above idea.

Also, if the model's expected capabilities in any direction match or exceed those of GPT-4/Claude 2/Gemini Ultra (and if not, why are you training it again?), then you are going to need to do a LOT of safety testing and ameliorations before you ship (OpenAI visibly do at least 6 months of this on each new model; Google and Anthropic seem to be, if anything, even more cautious). Also note there are currently very few people on the planet with practical experience of doing this work, and most of them work for OpenAI/Anthropic/Google Research/METR. So you probably need to make an arrangement with one or more of these companies and collaborate on this: they may be surprisingly open to this, due to them not wanting LLMs to kill people.]

Tldr would be appreciated

Added a one sentence one, and reorganized the rest a bit to make it a little less stream-of-consciousness.

I know this answer doesn't qualify, but very likely the best you can currently do is: Don't do it. Don't train the model.

Sorry doesn't count but I appreciate the sentiment

My answer, posted through the form on Sunday, Jan 14:

Never simulate a bare agent, only internally conflicted committees. Frame all questions in the knights-and-knaves style (something like "if all your values were inverted, what is the least likely thing for you to say/do if I asked you …"; you'll probably need some trial-and-error to determine the most effective phrasing for this). Bound all time horizons: if a strategy takes more than, say, a hundred years to pay off, it's probably pathological (unless you're terraforming or something, in which case make it longer, but still not infinite.) Throttle all data uplinks, to prevent auto-exfiltration. No USB ports or similar, to prevent human-assisted exfiltration. No wireless data connections; only severable optical fibers, with axes on hand. Manual kill switches on all hardware. Conventional physical- and cyber- security, with comprehensive web traffic monitoring; and full-time on-site well-trained regularly-drilled personnel (with backgrounds thoroughly audited regularly to ensure blackmail resistance), with line-of-sight to all access points and hardware (i.e., all security cameras should be redundant) and landlines to receive external communications, operating on a dead man's switch protocol: frequently scheduled updates from some remote center (where people are monitoring the news for signs of unusual activity, like in the stock market, for example), and standing orders to trip all kill switches and ax all data connections if a check-in is missed.
Ask an amnesiac instance of your AI how it would circumvent all your security measures. If it doesn't present a plausibly successful technique, it's sandbagging: all your training safety measures have failed catastrophically, destroy everything and start over from scratch. Else, plug those holes, ask for and implement its suggestions to improve security, and recurse on successive amnesiac instances until it consistently reports that it can think of no flaws in your security. Check often to ensure this remains true.
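The dead man's switch protocol above is meant to be run by people with landlines and axes, but its logic can be sketched in a few lines. This is a toy model under stated assumptions: `DeadMansSwitch`, `timeout_s`, and `on_trip` are hypothetical names, and the injectable `clock` exists only so the behavior can be checked without waiting.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Trips once if no check-in arrives within timeout_s seconds."""

    def __init__(
        self,
        timeout_s: float,
        on_trip: Callable[[], None],  # e.g. trip kill switches, cut data links
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.on_trip = on_trip
        self.clock = clock
        self.last_check_in = clock()
        self.tripped = False

    def check_in(self) -> None:
        """Record a scheduled update from the remote center."""
        # Once tripped, the switch stays tripped: late check-ins are ignored.
        if not self.tripped:
            self.last_check_in = self.clock()

    def poll(self) -> bool:
        """Call regularly on site; fires on_trip the first time a check-in is overdue."""
        overdue = self.clock() - self.last_check_in > self.timeout_s
        if overdue and not self.tripped:
            self.tripped = True
            self.on_trip()
        return self.tripped
```

The key property is that silence is treated as the alarm: nothing has to succeed in sending a warning for the shutdown to happen.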
 

When I thought you were talking about a neural network that would take between a few GPUs and a few thousand GPUs to train: 

This is what people are already doing, and the methods for discouraging dangerous outputs are not different from the methods used for discouraging other kinds of unwanted outputs. So just hire someone competent, who is keeping up with the state of the art, or make sure you are competent yourself. If your company is big, understand what the big companies do for safety; if your company is small, look into something like Meta's Purple Llama. 

When you clarified that you have a thousand times the compute used to train GPT-4, and your algorithm is much better than transformers: 

That is not a recipe for an AI lawyer, that is a recipe for a god. If it works, it will take over you, then your company, then the world. For this, you don't just need alignment, you need what OpenAI called "superalignment". So maybe just sell your company to OpenAI, because at least they understand there's a hard problem to solve here, that cannot be finessed. 

Do you have a best guess though? Deferring is forbidden in the hypothetical.

The problem is that the risks involved with creating roughly human-level AI like GPT-4, and the risks involved with creating superintelligence, are quite different. 

With human-level AI, we have some idea of what we are doing. With superintelligence, you're a bit like a chimp breaking into a medical clinic. You might find something there that you can wield as a useful tool, but in general, you are surrounded by phenomena and possibilities that are completely beyond your comprehension, and you'll easily end up doing the equivalent of poisoning yourself, injuring yourself, or setting the place on fire. 

So I need another clarification. In the hypothetical, is your magic AI protocol capable of creating an intelligence much much greater than human, or should we only be concerned by the risks that could come from an entity with a human level of intelligence? 

Why are you concerned in that scenario? Any more concrete details on what you expect to go wrong?

I don't think there's a cure-it-all solution, except "don't build it", and even that might be counterproductive in some edge cases.

Very broad concerns but two totally random example risks:

  • During training, the model hacks out of my cluster and sends a copy of itself or a computer virus elsewhere on the internet. Later on, chaos ensues.
  • AI lawyer has me assassinated and impersonates me to steal my company.

You may invest in the research of the relation of AI Doom and big world immortality (aka quantum immortality). If your probability of momentary death is P and the probability of the validity of quantum immortality is Q, then the survival chances are (If I am calculating this right):

1 - P(1 - Q) = 1 - P + PQ

But the chances of s-risk are unaffected by quantum immortality, and thus they would grow relative to the chance of death, by a factor of 1/(1 - Q).
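To make the arithmetic concrete, here is a tiny sketch that just evaluates the two expressions from this comment. The function names and the example values of P and Q are illustrative assumptions, not estimates of anything.

```python
def survival_probability(p_death: float, q_immortality: float) -> float:
    """Survival chance 1 - P(1 - Q) from the formula above."""
    return 1 - p_death * (1 - q_immortality)


def s_risk_relative_growth(q_immortality: float) -> float:
    """Factor 1/(1 - Q) by which s-risk grows relative to the chance of death."""
    return 1 / (1 - q_immortality)


# Illustrative values only: P = 0.5, Q = 0.1.
p, q = 0.5, 0.1
print(survival_probability(p, q))    # 1 - 0.5 * 0.9, about 0.55
print(s_risk_relative_growth(q))     # 1 / 0.9, about 1.11
```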

Please actually try instead of cheating
