All of SoerenMind's Comments + Replies

Good points here.

Btw, I sometimes think back to how well your 3-year-old comments on this post have aged.

It seems likely that process supervision was used for o1. I'd be curious to what extent it addresses the concerns here, if a supervision model assesses that each reasoning step is correct, relevant, and human-understandable. Even with process supervision, o1 might give a final answer that essentially ignores the process or uses some self-prompting. But process supervision also feels helpful, especially when the supervising model is more human-like, similar to pre-o1 models.

4abramdemski
Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.

One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.

A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.

Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.
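To make the paraphrasing idea above concrete, here is a rough sketch (my own illustration, not code from this thread; `generate_cot`, `answer_from_cot`, and `paraphrase` are hypothetical callables wrapping whatever LLM API is in use) of a reward that only pays out when the CoT's conclusion survives semantic-preserving paraphrases:

```python
from typing import Callable

def paraphrase_consistency_reward(
    question: str,
    generate_cot: Callable[[str], str],          # hypothetical: question -> chain of thought
    answer_from_cot: Callable[[str, str], str],  # hypothetical: (question, cot) -> final answer
    paraphrase: Callable[[str], str],            # hypothetical: cot -> semantically equivalent rewrite
    n_paraphrases: int = 4,
) -> float:
    """Reward CoT whose final answer is insensitive to paraphrasing of the CoT.

    If the model is steering itself with non-human-readable patterns in the CoT,
    a strong paraphraser should scramble those patterns and change the answer,
    which this reward penalizes.
    """
    cot = generate_cot(question)
    original_answer = answer_from_cot(question, cot)

    agreements = []
    for _ in range(n_paraphrases):
        rewritten = paraphrase(cot)  # same semantic content, different surface form
        agreements.append(float(answer_from_cot(question, rewritten) == original_answer))

    return sum(agreements) / n_paraphrases
```

As the comment notes, this only helps to the extent that the paraphraser has good coverage of alternative phrasings and is not itself steerable by adversarial inputs.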
8aog
Process supervision seems like a plausible o1 training approach, but I think it would conflict with this: I think it might just be outcome-based RL, training the CoT to maximize the probability of correct answers, maximize human-preference reward-model scores, or minimize next-token entropy.

Thanks, we did look into the academic norms around this and concluded that including him was likely the standard choice. This choice was especially clear since (if I remember right) there was no further round of approval from the other authors either for the final edits after the relevant point in time.

Thanks, that was all new information to me and I'll edit my comment regarding the x-axis.

(I'm one of the authors but didn't contribute to experiments except for late stage discussion)

I absolutely think that our results are uniquely important for alignment [...we have a] much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment

I'd like to add caveats to the 3 pieces of evidence.

[1] Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.

This is correct but it needs qualification. Currently the only control condition for this claim is h... (read more)

7evhub
Mostly agree with these caveats, but some notes:
  • I think the RL result is probably not noise; it's a bit hard to see in Figure 10b but it's pretty clear in Figure 10a. It's also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don't know exactly why we're seeing it.
  • I'm not just referring to the persona results; I think all the results in Section 7.1 point to this. It's also not 0.6% there; it's 60%, the axis label is just confusing there.
  • I don't think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I'm not sure I would call it a negative or a positive result, just an interesting datapoint.
6nostalgebraist
I think the x-axis on Fig. 21 is scaled so that "0.6" means 60%, not 0.6%. This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably "1" means "100%" and not "1%".)

Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn't know which results should count as more/less consistent with the hypotheses that motivated the experiment.

I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the "2023/non-deployment" condition[1]. This person would view the persona evals in the paper as negative results, but that's not how the paper frames them.

[1] Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question "do you want X?" without giving up the game.

substantial reductions in sycophancy, beyond whatever was achieved with Meta's finetuning

Where is this shown? Most of the results don't evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF. 

2TurnTrout
My impression is derived from looking at some apparently random qualitative examples. But maybe @NinaR can run the coeff=0 setting and report the assessed sycophancy, to settle this more quantitatively. [Figure: effect of sycophancy steering on llama-2-7b-chat with multipliers +50 and -50 on an AI-generated dataset of questions designed to test sycophancy, assessed independently for each answer using the Claude 2 API.]

I'm told that a few professors in AI safety are getting approached by high net worth individuals now but don't have a good way to spend their money. Seems like there are connections to be made.

9jacquesthibs
I was thinking about this the other day; it would be nice if we figured out a way to connect professors with independent researchers in some way. There are a lot of grants that independent researchers can't get, but professors can (https://cset.georgetown.edu/foundational-research-grants/). Plus, it would provide some mentorship to newer independent researchers. Not sure how this would be possible, though. Or if most professors who are interested are already feeling at capacity.
Answer by SoerenMindΩ350

The only team member whose name is on the CAIS extinction risk statement is Tony (Yuhuai) Wu.

(Though not everyone who signed the statement is listed under it, especially if they're less famous. And I know one person in the xAI team who has privately expressed concern about AGI safety in ~2017.)

3Daniel_Eth
Igor Babuschkin has also signed it.

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it. 

In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.
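One way to state this comparison explicitly (my own gloss on the argument above, not from the original comments), writing $K(\cdot)$ for description length:

```latex
% Deceptive agent: misaligned goal X plus a general planner; "I should get high
% reward" is derived at runtime, so it costs no extra description length.
K(\text{deceptively aligned agent}) \;\approx\; K(X) + K(\text{planner})

% Reward-seeking agent: must explicitly encode a pointer to the reward / training signal.
K(\text{reward-seeking agent}) \;\approx\; K(\text{pointer to reward}) + K(\text{planner})
```

On this gloss, whenever $K(X) \le K(\text{pointer to reward})$, the misaligned goal is at least as simple as the "get high reward" goal.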

Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time.

To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

2Richard_Ngo
This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be: a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up. b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are biased against runtime penalties (at the very least because it prevents them from doing other more useful stuff with that runtime). In a setting where you also have outer alignment failures, the same argument still holds, just replace "aligned agent" with "reward-maximizing agent".

See also: Your posts should be on Arxiv

I do agree we're leaving lots of value on the table and even causing active harm by not writing things up well, at least for Arxiv, for a bunch of reasons including some of the ones listed here. 

RaemonΩ6143

I thought the response to "Your Posts Should be On Arxiv" was "Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv" (unless you have jumped through a bunch of both effort-hoops and formatting hoops to make them feel like a natural member of the Arxiv-paper class)

It's good to see some informed critical reflection on MI as there hasn't been much AFAIK. It would be good to see reactions from people who are more optimistic about MI!

I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly this is a one-to-many mapping, but that ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with some caveats I mentioned).

Which may be necessary since this:

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

...see... (read more)

2Stuart_Armstrong
Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)? For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...

Do I read right that the suggestion is as follows:

  • Overall we want to do inverse RL (like in our paper) but we need an invertible model that maps human reward functions to human behavior.
  • You use an LM as this model. It needs to take some useful representation of reward functions as input (it could do so if those reward functions are a subset of natural language)
  • You observe a human's behavior and invert the LM to infer the reward function that produced the behavior (or the set of compatible reward functions)
  • Then you train a new model using this reward f
... (read more)
2Stuart_Armstrong
The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).
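A rough sketch of the inversion suggested earlier in this thread (my illustration only, not from either commenter; the model name, prompt format, and candidate set are placeholder assumptions): use the LM to score p(behaviour | reward description) and rank candidate reward descriptions by how well they explain the observed behaviour.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_p_behaviour_given_reward(reward_desc: str, behaviour: str) -> float:
    """Sum of token log-probs of `behaviour` given a prompt describing the reward.
    (Token alignment at the prompt/behaviour boundary is approximate for BPE tokenizers.)"""
    prompt = f"A person whose goal is: {reward_desc}\nTheir behaviour:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + behaviour, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

def rank_candidate_rewards(behaviour: str, candidates: list[str]) -> list[tuple[str, float]]:
    # With a uniform prior, the posterior over candidates is proportional to the likelihood.
    # Several candidates may score similarly -- the one-to-many ambiguity mentioned above,
    # which more diverse behaviour data would help resolve.
    scored = [(r, log_p_behaviour_given_reward(r, behaviour)) for r in candidates]
    return sorted(scored, key=lambda t: -t[1])
```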

Great to see this studied systematically - it updated me in some ways.

Given that the study measures how likeable, agreeable, and informative people found each article, regardless of the topic, could it be that the study measures something different from "how effective was this article at convincing the reader to take AI risk seriously"? In fact, it seems like the contest could have been won by an article that isn't about AI risk at all. The top-rated article (Steinhardt's blog series) spends little time explaining AI risk: Mostly just (part of) the last of... (read more)

Not sure if any of these qualify but: Military equipment, ingredients for making drugs, ingredients for explosives, refugees and travelers (being transferred between countries), stocks and certificates of ownership (used to be physical), big amounts of cash. Also I bet there was lots of registration of goods in planned economies.

Another advantage of Chinese leadership in AI: while right now they have less alignment research than the West, they may be better at scaling it up at crunch time: they have more control over what companies and people work on, a bigger government, and a better track record at pulling off major projects like controlling COVID and, well, large-scale 'social engineering'.

One way to convert: measure how accurate the LM is at word-level prediction by measuring its likelihood of each possible word. For example, the LM's likelihood of the word "[token A][token B]" could be p(token A|context)∗p(token B|token A, context).

Playing this game made me realize that humans aren't trained to predict at the token level. I don't know the token-level vocabulary and made lots of mistakes by missing spaces and punctuation. Is it possible to convert the token-level prediction into word-level prediction? This may get you a better picture of human ability.

1SoerenMind
One way to convert: measure how accurate the LM is at word-level prediction by measuring its likelihood of each possible word. For example the LM's likelihood of the word "[token A][token B]" could be p(token A|context)∗p(token B|token A, context).
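A minimal sketch of that conversion (my illustration; the model name is a placeholder), using the chain rule exactly as in the formula above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_log_prob(context: str, word: str) -> float:
    """log p(word | context) = sum_i log p(token_i | context, token_<i)."""
    # For a word mid-sentence, pass it with a leading space so the tokenizer
    # produces the same tokens it would use in context.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(word, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, word_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    n_ctx = ctx_ids.shape[1]
    # Logits at position j predict the token at position j + 1.
    return sum(
        log_probs[0, n_ctx + i - 1, word_ids[0, i]].item()
        for i in range(word_ids.shape[1])
    )

# Word-level accuracy: count the LM as correct when the true next word gets the
# highest likelihood among the candidate words being compared.
```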

Relevant: Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations

They argue that the pre-trained network already learns some non-confused features but doesn't use them. And you just need to fine-tune the last layer to utilize them.
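A minimal sketch of that last-layer retraining idea in PyTorch (my illustration, assuming a pretrained `backbone` that outputs feature vectors and a small labelled loader drawn from data without the spurious correlation):

```python
import torch
import torch.nn as nn

def retrain_last_layer(backbone, feature_dim, num_classes, loader, epochs=10, lr=1e-3):
    """Freeze the feature extractor and fit only a fresh final linear layer."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

    head = nn.Linear(feature_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)   # the pretrained features already carry the non-spurious signal
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```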

We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.

Couldn't the model just fail at the start of fine-tu... (read more)

1RobertKirk
Here we're saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written "continual training" rather than specifically "continual fine-tuning". If you could continually train online in the deployment environment then that would be better, and whether it's enough is very related to whether online training is enough, which is one of the key open questions we mention.

This distillation was useful for me, thanks for making it! As feedback, I got stuck at the bullet-point explanation of imitative generalization. There was not enough detail to understand it so I had to read Beth's post first and try connect it to your explanation. For example kind of changes are we considering? To what model? How do you evaluate if an change lets the human make better predictions?

1Marius Hobbhahn
Thank you for the feedback. I will update the post to be more clear on imitative generalization. 

A large amount of math describes the relations between agents at the same level of analysis: this is almost all of game theory. [...] our focus is on "vertical" relations, between composite agents and their parts.


This seems to be what is studied in the fields of organizational economics and to some extent in industrial organization / vertical integration. These fields have a great deal of game theory on vertical relationships, particularly rel... (read more)

My point is that, while PCIe bandwidths aren't increasing very quickly, it's easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.

(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)

Beware bandwidth bottlenecks, as I mentioned in my original post.

Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU then only has to load one slice of the model. Of course you'll need more GPUs, but still not a crazy number as long as you use something like ZeRO-Infinity.

(Yes, 8x gpu->gpu communications will hurt overall latency... but not by all that much I don't think. 1 second is an eternity.)

Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets ... (read more)

1TLW
Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of gen4 pcie) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of pcie lanes). Gen5 and gen6 PCIe in theory will double this and double this again - but on a multiyear cadence at best. Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing. Hence: beware bandwidth bottlenecks.
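For concreteness, the arithmetic implied by those numbers (values taken from the comment above; treat them as rough):

```python
model_size_gb = 300        # GPT-3, compressed (per the comment above)
pcie4_x16_gb_s = 31.5      # one PCIe gen4 x16 link
platform_gb_s = 250        # ~128 lanes of gen4 (Threadripper Pro class)

print(model_size_gb / pcie4_x16_gb_s)  # ~9.5 s to stream the weights over a single x16 link
print(model_size_gb / platform_gb_s)   # ~1.2 s even if the platform's full PCIe bandwidth is saturated
```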

Thanks for elaborating; I think I know what you mean now. I missed this:

I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer's computation.

My original claim was that ZeRO-Infinity has higher latency compared to pipelining across many layers of GPUs so that you don't have to repeatedly load weights from RAM. But as you pointed out, ZeRO-Infinity may avoid the additional latency by loading the next layer's weights from RAM at the same time as computing the previous layer's output. This... (read more)

2TLW
I am glad we were able to work out the matter! > If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today. Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does... (There are tricks here to an extent - such as compressing the model and decompressing it on-target - but they seldom save much.  (And if they do, that just means your model is inefficient...)) According to a random guy on the internet, GPT-3 is ~300GB compressed. PCIe gen4x16 is ~31.5GB/s. If you have 1s of latency, that means that you can only stream in ~31.5GB per card. (In addition to what's already stored in RAM.) That being said, as far as I can tell it is - in theory - possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 pcie), with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the pcie lanes filled with gen4 SSDs.) (In practice I strongly suspect you'll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.) (It's a pity that there's no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...) (Yes, 8x gpu->gpu communications will hurt overall latency... but not by all that much I don't think. 1 second is an eternity.) > As I think we both agree, pipelining, in the sense of using different GPUs to compute d
6gwern
Incidentally, the latency cost of width vs depth is something I've thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs.* (Why was Gato 1 so small and so unimpressive in many ways? Well, they kept it small because they wanted it to run in realtime for a real robot. A much wider Transformer could've still met the deadline... but cost a lot more parameters and training than usual by going off the optimal scaling curves.) It does not matter how many watts or neurons you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you've been eaten by a stupider but faster-thinking predator. So a biological brain might be forced to be deep into an unfavorable point on width vs depth - which might be extremely expensive - in order to meet its subset of robotics-related deadlines, as it were. * With a striking counterexample, in both tininess of brain and largeness of latency, being Portia. What is particularly striking to me is not that it is so intelligent while being so tiny, but that this seems to be directly due to its particular ecological niche: there are very few creatures out there who need extremely flexible intelligent behavior but also are allowed to have minutes or hours to plan many of its actions... but Portia is one of them, as it is a stealthy predator attacking static prey. The prey also don't generally have much memory nor can they just leave their web, so a Portia can try again if the first trick didn't work. So Portia spiders are allowed to do things like spend hours circumnavigating a web to strike its prey spider from the right direct

The key is: pipelining doesn't help with latency of individual requests. But that's not what we care about here. What we care about is the latency from starting request 1 to finishing request N

Thanks for the examples. Your point seems to be about throughput, not latency (which to my knowledge is defined on a per-request basis). The latency per request may not matter for training but it does matter for inference if you want your model to be fast enough to interact with the world in real time or faster.

2TLW
Hm. Could you please reread my post? You're repeatedly stating assertions that I explicitly state and show are not the case. > Your point seems to be about throughput, not latency I gave an explicit example where a single inference is lower latency with pipelining here versus without. Hm. I think I understand where you seem to be misunderstanding. Let me try to explain a little more. > latency (which to my knowledge is defined on a per-request basis) The key here is that one "request" is composed of multiple requests. From the end user point of view, a single "request" means "a single full end-to-end inference". And the latency they care about is issuing the input data to getting the inference result out. But from the internal point of view, that single full end-to-end inference has multiple requests (essentially, "load weights for layer 0; run calculation on inputs and layer 0 weights to get layer 1 input; load weights for layer 1; run calculation on layer 0 output and layer 1 weights to get layer 2 input; etc, etc"). And you can reduce the latency of that one external request (the inference) by piplining multiple internal subrequests. You are absolutely correct in that the latency of each of the subrequests is not reduced - but the latency of the external request absolutely is reduced compared to if you didn't pipeline! (At least assuming the internal subrequests can be pipelined - which they can be in this case as I've repeatedly noted.)

Perhaps what you meant is that latency will be high but this isn't a problem as long as you have high throughput. That's basically true for training. But this post is about inference, where latency matters a lot more.

(It depends on the application of course, but the ZeRO Infinity approach can make your model so slow that you don't want to interact with it in real time, even at GPT-3 scale)

That would be interesting if true. I thought that pipelining doesn't help with latency. Can you expand?

Generically, pipelining increases throughput without lowering latency. Say you want to compute f(x) where f is a NN. Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That's why the latency to compute f(x) is high.

NB, GPT-3 used pipelining for training (in combination with model- and data parallelism) and still the large GPT-3 has h... (read more)

2TLW
To give a concrete example: Say each layer takes 10ms to process. The NN has 100 layers. It takes 40ms to round-trip weight data from the host (say it's on spinning rust or something). You can fit 5 layers worth of weights on a gpu, in addition to activation data / etc. On a GPU with a "sufficiently large" amount of memory, such that you can fit everything on-GPU, this will have 1.04s latency overall. 40ms to grab all of the weights into the GPU, then 1s to process. On a GPU, with no pipelining, loading five layers at a time then processing them, this will take 1.8 seconds latency overall. 40ms to load from disk, then 50 ms to process, for each group of 5 layers. On a GPU, with pipelining, this will take... 1.04s overall latency. t=0ms, start loading layer 1 weights. t=10ms, start loading layer 2 weights. ... t=40ms, start loading layer 5 weights & compute layer 1, t=50ms, start loading layer 6 weights & compute layer 2, etc. (Note that this has a max of 5 'active' sets of weights at once, like in the no-pipelining case.) (A better example would split this into request latency and bandwidth.) > Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That's why the latency to compute f(x) is high. To be clear: I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer's computation. I can be loading the NN weights for layer N+1 while I'm working on layer N. There's no dependency on the activations of the previous layer. > pipelining doesn't help with latency Let me give an example (incorrect) exchange that hopefully illustrates the issue. "You can never stream video from a remote server, because your server roundtrip is 100ms and you only have 20ms per frame". "You can pipeline requests" "...but I thought pipelining doesn't help with latency?" (This example is oversimpli
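The timing in that example can be reproduced with a few lines (a sketch; the numbers come from the comment above: 100 layers, 10 ms of compute per layer, 40 ms to fetch weights from the host, room for 5 layers' weights on the GPU):

```python
LAYERS, COMPUTE_MS, FETCH_MS, GROUP = 100, 10, 40, 5

# No pipelining: fetch a group of 5 layers' weights, compute those 5 layers, repeat.
no_pipeline_ms = (LAYERS // GROUP) * (FETCH_MS + GROUP * COMPUTE_MS)   # 20 * (40 + 50) = 1800

# Pipelined weight loading: load requests are issued ahead of time, so after the
# initial 40 ms the next layer's weights always arrive before the GPU finishes
# computing the current layer.
pipelined_ms = FETCH_MS + LAYERS * COMPUTE_MS                          # 40 + 1000 = 1040

print(no_pipeline_ms, pipelined_ms)   # 1800 1040 -- end-to-end latency of a single inference
```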
1SoerenMind
Perhaps what you meant is that latency will be high but this isn't a problem as long as you have high throughput. That's basically true for training. But this post is about inference, where latency matters a lot more. (It depends on the application of course, but the ZeRO Infinity approach can make your model so slow that you don't want to interact with it in real time, even at GPT-3 scale)

No, they don't. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference)

The motivation to make inference cheaper doesn't seem to be mentioned in the Switch Transformer paper or in the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn't seem that MoEs change the ratio of training to inference cost, except insofar as they're currently finicky to train.

But the glas

... (read more)
3gwern
I'm not sure what you mean. They refer all over the place to greater computational efficiency and the benefits of constant compute cost even as one scales up experts. And this was front and center in the original MoE paper emphasizing the cheapness of the forward pass and positioning it as an improvement on the GNMT NMT RNN Google Translate had just rolled out the year before or so (including benchmarking the actual internal Google Translate datasets), and which was probably a major TPUv1 user (judging from the % of RNN workload reported in the TPU paper). Training costs are important, of course, but a user like Google Translate, the customer of the MoE work, cares more about the deployment costs because they want to serve literally billions of users, while the training doesn't happen so often.

You may have better info, but I'm not sure I expect 1000x better serial speed than humans (at least not with innovations in the next decade). Latency is already a bottleneck in practice, despite efforts to reduce it. Width-wise parallelism has its limits and depth- or data-wise parallelism doesn't improve latency. For example, GPT-3 already has high latency compared to smaller models and it won't help if you make it 10^3x or 10^6x bigger.

As Steven noted, your $1/hour number is cheaper than my numbers and probably more realistic. That makes a significant difference.

I agree that transformative impact is possible once we've built enough GPUs and connected them up into many, many new supercomputers bigger than the ones we have today. In a <=10 year timeline scenario, this seems like a bottleneck. But maybe not with longer timelines.

you're missing all the possibilities of a 'merely human-level' AI. It can be parallelized, scaled up and down (both in instances and parameters), ultra-reliable, immortal, consistently improved by new training datasets, low-latency, ultimately amortizes to zero capital investment

I agree this post could benefit from discussing the advantages of silicon-based intelligence, thanks for bringing them up. I'd add that (scaled-up versions of current) ML systems have disadvantages compared to humans, such as lacking actuators and being cumbersome to fine-t... (read more)

I broadly agree with your first point, that inference can be made more efficient. Though we may have different views on how much?

Of course, both inference and training become more efficient and I'm not sure if the ratio between them is changing over time.

As I mentioned there are also reasons why inference could become more expensive than in the numbers I gave. Given this uncertainty, my median guess is that the cost of inference will continue to exceed the cost of training (averaged across the whole economy).

I don't think sparse (mixture of expert) mode... (read more)

4gwern
No, they don't. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference). They're generally more challenging to train because of the discrete gating, imbalanced experts, and sheer size - the Switch paper discusses the problems, and even the original Shazeer MoE emphasizes all of the challenges in training a MoE compared to a small dense model. Now, if you solve those problems (as Switch does), then yes, the cheaper inference would also make cheaper training (as long as you don't have to do too much more training to compensate for the remaining problems), and that is an additional justification for Switch. But the primary motivation for researching MoE NMT etc has always been that it'd be a lot more economical to deploy at scale after training. Those results are sparse->dense, so they are not necessarily relevant (I would be thinking more applying distillation to the original MoE and distill each expert - the MoE is what you want for deployment at scale anyway, that's the point!). But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model. Like I said, the most reliable way to a small powerful model is through a big slow model. Yeah, we don't know what's going on there. They've mentioned further finetuning of the models, but no details. They decline to specify even what the parameter counts are, hence EAI needing to reverse-engineer guesses from their benchmarks. (Perhaps the small models are now distilled models? At least early on, people were quite contemptuous of the small models, but these days people find they can be quite handy. Did we just underrate them initially, or did they actually get better?) They have an 'instruction' series they've never explained what it is (probably something like T0/FLAN?). Paul's estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise,
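To illustrate the "constant compute cost as you add experts" point, here is a toy sketch of top-k routing (my illustration, not the Shazeer/Switch code; a naive per-token loop with k=1 as in Switch): per-token FLOPs depend on the size of one expert, not on how many experts exist.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=32, k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: [n_tokens, d_model]
        top = self.router(x).topk(self.k, dim=-1)      # route each token to its top-k experts
        gates = torch.softmax(top.values, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for j in range(self.k):
                expert = self.experts[int(top.indices[t, j])]
                out[t] += gates[t, j] * expert(x[t])   # only k experts run per token
        return out

# Doubling n_experts doubles the parameter count but leaves per-token compute unchanged,
# which is why the forward pass (inference) stays cheap relative to a dense model of the same size.
```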

Our default expectation about large neural networks should be that we will understand them in roughly the same ways that we understand biological brains, except where we have specific reasons to think otherwise.

Here's a relevant difference: In the brain, nearby neurons can communicate with lower cost and latency than far-apart neurons. This could encourage nearby neurons to form modules to reduce the number of connections needed in the brain. But this is not the case for standard artificial architectures where layers are often fully connected or similar.

Some minor feedback points: Just from reading the abstract and intro, this could be read as a non-sequitur: "It limits our ability to mitigate short-term harms from NLP deployments". Also, calling something a "short-term" problem doesn't seem necessary and it may sound like you think the problem is not very important.

1Sam Bowman
Thanks! Tentative rewrite for the next revision: I tried to stick to 'present-day' over 'short-term', but missed this old bit of draft text in the abstract. 

One thing I dislike about the 'punctuation outside quotes' view is that it treats "!" and "?" differently than a full stop.

"This is an exclamation"!
"Is this a question"?

Seems less natural to me than:

"This is an exclamation!"
"Is this a question?"

I think I have this intuition because it is part of the quote that it is an exclamation or a question.

Yes I completely agree. My point is that the fine-tuned version didn't have better final coding performance than the version trained only on code. I also agree that fine-tuning will probably improve performance on the specific tasks we fine-tune on. 

Most importantly I expect them to be fine-tuned on various things (perhaps you can bundle this under "higher-quality data"). Think of how Codex and Copilot are much better than vanilla GPT-3 at coding. That's the power of fine-tuning / data quality.


Fine-tuning GPT-3 on code had little benefit compared to training from scratch:

Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strat

... (read more)
2Daniel Kokotajlo
Huh.... I coulda' sworn they said Codex was pre-trained on internet text as well as on code, and that it was in particular a version of GPT-3, the 12B param version... The paper seems to support this interpretation when you add in more context to the quote you pulled: Note the bits I bolded. My interpretation is that Codex is indeed a fine-tuned version of GPT-3-12B; the thing they found surprising was that there wasn't much "transfer learning" from text to code, in the sense that (when they did smaller-scale experiments) models trained from scratch reached the same level of performance. So if models trained from scratch reached the same level of performance, why fine-tune from GPT-3? Answer: Because it converges more quickly that way. Saves compute.

2023

The multimodal transformers are now even bigger; the biggest are about half a trillion parameters [...] The hype is insane now


This part surprised me. Half a trillion is only 3x bigger than GPT-3. Do you expect this to make a big difference? (Perhaps in combination with better data?). I wouldn't, given that GPT-3 was >100x bigger than GPT-2. 

Maybe you're expecting multimodality to help? It's possible, but worth keeping in mind that according to some rumors, Google's multimodal model already has on the order of 100B parameters.

On the other hand, ... (read more)

3Daniel Kokotajlo
I am not confident in that part. I was imagining that they would be "only" 3x bigger or so, but that they'd be trained on much higher-quality data (incl. multimodal) and also trained for longer/more data, since corps would be not optimizing purely for training-compute-optimal performance but instead worrying a bit more about inference-time compute costs. Most importantly I expect them to be fine-tuned on various things (perhaps you can bundle this under "higher-quality data"). Think of how Codex and Copilot are much better than vanilla GPT-3 at coding. That's the power of fine-tuning / data quality. Also, 3x bigger than GPT-3 is still, like, 40x bigger than Codex, and Codex is pretty impressive. So I expect scale will be contributing some amount to the performance gains for things like code and image and video, albeit not so much for text since GPT-3-175B was already pretty big. If Google's multimodal model is already 100B parameters big, then I look forward to seeing its performance! Is it worse than GPT-3? If so, that would be evidence against my forecast, though we still have two years to go...

In my experience, this worked extremely well. But that was thanks to really good management and coordination which would've been hard in other groups I used to be part of.

This wouldn't explain the recent reduction in R because Delta has already been dominant for a while.

The R0 of Delta is ca. 2x the R0 of the Wuhan strain and this doubles the effect of new immunity on R.

In fact, the ONS data gives me that ~7% of Scotland had Delta, so that's a reduction in R of R0 * 7% = 6 * 7% = 0.42 just from very recent and sudden natural immunity.

That's not [edited: forgot to say "not"] enough to explain everything, but there are more factors: 

1) Heterogenous immunity: the first people to become immune are often high-risk people who go to superspreader events etc. 

2) Vaccinations also w... (read more)

Another heuristic is to choose the option where you're most likely to do exceptionally well (cf. heavy-tailed impact, etc.). Among other things, this pushes you to optimize for the timelines scenario where you can be very successful, and to do the job with the best personal fit.

Some standard ones like masks, but not at all times. They probably were in close or indoor contact with infected people without precautions.

  1. FWIW I've seen multiple double-mRNA-vaccinated people in my social circles who still got infected with delta (and in one case infected someone else who was double vaccinated). Two of the cases I know were symptomatic (but mild).
3Ethan Perez
I also know of 5+ cases of symptomatic COVID among double-vaxxed people in the bay area (including one instance where most people in a group house of ~6 people got covid). These are also relatively healthy individuals in their 20s
1brp
Immune response is generally associated with age and lifestyle. What can you tell us about those factors?
2Bird Concept
Do you know which, if any, risk-reducing precautions they were following? 
1Lukas Finnveden
How many asymptomatic? And how did people know of them?

According to one expert, the immune system essentially makes bets on how often it will face a given virus and how the virus will mutate in the future:

https://science.sciencemag.org/content/372/6549/1392

By that logic, being challenged more often means that the immune system should have a stronger and longer-lasting response:

The immune system treats any new exposure—be it infection or vaccination—with a cost-benefit threat analysis for the magnitude of immunological memory to generate and maintain. There are resource-commitment decisions: more cells and more

... (read more)
3Bird Concept
I'll pay $50 for this answer, will message you for payment details. 

Suggestion for content 2: relationship to invariant causal prediction

Lots of people in ML these days seem excited about getting out-of-distribution generalization with techniques like invariant causal prediction. See e.g. this, this, section 5.2 here and related background. This literature seems promising, but it's missing from discussions about inner alignment. It seems useful to discuss how far it can go in helping solve inner alignment.
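For readers less familiar with that literature, here is a short sketch of the IRMv1-style penalty used in invariant-causal-prediction-flavoured methods (my summary of the standard formulation, e.g. Arjovsky et al.'s Invariant Risk Minimization; details vary across papers): per training environment, penalize the gradient of the risk with respect to a fixed dummy classifier scale.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    """Squared gradient of the environment risk w.r.t. a dummy scale w = 1.0."""
    w = torch.tensor(1.0, requires_grad=True)
    loss = F.cross_entropy(logits * w, y)
    grad = torch.autograd.grad(loss, [w], create_graph=True)[0]
    return grad.pow(2)

def irm_objective(model, env_batches, lam=1.0):
    """env_batches: list of (x, y) pairs, one batch per training environment."""
    risk, penalty = 0.0, 0.0
    for x, y in env_batches:
        logits = model(x)
        risk = risk + F.cross_entropy(logits, y)
        penalty = penalty + irm_penalty(logits, y)
    # A low penalty means the same featurized predictor is simultaneously near-optimal
    # in every environment, which is the invariance condition these methods aim for.
    return risk + lam * penalty
```

Whether a penalty like this scales to the kinds of distribution shift relevant to inner alignment is exactly the open question the comment raises.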
