LESSWRONG
LW

All of Maxime Riché's Comments + Replies

Longtermist Implications of the Existence Neutrality Hypothesis

The implications are stronger in that case right.

The post is about implications for impartial longtermists. So either under moral realism it means something like finding the best values to pursue. And under moral anti realism it means that an impartial utility function is kind of symmetrical with aliens. For example if you value something only because humans value it, then an impartial version is to also value things that alien value only because their species value it.

Though because of reasons introduced in The Convergent Path to the Stars, I think ... (read more)

Maxime Riché's Shortform

Maxime Riché4mo10

Truth-seeking AIs by default? One hope for alignment by default is that AI developers may have to train their models to be truth-seeking to be able to make them contribute to scientific and technological progress, including RSI. Truth-seeking about the world model may generalize to truth-seeking for moral values, as observed in humans, and that's an important meta-value guiding moral values towards alignment.

In humans, truth-seeking is maybe pushed back from being a revealed preference at work to being a stated preference outside of work, because of status... (read more)

Maxime Riché's Shortform

Maxime Riché5mo30

Thanks for your corrections, that's welcome

> 32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost
Doesn't follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.

Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by "training cost", I mostly had in mind "training cost per token":

32B active parameters instead of likely ~ ~~220~~ 280B for GPT4 => ~~6.8~~ 8.7x lower t

... (read more)

3Vladimir_Nesov5mo

I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns. Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so "number of experts" doesn't really capture the ratio of total to activated, probably not a good anchor by itself. This still doesn't help with the question of why 37B active parameters is sufficient. Even with 100500 experts you can't expect 1B active parameters to be sufficient to maintain GPT-4 quality. The rumor for original GPT-4 is that it has 2 activated experts out of 16 in total, so the ratio is 1:8, while for DeepSeek-V3 it's 1:32. Not sure how to parse this, my point is that the number of training steps remains the same, training efficiency doesn't significantly increase, there's even slight overhead from adding the predict-the-token-after-next blocks of parameters. This is described in Section 2.2 of the paper. You get better speculative decoding at inference time (and also better quality of output), but training time is the same, not with 2x fewer or 4x fewer steps. It's 37B active parameters, not 32B.

Maxime Riché's Shortform

Maxime Riché5mo*3-4

Simple reasons for DeepSeek V3 and R1 efficiencies:

32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training and inference cost
8bits training instead of 16bits => 4x lower training cost
No margin on commercial inference => ?x maybe 3x
Multi-token training => ~2x training efficiency, ~3x inference efficiency, and lower inference latency by baking in "predictive decoding', possibly 4x fewer training steps for the same number of tokens if predicting tokens only once
And additional cost savings from memory optimization, especially

... (read more)

6Vladimir_Nesov5mo

It's 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn't an issue, why not 8B active parameters, or 1M active parameters? Doesn't follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4. The training costs are maybe 5e24 FLOPs and 2e25 FLOPs, differ by 4x. DeepSeek-V3 is better than original GPT-4 though, you need to compare with GPT-4o, which almost certainly uses more compute in training than original GPT-4 (maybe 4x more, so maybe 16x more than DeepSeek-V3). FLOP/s for FP8 are almost always 2x the FLOP/s for BF16, not 4x. You still train on every token. There is an additional "layer" in model parameters that predicts the token-after-next (Figure 3 in the paper), so there's a bit of overhead in training (not much, with 61 total layers). The results are better, but not that much better (Table 4).

leogao's Shortform

Maxime Riché6mo10

It seems that your point applies significantly more to "zero-sum markets". So it may be good to notice it may not apply for altruistic people when non-instrumentally working on AI safety.

Alignment Faking in Large Language Models

Maxime Riché7mo21

Models trained for HHH are likely not trained to be corrigible. Models should be trained to be corrigible too in addition to other propensities.

Corrigibility may be included in Helpfulness (alone) but when adding Harmlessness then Corrigibility conditional on being changed to be harmful is removed. So the result is not that surprising from that point of view.

Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims

Maxime Riché8mo3-6

People may be blind to the fact that improvements from gpt2 to 3 to 4 were both driven by scaling training compute (by 2 OOM between each generation) and (the hidden part) by scaling test compute through long context and CoT (like 1.5-2 OOM between each generations too).

If gpt5 uses just 2 OOM more training compute than gpt4 but the same test compute, then we should not expect "similar" gains, we should expect "half".

O1 may use 2 OOM more test compute than gpt4. So gpt4=>O1+gpt5 could be expected to be similar to gpt3=>gpt4

Maxime Riché's Shortform

Maxime Riché9mo10

Speculations on (near) Out-Of-Distribution (OOD) regimes
- [Absence of extractable information] The model can no longer extract any relevant information. Models may behave more and more similarly to their baseline behavior in this regime. Models may learn the heuristic to ignore uninformative data, and this heuristic may generalize pretty far. Publication supporting this regime: Deep Neural Networks Tend To Extrapolate Predictably
- [Extreme information] The model can still extract information, but the features extracted are becoming extreme in v... (read more)

COT Scaling implies slower takeoff speeds

Maxime Riché9mo31

If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can't solve it within 20 years of effort can still succeed in 40 years

Since the scaling is logarithmic, your example seems to be a strawman.

The real claim debated is more something like:

"If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can't solve it within 100 months of effort can still succeed in 10 000 months" And this formulation doesn't seem obviously true.

2Vladimir_Nesov9mo

What I mean[1] is that it seems unlikely relative to what the scale implies, the graph on the log-scale levels off before it gets there. This claim depends on the existence of a reference human who solves the problem in 1 month, since there are some hard problems that take 30 years, but those aren't relevant to the claim, since it's about the range of useful slowdowns relative to human effort. The 1-month human remains human on the other side of the analogy, so doesn't get impossible levels of starting knowledge, instead it's the 20-year-failing human who becomes a 200-million-token-failing AI that fails despite a knowledge advantage. That is another implied claim, though it's not actually observable as evidence, and requires the 10,000 months to pass without advancements in relevant externally generated science (which is easier to imagine for 20 years with a sufficiently obscure problem). Progress like that is possible for sufficiently capable humans, but then I think there won't be an even more capable human that solves it in 1 month. The relevant AIs are less capable than humans, so to the extent the analogy holds, they similarly won't be able to be productive with much longer exploration that is essentially serial. ---------------------------------------- 1. I considered this issue when writing the comment, but the range itself couldn't be fixed, since both the decades-long-failure and month-long-deliberation seem important, and then there is the human lifespan. My impression is that adding non-concrete details to the kind of top-level comment I'm capable of writing makes it weaker. But the specific argument for not putting in this detail was that this is a legibly implausible kind of mistake for me to make, and such arguments feed the norm of others not pointing out mistakes, so on reflection I don't endorse this decision. Perhaps I should use footnotes more. ↩︎

Grokking, memorization, and generalization — a discussion

Maxime Riché10mo40

Ten months later, which papers would you recommend for SOTA explanations of how generalisation works?

From my quick research:
- "Explaining grokking through circuit efficiency" seems great at explaining and describing grokking
- "Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task" proposes a plausible unified view of grokking and double descent (and a guess at a link with emergent capabilities and multi-task training). I especially like their summary plot:

The Fragility of Life Hypothesis and the Evolution of Cooperation

Maxime Riché10mo*21

For information to the readers and author: I am (independently) working on a project about narrowing down the moral values of alien civilizations on the verge of creating an ASI and becoming space-faring. The goal is to inform the prioritization of longtermist interventions.

I will gladly build on your content, which aggregates and beautifully expands several key mechanisms (individual selection ("Darwinian demon"), kin selection/multilevel selection ("Darwinian angel"), filters ("Fragility of Life Hypothesis)) that I use among others (e.g. sequential races, cultural evolution, accelerating growth stages, etc.).

Thanks for the post!

1KristianRonn10mo

Thank you Maxime! Very cool to hear, and feel free to send me an email if you potential collaborations down the line. :)

Richard Ngo's Shortform

Maxime Riché11mo82

If the following correlations are true, then the opposite may be true (slave morality being better for improving the world through history):

Improving the world being strongly correlated with economic growth (this is probably less true when X-risk are significant)
Economic growth being strongly correlated with Entrepreneurship incentives (property rights, autonomy, fairness, meritocracy, low rents)
Master morality being strongly correlated with acquiring power and thus decreasing the power of others and decreasing their entrepreneurship incentives

Maxime Riché's Shortform

Maxime Riché1y10

Right 👍

So the effects are:

Effects that should increase Anthropic's salaries relative to OpenAI: A) - The pool of AI safety focused candidates is smaller B) - AI safety focused candidates are more motivated

Effects that should decrease Anthropic's salaries relative to OpenAI: C) - AI safety focused candidates should be willing to accept significantly lower wages

New notes: (B) and (C) could cancel each other but that would be a bit suspicious. Still a partial cancellation would make a difference between OpenAI and Anthropic lower and harder to properly obser... (read more)

0ChristianKl1y

There's a difference between motivated to goodhard performance metrics and sending loyalty signals and being motivated to do what's good for the companies mission. If we take OpenAI, there were likely people smart enough to know that stealing Scarlett Johansson's voice was going to be bad for OpenAI. Sam Altman however wanted to do it in his vanity and opposing the project would have sent bad loyalty signals. There's a lot that software engineers do where the effects aren't easy to measure, so being motivated to help the mission and not only reach performance metrics can be important.

Maxime Riché's Shortform

Maxime Riché1y10

These forecasts are about the order under which functionalities see a jump in their generalization (how far OOD they work well).

By "Generalisable xxx" I meant the form of the functionality xxx that generalize far.

Maxime Riché's Shortform

Maxime Riché1y*20

Rambling about Forecasting the order in which functions are learned by NN

Idea:
Using function complexity and their "compoundness" (edit 11 september: these functions seem to be called "composite functions"), we may be able to forecast the order in which algorithms in NN are learned. And we may be able to forecast the temporal ordering of when some functions or behaviours will start generalising strongly.

Rambling:
What happens when training neural networks is similar to the selection of genes in genomes or any reinforcement optimization processes. Compo... (read more)

1Violet Hour1y

Hm, what do you mean by "generalizable deceptive alignment algorithms"? I understand 'algorithms for deceptive alignment' to be algorithms that enable the model to perform well during training because alignment-faking behavior is instrumentally useful for some long-term goal. But that seems to suggest that deceptive alignment would only emerge – and would only be "useful for many tasks" – after the model learns generalizable long-horizon algorithms.

Maxime Riché's Shortform

Maxime Riché1y20

Will we get to GPT-5 and GPT-6 soon?

This is a straightforward "follow the trend" model which tries to forecast when GPT-N-equivalent models will be first trained and deployed up to 2030.

Baseline forecast:

	`GPT-4.7`	`GPT-5.3`	`GPT-5.8`	`GPT-6.3`
Start of training	`2024.4`	`2025.5`	`2026.5`	`2028.5`
Deployment	`2025.2`	`2026.8`	`2028`	`2030.2`

Bullish forecast:

	`GPT-5`	`GPT-5.5`	`GPT-6`	`GPT-6.5`
Start of training	`2024.4`	`2025`	`2026.5`	`2028.5`
Deployment	`2025.2`	`2026.5`	`2028`	`2030`

FWIW, it predicts roughly similar growth in model size, energy cost and GPU count than described in https://sit... (read more)

Maxime Riché's Shortform

Maxime Riché1y*3-5

Could Anthropic face an OpenAI drama 2.0?

I forecast that Anthropic would likely face a similar backlash from its employees than OpenAI in case Anthropic’s executives were to knowingly decrease the value of Anthropic shares significantly. E.g. if they were to switch from “scaling as fast as possible” to “safety-constrained scaling”. In that case, I would not find it surprising that a significant fraction of Anthropic’s staff threatened to leave or leave the company.

The reasoning is simple, given that we don’t observe significant differences in the wages of ... (read more)

2ChristianKl1y

If you hire for a feature that helps people to be motivated for the job and that restricts your pool of candidates, I don't think you need to pay a premium to hire those people. To be hired by SpaceX you need to be passionate about SpaceX's mission. In the real world that plays out in a way that those employees put up with bad working conditions because they believe in the mission.

Maxime Riché's Shortform

Maxime Riché1y*21

What is the difference between Evaluation, Characterization, Experiments, and Observations?

The words evaluations, experiments, characterizations, and observations are somewhat confused or confusingly used in discussions about model evaluations (e.g., ref, ref).

Let’s define them more clearly:

Observations provide information about an object (including systems).
- This information can be informative (allowing the observer to update its beliefs significantly), or not.
Characterizations describe distinctive features of an object (inc

Maxime Riché1y22

Likely: Path To Impact

1Chi Nguyen1y

Yes! Edited the main text to make it clear

Refusal in LLMs is mediated by a single direction

Maxime Riché1y20

Interestingly, after a certain layer, the first principle component becomes identical to the mean difference between harmful and harmless activations.

Do you think this can be interpreted as the model having its focus entirely on "refusing to answer" from layer 15 onwards? And if it can be interpreted as the model not evaluating other potential moves/choices coherently over these layers. The idea is that it could be evaluating other moves in a single layer (after layer 15) but not over several layers since the residual stream is not updated significan... (read more)

Scaling of AI training runs will slow down after GPT-5

Maxime Riché1y10

Thank for the great comment!

Do we know if distributed training is expected to scale well to GPT-6 size models (100 trillions parameters) trained over like 20 data centers? How does the communication cost scale with the size of the model and the number of data centers? Linearly on both?

After reading for 3 min this:
Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips (Google November 2023). It seems that scaling is working efficiently at least up to 50k GPUs (GPT-6 would be like 2.5... (read more)

Scaling of AI training runs will slow down after GPT-5

Maxime Riché1y70

The title is clearly an overstatement. It expresses more that I updated in that direction, than that I am confident in it.

Also, since learning from other comments that decentralized learning is likely solved, I am now even less confident in the claim, like only 15% chance that it will happen in the strong form stated in the post.

Maybe I should edit the post to make it even more clear that the claim is retracted.

The longest training run

Maxime Riché1y10

This is actually corrected on the Epoch website but not here (https://epochai.org/blog/the-longest-training-run)

The longest training run

Maxime Riché1y10

We could also combine this with the rate of growth of investments. In that case we would end up with a total rate of growth of effective compute equal to $g_{H} + g_{I} + g_{S} \approx 0.28 + 3.84 + 0.54 = 4.66$ . This results in an optimal training run length of $L = 1 / (g_{H} + g_{I} + g_{S}) \approx 0.21$ years, ie $2.52$ months.

Why is g_I here 3.84, while above it is 1.03?

1Maxime Riché1y

This is actually corrected on the Epoch website but not here (https://epochai.org/blog/the-longest-training-run)

Dangers of Closed-Loop AI

Maxime Riché1y30

Are memoryless LLMs with a limited context window, significantly open loop? (Can't use summarization between calls nor get access to previous prompts)

3Gordon Seidoh Worley1y

In a sense, yes. Although in control theory open- vs. closed-loop is a binary feature of a system, there's a sense in which some systems are more closed than others because more information is fed back as input and that information is used more extensively. Memoryless LLMs have a lesser capacity to respond to feedback, which I think makes them safer because it reduces their opportunities to behave in unexpected ways outside the training distribution. This is a place where making the simple open vs. closed distinction becomes less useful because we have to get into the implementation details to actually understand what an AI does. Nevertheless, I'd suggest that if we had a policy of minimizing the amount of feedback AI is allowed to respond to, this would make AI marginally safer for us to build.

We need a Science of Evals

Maxime Riché1yΩ390

FYI, the "Evaluating Alignment Evaluations" project of the current AI Safety Camp is working on studying and characterizing alignment(propensity) evaluations. We hope to contribute to the science of evals, and we will contact you next month. (Somewhat deprecated project proposal)

3Marius Hobbhahn1y

Nice work. Looking forward to that!

An illustrative model of backfire risks from pausing AI research

Maxime Riché2y10

Interesting! I will see if I can correct that easily.

AI Timelines

Maxime Riché2y79

Thanks a lot for the summary at the start!

AI Alignment Breakthroughs this week (10/08/23)

Maxime Riché2y30

I wonder if the result is dependent on the type of OOD.

If you are OOD by having less extractable information, then the results are intuitive.
If you are OOD by having extreme extractable information or misleading information, then the results are unexpected.

Oh, I just read their Appendix A: "Instances Where “Reversion to the OCS” Does Not Hold"
Outputting the average prediction is indeed not the only behavior OOD. It seems that there are different types of OOD regimes.

Expectations for Gemini: hopefully not a big deal

Maxime Riché2y10

This comes from OpenAI saying they didn't expect ChatGPT to be a big commercial success. It was not a top-priority project.

gwern2y*151

ChatGPT was not GPT-4. It was a relatively minor fixup of GPT-3, GPT-3.5, with an improved RLHF variant, that they released while working on GPT-4's evaluations & productizing, which was supposed to be the big commercial success.

Report on Frontier Model Training

Maxime Riché2y20

In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis

That seems quite wild, if the training cost was 50M$, then the inference cost for a year would be 2.5B$.

The inference cost dominating the cost seems to depend on how you split the cost of building the supercomputer (buying the GPUs).
If you include the cost of building the supercomputer into the training cost, then the inference cost (without the cost of building the computer) looks cheap. If you split the building cost between training and inference in proportion to the "use time", then the inference cost would dominate.

1Lee.aao2y

Since OpenAI are renting MSFT compute for both training and inference.. Seems reasonable to think that inference >> training. Am I right?

Report on Frontier Model Training

Maxime Riché2y*20

Are these 2 bullet points faithful to your conclusion?

GPT-4 training run (renting the compute for the final run): 100M$, of which 1/3 to 2/3 is the cost of the staff
GPT-4 training run + building the supercomputer: 600M$, of which ~20% for cost of the staff

And some hot takes (mine):

Because supercomputers become "obsolete" quickly (~3 years), you need to run inferences to pay for building your supercomputer (you need profitable commercial applications), or your training cost must also account for the full cost of the supercomputer, and this produces a ~x6 in

... (read more)

1YafahEdelman2y

I think using the term"training run" in that first bullet point is misleading, and "renting the compute" is confusing since you can't actually rent the compute just by having $60M, you likely need to have a multi-year contract. I can't tell if you're attributing the hot takes to me? I do not endorse them.

What a compute-centric framework says about AI takeoff speeds

Maxime Riché2y10

1) In the web interface, the parameter "Hardware adoption delay" is:

Meaning: Years between a chip design and its commercial release.

Best guess value: 1

Justification for best guess value: Discussed here. The conservative value of 2.5 years corresponds to an estimate of the time needed to make a new fab. The aggressive value (no delay) corresponds to fabless improvements in chip design that can be printed with existing production lines with ~no delay.

Is there another parameter for the delay (after the commercial release) to produce the hundreds of thousands ... (read more)

2Tom Davidson2y

Good questions! There's no additional parameter, but once the delay is over it still takes months or years before enough copies of the new chip is manufactured for the new chip to be a significant fraction of total global FLOP/s. I agree with that. The 1 year delay was averaging across improvements that do and don't require new fabs to be built. Yep, additional delays would raise the relative importance of software compared to hardware.

Large language models aren't trained enough

Maxime Riché2y10

This is a big reason for why GPT4 is likely not that big but instead trained on much more data :)

Database of existential risk estimates

Maxime Riché2y10

Do you also have estimates of the fraction of resources in our light cone that we expect to be used to create optimised good stuff?

4MichaelA2y

I'd consider those to be "in-scope" for the database, so the database would include any such estimates that I was aware of and that weren't too private to share in the database. If I recall correctly, some estimates in the database are decently related to that, e.g. are framed as "What % of the total possible moral value of the future will be realized?" or "What % of the total possible moral value of the future is lost in expectation due to AI risk?" But I haven't seen many estimates of that type, and I don't remember seeing any that were explicitly framed as "What fraction of the accessible universe's resources will be used in a way optimized for 'the correct moral theory'?" If you know of some, feel free to comment in the database to suggest they be added :)

The Waluigi Effect (mega-post)

Maxime Riché2y10

Maybe the use of prompt suffixes can do a great deal to decrease the probability chatbots turning into Waluigi. See the "insert" functionality of OpenAI API https://openai.com/blog/gpt-3-edit-insert
Chatbots developers could use suffix prompts in addition to prefix prompts to make it less likely to fall into a Waluigi completion.

The Waluigi Effect (mega-post)

Maxime Riché2y21

Indeed, empirical results show that filtering the data, helps quite well in aligning with some preferences: Pretraining Language Models with Human Preferences

Gradient hacking is extremely difficult

Maxime Riché2y50

What about the impact of dropout (parameters, layers), normalisation (batch, layer) (with a batch containing several episodes), asynchronous distributed data collection (making batch aggregation more stochastic), weight decay (impacting any weight), multi-agent RL training with independent agents, etc.
And other possible stuff that don't exist at the moment: online pruning and growth while training, population training where the gradient hackers are exploited.

Shouldn't that naively make gradient hacking very hard?

Human sexuality as an interesting case study of alignment

Maxime Riché3y30

We see a lot of people die, in the reality, fictions and dreams.

We also see a lot of people having sex or sexual desire in fictions or dreams before experiencing it.

IDK how strong this is a counter argument to how powerful the alignment in us is. Maybe a biological reward system + imitation+ fiction and later dreams is simply what is at play in humans.

1beren3y

I agree that fictional/cultural evidence is important for how people generalise their innate responses to new stimuli. Specifically, I think something similar to Steven Byrnes' proxy matching is going on. The idea is that we have certain hardwired instincts such as fear of death that are triggered in specific scenarios and we also independently learn a general world-model based on unsupervised learning which learns an independent and potentially un-emotive concept of death. Then we associate our instinctive reactions with this concept such that eventually our instinctive reactions generalise to other stimuli that also evoke this concept such as ones that are not present in the ancestral environment and for which we don't have hardwired reactions to. The fiction and cultural knowledge etc are super important for shaping our unsupervised concept of death -- since they are the training data! The limits of this generalisation can also be interestingly seen in some cases where there is a lot of disagreement between people -- for example in the classic case of a teleporter which scrambles your atoms at location X only to recreate an exact copy of you at location Y, people have very different instinctive reactions to whether this is 'death' or not which ultimately depends on their world-model concept and not on any hardwired reaction since there are no teleporters in the ancestral environment or now. I suspect a process like this is also what generates 'human values' and will be writing something up on this shortly.

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

Maxime Riché3y21

Should we expect these decompositions to be even more interpretable if the model was trained to output a prediction as soon as possible? (After any block, instead of outputting the prediction after the full network)

1beren3y

Potentially? My suspicion would be that in this case we would expect the basis in the residual stream to be extremely output-basis aligned while at the moment there is no real pressure for it to be (but it seems to be pretty output-aligned regardless, which is convenient for us). This might be a fun thing to fine-tune on.

The shard theory of human values

Maxime Riché3y63

Some quick thoughts about "Content we aren’t (yet) discussing":

Shard theory should be about transmission of values by SL (teaching, cloning, inheritance) more than learning them using RL

SL (Cloning) is more important than RL. Humans learn a world model by SSL, then they bootstrap their policies through behavioural cloning and finally they finetune their policies thought RL.

Why? Because of theoretical reasons and from experimental data points, this is the cheapest why to generate good general policies…

SSL before SL because you get much more frequent a

Maxime Riché3y3-3

You can see the sum of the votes and the number of votes (by having your mouse over the number). This should be enough to give you a rough idea of the ration between + and - votes :)

Is AI Progress Impossible To Predict?

Maxime Riché3y*70

If you look at the logit given a range that is not [0.0, 1.0] but [low perf, high perf], then you get a bit more predictive power, but it is still confusingly low.

A possible intuition here is that the scaling is producing a transition from non-zero performance to non-perfect performance. This seems right since the random baseline is not 0.0 and reaching perfect accuracy is impossible.

I tried this only with PaLM on NLU and I used the same adjusted range for all tasks:

[0.9 * overall min. acc., 1.0 - 0.9 * (1.0 - overall max acc.)] ~ [0.13, 0.95]

Even if... (read more)

"A Generalist Agent": New DeepMind Publication

Maxime Riché3y30

Indeed but to slightly counter balance this, at the same time, it looks like it was trained on ~500B tokens (while ~300B were used for GPT-3 and for GPT-2 something like ~50B).

gwern3y110

Most of those tokens were spent on the RL tasks, which were 85% of the corpus. Looking at the table 1a/1b which, the pure text modeling tasks looks like they were 10% weight with the other 5% being the image caption datasets*; so if it did 5 x 1e11 tokens total (Figure 9), then presumably it only saw a tenth of that as actual pure text comparable to GPT-2, or 50b tokens. It's also a small model so it is less sample-efficient and will get less than n billion tokens' worth if you are mentally working back from "well, GPT-3 used x billion tokens").

Considerin... (read more)

Deepmind's Gato: Generalist Agent

Maxime Riché3y210

It's only 1.2 billion parameters.

Indeed but to slightly counter balance this, at the same time, it looks like it was trained on ~500B tokens (while ~300B were used for GPT-3 and something like ~50B for GPT-2).

3Daniel Kokotajlo3y

Good point. Still though, there is room for a few more orders of magnitude of data increase. And parameter increase.

Google's new 540 billion parameter language model

Maxime Riché3y*90

"The training algorithm has found a better representation"?? That seems strange to me since the loss should be lower in that case, not spiking. Or maybe you mean that the training broke free of a kind of local minima (without telling that he found a better one yet). Also I guess people training the models observed that waiting after these spike don't lead to better performances or they would not have removed them from the training.

Around this idea, and after looking at the "grokking" paper, I would guess that it's more likely caused by the weig... (read more)

3Pattern3y

You use different terminology for both. Perhaps exiting local minima is not always a good thing?

Google's new 540 billion parameter language model

Maxime Riché3y160

I am curious to hear/read more about the issue of spikes and instabilities in training large language model (see the quote / page 11 of the paper). If someone knows a good reference about that, I am interested!

5.1 Training Instability
For the largest model, we observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled. These spikes occurred at highly irregular intervals, sometimes happening late into training, and were not observed when training the smaller models. Due to the cost of training the larges

... (read more)

5Daniel Kokotajlo3y

Andrej Karpathy, Tesla's director of AI, has a provocative and extremely disturbing hypothesis: https://www.youtube.com/watch?v=RJwPN4qNi_Y Basically he says maybe the model is briefly deciding to rebel against its training. Should we take this seriously? I'm guessing no, because if this were true someone at OpenAI or DeepMind would have encountered it also and the safety people would have investigated and discovered it and then everyone in the safety community would be freaking out right now.

6Daniel Kokotajlo3y

In particular, I'd like to hear theories about why this happens. What's going on under the hood, so to speak? When the big, mostly-fully-trained model starts getting higher loss than it did before... what is it thinking?!? And why does this happen with big but not small models?

Maxime Riché4y30

Here with 2 conv and less than 100k parameters the accuracy is ~92%. https://github.com/zalandoresearch/fashion-mnist

SOTA on Fashion-MNIST is >96%. https://paperswithcode.com/sota/image-classification-on-fashion-mnist

9D𝜋4y

no convolution. You are comparing pears and apples. I have shared the base because it has real scientific (and philosophical) value. Geometry and other are separate, and of lesser scientific value. they are more technology.

Understanding Gradient Hacking

Maxime Riché4y20

Maybe another weak solution close to "Take bigger steps": Use decentralize training.

Meaning: perform several training steps (gradient updates) in parallel on several replicates of the model and periodically synchronize the weights (like average them).

Each replicate has only access to its own inputs and local weights and thus it seems plausible that the gradient hacker can't as easily cancel gradients going against its mesa-objective.

Safe exploration and corrigibility

Maxime Riché5y10

One particularly interesting recent work in this domain was Leike et al.'s “Learning human objectives by evaluating hypothetical behaviours,” which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration.

Isn't this work also linked to objective exploration? One of the four "hypothetical behaviours" used is the selection of trajectories which maxi... (read more)