All of jsteinhardt's Comments + Replies

AI Forecasting: Two Years In

I don't see it in the header in Mobile (although I do see the updated text now about it being a link post). Maybe it works on desktop but not mobile?

2habryka2y

Ah, that is possible, and if so, seems like something we should adjust. I don't like hiding UI elements on mobile that are visible on desktop. Edit: Yep, looks like it's indeed just hidden on mobile. Not sure how high priority it is, but seems kind of like a bug to me.

jsteinhardt2y96

Is it clear these results don't count? I see nothing in the Metaculus question text that rules it out.

3Dan H2y

It was specified in the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113 In your metaculus question you may not have added that restriction. I think the question is much less interesting/informative if it does not have that restriction. The questions were designed assuming there's no calculator access. It's well-known many AIME problems are dramatically easier with a powerful calculator, since one could bash 1000 options and find the number that works for many problems. That's no longer testing problem-solving ability; it tests the ability to set up a simple script so loses nearly all the signal. Separately, the human results we collected was with a no calculator restriction. AMC/AIME exams have a no calculator restriction. There are different maths competitions that allow calculators, but there are substantially fewer quality questions of that sort. I think MMLU+calculator is fine though since many of the exams from which MMLU draws allow calculators.

AI Forecasting: Two Years In

Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell?

Mods, could you have these posts link back to my blog Bounded Regret in some form? Right now there is no indication that this is cross-posted from my blog, and no link back to the original source.

2habryka2y

It does link to it in the header: But I'll make it a link-post so it's clearer. Different people have different preferences for how noticeable they want the cross-posting to be.

Elon Musk announces xAI

jsteinhardt2y2710

Dan spent his entire PhD working on AI safety and did some of the most influential work on OOD robustness and OOD detection, as well as writing Unsolved Problems. Even if this work is less valued by some readers on LessWrong (imo mistakenly), it seems pretty inaccurate to say that he didn't work on safety before founding CAIS.

jsteinhardt2y101

Melanie Mitchell and Meg Mitchell are different people. Melanie was the participant in this debate, but you seem to be ascribing Meg's opinions to her, including linking to video interviews with Meg in your comments.

4the gears to ascension2y

Wait, whoops. Let me retrace identity here, sounds like a big mistake, sorry bout that Meg & Melanie when you see this post someday, heh. edit: oops! the video I linked doesn't contain a Mitchell at all! It's Emily M. Bender and Timnit Gebru, both of whom I have a high opinion of for their commentary on near-term AI harms, and both of whom I am frustrated with for not recognizing how catastrophic those very harms could become if they were to keep on getting worse.

I'm leaving it to the moderators to keep the copies mirrored, or just accept that errors won't be corrected on this copy. Hopefully there's some automatic way to do that?

jsteinhardt2y50

Oops, thanks, updated to fix this.

Thanks! I removed the link.

1reallyeli2y

:thumbsup: Looks like you removed it on your blog, but you may also want to remove it on the LW post here.

Steering GPT-2-XL by adding an activation vector

Thanks! I removed the link.

jsteinhardt2yΩ120

Glad it was helpful!

Steering GPT-2-XL by adding an activation vector

jsteinhardt2yΩ336524

Hi Alex,

Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it's one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon's moderation norms, ... (read more)

TurnTrout2yΩ5110

Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper.

(I might reply later to specific points)

Steering GPT-2-XL by adding an activation vector

jsteinhardt2yΩ73014

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged w... (read more)

TurnTrout2yΩ132614

I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

The answer is: No, our work is very different from that paper. Here's the paragraph in question:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weig

jsteinhardt3y10598

Here is my take: since there's so much AI content, it's not really feasible to read all of it, so in practice I read almost none of it (and consequently visit LW less frequently).

The main issue I run into is that for most posts, on a brief skim it seems like basically a thing I have thought about before. Unlike academic papers, most LW posts do not cite previous related work nor explain how what they are talking about relates to this past work. As a result, if I start to skim a post and I think it's talking about something I've seen before, I have no easy ... (read more)

3Ruby3y

Over the years I've thought about a "LessWrong/Alignment" journal article format the way regular papers have Abstract-Intro-Methods-Results-Discussion. Something like that, but tailored to our needs, maybe also bringing in OpenPhil-style reasoning transparency (but doing a better job of communicating models). Such a format could possibly mandate what you're wanting here. I think it's tricky. You have to believe any such format actually makes posts better rather than constraining them, and it's worth the effort of writers to confirm. It is something I'd like to experiment with though.

Viliam3y1813

and consequently visit LW less frequently

Tangentially, "visiting LW less frequently" is not necessarily a bad thing. We are not in the business of selling ads; we do not need to maximize the time users spend here. Perhaps it would be better if people spent less time online (including on LW) and more time doing whatever meaningful things they might do otherwise.

But I agree that even assuming this, "the front page is full of things I do not care about" is a bad way to achieve it.

the gears to ascension3y1513

tools for citation to the existing corpus of lesswrong posts and to off-site scientific papers would be amazing; eg, rolling search for related academic papers as you type your comment via the semanticscholar api, combined with search over lesswrong for all proper nouns in your comment. or something. I have a lot of stuff I want to say that I expect and intend is mostly reference to citations, but formatting the citations for use on lesswrong is a chore, and I suspect that most folks here don't skim as many papers as I do. (that said, folks like yourself c... (read more)

Hiring Programmers in Academia

jsteinhardt3y81

I think this might be an overstatement. It's true that NSF tends not to fund developers, but in ML the NSF is only one of many funders (lots of faculty have grants from industry partnerships, for instance).

5Adam Jermyn3y

Ah this is a good point! I’m thinking more of physics, which has much more centralized funding provided by a few actors (and where I see tons of low-hanging fruit if only some full-time SWE’s could be hired). In other fields YMMV.

Personal forecasting retrospective: 2020-2022

Thanks for writing this!

Regarding how surprise on current forecasts should factor into AI timelines, two takes I have:

* Given that all the forecasts seem to be wrong in the "things happened faster than we expected" direction, we should probably expect HLAI to happen faster than expected as well.

* It also seems like we should retreat more to outside views about general rates of technological progress, rather than forming a specific inside view (since the inside view seems to mostly end up being wrong).

I think a pure outside view would give a med... (read more)

1elifland3y

I don't think we should update too strongly on these few data points; e.g. a previous analysis of Metaculus' AI predictions found "weak evidence to suggest the community expected more AI progress than actually occurred, but this was not conclusive". MATH and MMLU feel more relevant than the average Metaculus AI prediction but not enough to strongly outweigh the previous findings. I'd be interested to check out that dataset! Hard for me to react too much to the strategy without more details, but outside-view-ish reasoning about predicting things far-ish in the future that we don't know much about (and as you say, have often been wrong on the inside view) seems generally reasonable to me. I mentioned in the post that my median is now ~2050 which is 28 years out; as for how I formed my forecast, I originally roughly start with Ajeya's report, added some uncertainty and had previously shifted further out due to intuitions I had about data/environment bottlenecks, unknown unknowns, etc. I still have lots of uncertainty but my median has moved sooner to 2050 due to MATH forcing me to adjust my intuitions some, reflections on my hesitations against short-ish timelines, and Daniel Kokotajlo's work.

Thanks! I just read over it and assuming I understood correctly, this bottleneck primarily happens for "small" operations like layer normalization and softlax, and not for large matrix multiples. In addition, these small operations are still the minority of runtime (40% in their case). So I think this is still consistent with my analysis, which assumes various things will creep in to keep GPU utilization around 40%, but that they won't ever drive it to (say) 10%. Is this correct or have I misunderstood the nature of the bottleneck?

Edit: also maybe we're ju... (read more)

jsteinhardt3y110

Short answer: If future AI systems are doing R&D, it matters how quickly the R&D is happening.

Okay, thanks! The posts actually are written in markdown, at least on the backend, in case that helps you.

2habryka3y

In that case, if the Markdown dialect matches up, everything might just work fine if you activate the Markdown editor in your Account settings, and then copy-paste the text into the editor (I would try it first in a new post, to make sure it works).

Why I'm Optimistic About Near-Term AI Risk

Question for mods (sorry if I asked this before): Is there a way to make the LaTeX render?

In theory MathJax should be enough, eg that's all I use at the original post: https://bounded-regret.ghost.io/how-fast-can-we-perform-a-forward-pass/

2habryka3y

Yeah, sorry. The difference between the rendering systems and your blog is very minor, but has annoying effects in this case. The delimiters we use in HTML are $ and $ instead of the $ on your blog, since that reduces potential errors with people using currency and other similar things. If you submit your HTML with $ and $, then it should render correctly. I also have a short script I can use to fix this, though it currently requires manual effort each time. I might add a special case to your blog or something to change it automatically, though it would probably take me a bit to get around to. Alternatively, if you write your posts in Markdown on your blog, then that would also translate straightforwardly into the right thing here.

3Said Achmiz3y

If you edit the post on GreaterWrong, you should be able to paste in the LaTeX source and have it render, e.g.: ElapsedTime=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩54C2M3: if B>23M√N8C2NB2(M−B/√N): else. (I don’t know how to do it in the Less Wrong editor, but presumably it’s possible there as well.)

jsteinhardt3y70

I was surprised by this claim. To be concrete, what's your probability of xrisk conditional on 10-year timelines? Mine is something like 25% I think, and higher than my unconditional probability of xrisk.

5Rohin Shah3y

(Ideally we'd be clearer about what timelines we mean here, I'll assume it's TAI timelines for now.) Conditional on 10-year timelines, maybe I'm at 20%? This is also higher than my unconditional probability of x-risk. I'm not sure which part of my claim you're surprised by? Given what you asked me, maybe you think that I think that 10-year timelines are safer than >10-year timelines? I definitely don't believe that. My understanding was that this post was suggesting that timelines are longer than 10 years, e.g. from sentences like this: And that's the part I agree with (including their stated views about what will happen in the next 10 years).

Early 2022 Paper Round-up

jsteinhardt3y90

Fortunately (?), I think the jury is still out on whether phase transitions happen in practice for large-scale systems. It could be that once a system is complex and large enough, it's hard for a single factor to dominate and you get smoother changes. But I think it could go either way.

Early 2022 Paper Round-up

jsteinhardt3y60

Thanks! I pretty much agree with everything you said. This is also largely why I am excited about the work, and I think what you wrote captures it more crisply than I could have.

Buck's Shortform

jsteinhardt3yΩ11170

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

[RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm.

Note the answer changes a lot based on how the question is operationalized. This stronger operationalization has dates around a decade later.

More Is Different for AI

ML Systems Will Have Weird Failure Modes

Here's a link to the version on my blog: https://bounded-regret.ghost.io/appendix-more-is-different-in-related-fields/

More Is Different for AI

jsteinhardt3yΩ120

Yup! That sounds great :)

2Ruby3y

Here it is! https://www.lesswrong.com/s/4aARF2ZoBpFZAhbbe You might want to edit the description and header image.

More Is Different for AI

jsteinhardt3yΩ120

Thanks Ruby! Now that the other posts are out, would it be easy to forward-link them (by adding links to the italicized titles in the list at the end)?

2Ruby3y

We can also make a Sequence. I assume "More Is Different for AI" should be the title of the overall Sequence too?

2Ruby3y

Done!

jsteinhardt3y50

@Mods: Looks like the LaTeX isn't rendering. I'm not sure what the right way to do that is on LessWrong. On my website, I do it with code injection. You can see the result here, where the LaTeX all renders in MathJax: https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/

4habryka3y

Yeah, sorry, we are currently importing your post directly as HTML. We don't do code-injection, we figure out what the right HTML for displaying the LaTeX is server-side, and then store that directly in the HTML for the post. The reason why it isn't working out of the box is that we don't support single-dollar-sign delimiters for LaTeX in HTML, because they have too many false-positives with people just trying to use dollar signs in normal contexts. Everything would actually work out by default if you used the MathJax $ and $ delimiters instead, which are much less ambiguous. I will convert this one manually for now, not sure what the best way moving forward is. Maybe there is a way you can configure your blog to use the $ and $ delimiters instead, or maybe we can adjust our script to get better at detecting when people want to use the single-dollar-delimiter for MathJax purposes, versus other purposes.

2delton1373y

I just did some tests... it works if you go to settings and click "Activate Markdown Editor". Then convert to Markdown and re-save (note, you may want to back up before this, there's a chance footnotes and stuff could get messed up). $stuff$ for inline math and double dollar signs for single line math work when in Markdown mode. When using the normal editor, inline math doesn't work, but $$ works (but puts the equation on a new line).

3Mark Xu3y

I think latex renders if you're using the markdown editor, but if you're using the other editor then it only works if you use the equation editor.

What's Up With Confusingly Pervasive Goal Directedness?

What's Up With Confusingly Pervasive Goal Directedness?

I feel like you are arguing for a very strong claim here, which is that "as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time"

Hm, this isn't the claim I intended to make. Both because it overemphasizes on "efficient" and because it adds a lot of "for all" statements.

If I were trying to state my claim more clearly, it would be something like "generically, for the large majority... (read more)

3habryka3y

I am a bit confused by what we mean by "of the sort you would come across in ML". Like, this situation, where we are trying to derive an algorithm that solves problems without optimizers, from an algorithm that solves problems with optimizers, is that "the sort of problem you would come across in ML"?. It feels pretty different to me from most usual ML problems. I also feel like in ML it's quite hard to actually do this in practice. Like, it's very easy to tell whether a self-driving car AI has an accident, but not very easy to actually get it to not have any accidents. It's very easy to tell whether an AI can produce a Harry Potter-level quality novel, but not very easy to get it to produce one. It's very easy to tell if an AI has successfully hacked some computer system, but very hard to get it to actually do so. I feel like the vast majority of real-world problems we want to solve do not currently follow the rule of "if you can distinguish good answers you can find good answers". Of course, success in ML has been for the few subproblems where this turned out to be easy, but clearly our prior should be on this not working out, given the vast majority of problems where this turned out to be hard. (Also, to be clear, I think you are making a good point here, and I am pretty genuinely confused for which kind of problems the thing you are saying does turn out to be true, and appreciate your thoughts here)

jsteinhardt3y100

Thanks for the push-back and the clear explanation. I still think my points hold and I'll try to explain why below.

In order to even get a single expected datapoint of approval, I need to sample 10^8 examples, which in our current sampling method would take 10^8 * 10 hours, e.g. approximately 100,000 years. I don't understand how you could do "Learning from Human Preferences" on something this sparse

This is true if all the other datapoints are entirely indistinguishable, and the only signal is "good" vs. "bad". But in practice you would compare / rank the d... (read more)

2habryka3y

Well, sure, but that is changing the problem formulation quite a bit. It's also not particularly obvious that it helps very much, though I do agree it helps. My guess is even with a rank-ordering, you won't get the 33 bits out of the system in any reasonable amount of time at 10 hours evaluation cost. I do think if you can somehow give more mechanistic and detailed feedback, I feel more optimistic in situations like this, but also feel more pessimistic that we will actually figure out how to do that in situations like this. I feel like you are arguing for a very strong claim here, which is that "as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time". This sentence can of course be false without implying that the human preferences work is impossible, so there must be some confusion happening. I am not arguing that this is impossible for all problems, indeed ML has shown that this is indeed quite feasible for a lot of problems, but making the claim that it works for all of them is quite strong, but I also feel like it's obvious enough that this is very hard or impossible for a large other class of problems (like, e.g. reversing hash functions), and so we shouldn't assume that we can just do this for an arbitrary problem.

2habryka3y

I was talking about "costly" in terms of computational resources. Like, of course if I have a system that gets the right answer in 1/100,000,000 cases, and I have a way to efficiently tell when it gets the right answer, then I can get it to always give me approximately always the right answer by just running it a billion times. But that will also take a billion times longer. In-practice, I expect most situations where you have the combination of "In one in a billion cases I get the right answer and it costs me $1 to compute an answer" and "I can tell when it gets the right answer", you won't get to a point where you can compute a right answer for anything close to $1.

What's Up With Confusingly Pervasive Goal Directedness?

jsteinhardt3y30

This would imply a fixed upper bound on the number of bits you can produce (for instance, a false negative rate of 1 in 128 implies at most 7 bits). But in practice you can produce many more than 7 bits, by double checking your answer, combining multiple sources of information, etc.

4JBlack3y

Combining multiple source of information, double checking etc are ways to decrease error probability, certainly. The problem is that they're not independent. For highly complex spaces not only does the number of additional checks you need increase super-linearly, but the number of types of checks you need likely possibly also increases super-linearly. That's my intuition, at least.

Anchor Weights for ML

jsteinhardt3y80

Maybe, but I think some people would disagree strongly with this list even in the abstract (putting almost no weight on Current ML, or putting way more weight on humans, or something else). I agree that it's better to drill down into concrete disagreements, but I think right now there are implicit strong disagreements that are not always being made explicit, and this is a quick way to draw them out.

What's Up With Confusingly Pervasive Goal Directedness?

What's Up With Confusingly Pervasive Goal Directedness?

Basically the same techniques as in Deep Reinforcement Learning from Human Preferences and the follow-ups--train a neural network model to imitate your judgments, then chain it together with RL.

I think current versions of that technique could easily give you 33 bits of information--although as noted elsewhere, the actual numbers of bits you need might be much larger than that, but the techniques are getting better over time as well.

6habryka3y

Hmm, I don't currently find myself very compelled by this argument. Here are some reasons: In order to even get a single expected datapoint of approval, I need to sample 10^8 examples, which in our current sampling method would take 10^8 * 10 hours, e.g. approximately 100,000 years. I don't understand how you could do "Learning from Human Preferences" on something this sparse I feel even beyond that, this still assumes that the reason it is proposing a "good" plan is pure noise, and not the result of any underlying bias that is actually costly to replace. I am not fully sure how to convey my intuitions here, but here is a bad analogy: It seems to me that you can have go-playing-algorithms that lose 99.999% of games against an expert AI, but that doesn't mean you can distill a competitive AI that wins 50% of games, even though it's "only 33 bits of information". Like, the reason why your AI is losing has a structural reason, and the reason why the AI is proposing consequentialist plans also has a structural reason, so even if we get within 33 bits (which I do think seems unlikely), it's not clear that you can get substantially beyond that, without drastically worsening the performance of the AI. In this case, it feels like maybe an AI maybe gets lucky and stumbles upon a plan that solves the problem without creating a consequentialist reasoner, but it's doing that out of mostly luck, not because it actually has a good generator for non-consequentialist-reasoner-generating-plans, and there is no reliable way to always output those plans without actually sampling at least something like 10^4 plans. The intuition of "as soon as I have an oracle for good vs. bad plans I can chain an optimizer to find good plans" feels far too strong to me in generality, and I feel like I can come up with dozen of counterexamples where this isn't the case. Like, I feel like... this is literally a substantial part of the P vs. NP problem, and I can't just assume my algorithm just li

jsteinhardt3y50

Yes, I think I understand that more powerful optimizers can find more spurious solutions. But the OP seemed to be hypothesizing that you had some way to pick out the spurious from the good solutions, but saying it won't scale because you have 10^50, not 100, bad solutions for each good one. That's the part that seems wrong to me.

1JBlack3y

Your "harmfulness" criteria will always have some false negative rate. If you incorrectly classify a harmful plan as beneficial one time in a million, in the former case you'll get 10^44 plans that look good but are really harmful for every one that really is good. In the latter case you get 10000 plans that are actually good for each one that is harmful.

TurnTrout3y130

That part does seem wrong to me. It seems wrong because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:

If the agent flips the first bit, it's locked into a single trajectory. None of its actions matter anymore.
But if the agent flips the second bit – this may be suboptimal for a utility function, but the agent still has lots of choices remaining. In fact, it still can induce $(n \times n)^{T - 1}$ observation histories. If $n = 100$ and $T = 50$ , then that's $(100 \times 100)^{49} = 10^{196}$

jsteinhardt3y100

I'm not sure I understand why it's important that the fraction of good plans is 1% vs .00000001%. If you have any method for distinguishing good from bad plans, you can chain it with an optimizer to find good plans even if they're rare. The main difficulty is generating enough bits--but in that light, the numbers I gave above are 7 vs 33 bits--not a clear qualitative difference. And in general I'd be kind of surprised if you could get up to say 50 bits but then ran into a fundamental obstacle in scaling up further.

habryka3y130

Can you be more concrete about how you would do this? If my method for evaluation is "sit down and think about the consequences of doing this for 10 hours", I have no idea how I would chain it with an optimizer to find good plans even if they are rare.

4JBlack3y

I think the problem is not quite so binary as "good/bad". It seems to be more effective vs ineffective and beneficial vs harmful. The problem is that effective plans are more likely to be harmful. We as a species have already done a lot of optimization in a lot of dimensions that are important to us, and the most highly effective plans almost certainly have greater side effects that make thing worse in dimensions that we aren't explicitly telling the optimizer to care about. It's not so much that there's a direct link between sparsity of effective plans and likelihood of bad outcomes, as that more complex problems (especially dealing with the real world) seem more likely to have "spurious" solutions that technically meet all the stated requirements, but aren't what we actually want. The beneficial effective plans become sparse faster than the harmful effective plans, simply because in a more complex space there are more ways to be unexpectedly harmful than good.

Thought Experiments Provide a Third Anchor

Thought Experiments Provide a Third Anchor

Thanks! Yes, this makes very similar points :) And from 4 years ago!

jsteinhardt3y80

The fear of anthropomorphising AI is one of the more ridiculous traditional mental blindspots in the LW/rationalist sphere.

You're really going to love Thursday's post :).

Jokes aside, I actually am not sure LW is that against anthropomorphising. It seems like a much stronger injunction among ML researchers than it is on this forum.

I personally am not very into using humans as a reference class because it is a reference class with a single data point, whereas e.g. "complex systems" has a much larger number of data points.

In addition, it seems like intuition ... (read more)

Okay I think I get what you're saying now--more SGD steps should increase "effective model capacity", so per the double descent intuition we should expect the validation loss to first increase then decrease (as is indeed observed). Is that right?

But if you keep training, GD should eventually find a low complexity high test scoring solution - if one exists - because those solutions have an even higher score (with some appropriate regularization term). Obviously much depends on the overparameterization and relative reg term strength - if it's too strong GD may fail or at least appear to fail as it skips the easier high complexity solution stage. I thought that explanation of grokking was pretty clear.

I think I'm still not understanding. Shouldn't the implicit regularization strength of SGD be higher... (read more)

8jacob_cannell3y

I think grokking requires explicit mild regularization (or at least, it's easier to model how that leads to grokking). The total objective is training loss + reg term. Initially the training loss totally dominates, and GD pushes that down until it overfits (finding a solution with near 0 training loss balanced against reg penalty). Then GD bounces around on that near 0 training loss surface for a while, trying to also reduce the reg term without increasing the training loss. That's hard to do, but eventually it can find rare solutions that actually generalize (still allow near 0 training loss at much lower complexity). Those solutions are like narrow holes in that surface. You can run it as long as you want, but it's never going to ascend into higher complexity regions than those which enable 0 training loss (model entropy on order data set entropy), the reg term should ensure that.

2jsteinhardt3y

jsteinhardt3y*40

I'm not sure I get what the relation would be--double descent is usually with respect to the model size (vs. amout of data), although there is some work on double descent vs. number of training iterations e.g. https://arxiv.org/abs/1912.02292. But I don't immediately see how to connect this to grokking.

(I agree they might be connected, I'm just saying I don't see how to show this. I'm very interested in models that can explain grokking, so if you have ideas let me know!)

8jacob_cannell3y

(That arxiv link isn't working btw.) It makes sense that GD will first find high complexity overfit solutions for an overcomplete model - they are most of the high test scoring solution space. But if you keep training, GD should eventually find a low complexity high test scoring solution - if one exists - because those solutions have an even higher score (with some appropriate regularization term). Obviously much depends on the overparameterization and relative reg term strength - if it's too strong GD may fail or at least appear to fail as it skips the easier high complexity solution stage. I thought that explanation of grokking was pretty clear. I was also under the impression that double descent is basically the same thing, but viewed from the model complexity dimension. Initially in the under-parameterized regime validation error decreases with model complexity up to a saturation point just below where it can start to memorize/overfit, then increases up to a 2nd worse overfitting saturation point, then eventually starts to decrease again heading into the strongly overparameterized regime (assuming appropriate mild regularization). In the strongly overparameterized regime 2 things are happening: firstly it allows the model capacity to more easily represent a distribution of solutions rather than a single solution, and it also effectively speeds up learning in proportion by effectively evaluating more potential solutions (lottery tickets) per step. Grokking can then occur, as it requires sufficient overpamaterization (whereas in the underparameterized regime there isn't enough capacity to simultaneously represent a sufficient distribution of solutions to smoothly interpolate and avoid getting stuck in local minima) Looking at it another way: increased model complexity has strong upside that scales nearly unbounded with model complexity, coupled with the single downside of overfitting which saturates at around data memoriation complexity.

San Francisco shares COVID data only when it's too late

jsteinhardt3y90

I don't think it's inferior -- I think both of them have contrasting strengths and limitations. I think the default view in ML would be to use 95% empiricism, 5% philosophy when making predictions, and I'd advocate for more like 50/50, depending on your overall inclinations (I'm 70-30 since I love data, and I think 30-70 is also reasonable, but I think neither 95-5 or 5-95 would be justifiable).

I'm curious what in the post makes you think I'm claiming philosophy is superior. I wrote this:

> Confronting emergence will require adopting mindsets that are le... (read more)

San Francisco shares COVID data only when it's too late

Also my personal take is that SF, on a pure scientific/data basis, has had one of the best responses in the nation, probably benefiting from having UCSF for in-house expertise. (I'm less enthusiastic about the political response--I think we erred way too far on the "take no risks" side, and like everyone else prioritized restaurants over schools which seems like a clear mistake. But on the data front I feel like you're attacking one of the singularly most reasonable counties in the U.S.)