All of Jsevillamol's Comments + Replies

I'm talking from a personal perspective here as Epoch director.

  • I personally take AI risks seriously, and I think they are worth investigating and preparing for.
  • I co-started Epoch AI to get evidence and clarity on AI and its risks and this is still a large motivation for me.
  • I have drifted towards a more skeptical position on risk in the last two years. This is due to a combination of seeing the societal reaction to AI, me participating in several risk evaluation processes, and AI unfolding more gradually than I expected 10 years ago.
  • Currently I am more worr
... (read more)

The ability to pay liabilities is important to factor in, and this illustrates it well. For the largest prosaic catastrophes, this might well be the dominant consideration.

For smaller risks, I suspect that in practice mitigation, transaction, and prosecution costs are what dominate the calculus of who should bear the liability, both in AI and more generally.

What's the FATE community? Fair AI and Tech Ethics?

Fairness, Accountability, Transparency, Ethics. I think this research community/area is often also called "AI ethics"

We have conveniently just updated our database if anyone wants to investigate this further!
https://epochai.org/data/notable-ai-models
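For example, here is a minimal sketch of how one might start poking at the data, assuming you export the database to a CSV. The file name and column names below are assumptions and may not match the actual export.

```python
import pandas as pd

# Hypothetical CSV export of https://epochai.org/data/notable-ai-models;
# the file name and column names are assumptions.
df = pd.read_csv("notable_ai_models.csv")

# Keep rows with a known training compute estimate and plot the trend over time.
df = df.dropna(subset=["Publication date", "Training compute (FLOP)"])
df["Publication date"] = pd.to_datetime(df["Publication date"])

ax = df.plot.scatter(x="Publication date", y="Training compute (FLOP)", logy=True)
ax.set_title("Training compute of notable ML models over time")
```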

Here is a "predictable surprise" I don't discussed often: given the advantages of scale and centralisation for training, it does not seem crazy to me that some major AI developers will be pooling resources in the future, and training jointly large AI systems.

3ryan_greenblatt
Relatedly, over time as capital demands increase, we might see huge projects which are collaborations between multiple countries. I also think that investors could plausibly end up with more and more control over time if capital demands grow beyond what the largest tech companies can manage. (At least if these investors are savvy.) (The things I write in this comment are commonly discussed amongst people I talk to, so not exactly surprises.)

I've been tempted to do this sometime, but I fear the prior is performing one very important role you are not making explicit: defining the universe of possible hypotheses you consider.

In turn, that universe of hypotheses defines what Bayesian updates look like. Here is a problem that arises when you ignore this: https://www.lesswrong.com/posts/R28ppqby8zftndDAM/a-bayesian-aggregation-paradox
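To make the issue concrete, here is a toy numerical sketch (my own example, not taken from the linked post): an expert reports likelihoods for an observation z under hypotheses A, B and C, and you coarse-grain B and C into a single hypothesis. The "likelihood of B∪C" is then a prior-weighted average of the individual likelihoods, so which prior you use changes the update itself, not just the posterior.

```python
# Toy illustration: merging hypotheses makes the reported "likelihood" prior-dependent.
# Expert-reported likelihoods P(z | hypothesis); hypothetical numbers.
likelihood = {"A": 0.9, "B": 0.5, "C": 0.1}

def merged_likelihood(prior_b: float, prior_c: float) -> float:
    """P(z | B or C) given a prior over B and C (law of total probability)."""
    return (likelihood["B"] * prior_b + likelihood["C"] * prior_c) / (prior_b + prior_c)

print(merged_likelihood(0.5, 0.5))  # 0.30 with a uniform prior over B and C
print(merged_likelihood(0.9, 0.1))  # 0.46 if B is strongly favoured a priori
```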

shrug 

I think this is true to an extent, but a more systematic analysis needs to back this up.

For instance, I recall quantization techniques working much better after a certain scale (though I can't seem to find the reference...). It also seems important to validate that techniques to increase performance apply at large scales. Finally, note that the frontier of scale is growing very fast, so even if these discoveries were made with relatively modest compute compared to the frontier, that is still a tremendous amount of compute!

even a pause which completely stops all new training runs beyond current size indefinitely would only ~double timelines at best, and probably less

 

I'd emphasize that we currently don't have a very clear sense of how algorithmic improvement happens, and it is likely mediated to some extent by large experiments, so I think a pause is likely to slow timelines more than this implies.

2johnswentworth
I mean, we can go look at the things which people do when coming up with new more-efficient transformer algorithms, or figuring out the Chinchilla scaling laws, or whatever. And that mostly looks like running small experiments, and extrapolating scaling curves on those small experiments where relevant. By the time people test it out on a big run, they generally know very well how it's going to perform. The place where I'd see the strongest case for dependence on large compute is prompt engineering. But even there, it seems like the techniques which work on GPT-4 also generally work on GPT-3 or 3.5?

I agree! I'd be quite interested in looking at TAS data, for the reason you mentioned.

I think Tetlock and company might have already done some related work?

Question decomposition is part of the superforecasting commandments, though I can't recall off the top of my head if they were RCT'd individually or just as a whole.

ETA: This is the relevant paper (h/t Misha Yagudin). It was not about the 10 commandments. Apparently those haven't been RCT'd at all?

2niplav
I don't remember anything specific from reading their stuff, but that would of course be useful. Sadly, I haven't been able to find any more recent investigations into decomposition, e.g. Connected Papers for MacGregor 1999 gives nothing worthwhile after 2006 on a first skim, but I'll perhaps look more at it.

I cowrote a detailed response here

https://www.cser.ac.uk/news/response-superintelligence-contained/

Essentially, this type of reasoning proves too much, since it implies we cannot show any properties whatsoever of any program, which is clearly false.

Here is some data via Matthew Barnett and Jess Riedl.

The number of cumulative miles driven by Cruise's autonomous cars is growing exponentially, at roughly 1 OOM per year.

https://twitter.com/MatthewJBar/status/1690102362394992640

8Daniel Kokotajlo
Oh shit! So, seems like my million rides per day metric will be reached sometime in 2025? That is indeed somewhat faster than I expected. Updating, updating... Thanks!

That is, to a very rough approximation, correct.

Davidson's takeoff model illustrates this point, where a "software singularity" happens for some parameter settings due to software not being constrained to the same degree by capital inputs.

I would point out, however, that our current understanding of how software progress happens is somewhat poor. Experimentation is definitely a big component of software progress, and it is often understated on LW.

More research on this soon!

algorithmic progress is currently outpacing compute growth by quite a bit

This is not right, at least in computer vision. They seem to be the same order of magnitude.

Physical compute has grown at 0.6 OOM/year, while the compute required to reach a given level of performance has decreased at 0.1 to 1.0 OOM/year; see a summary here or an in-depth investigation here.
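As a quick back-of-the-envelope with those figures (a sketch; I'm simply adding the growth rates in log space):

```python
# Growth rates in OOM/year (orders of magnitude per year), as quoted above.
physical_compute_growth = 0.6        # growth of physical training compute
algorithmic_efficiency = (0.1, 1.0)  # reduction in compute needed for fixed performance

# Algorithmic progress is the same order of magnitude as physical compute growth,
# and the two stack into roughly 0.7 to 1.6 OOM/year of "effective" compute growth.
effective = [physical_compute_growth + g for g in algorithmic_efficiency]
print(effective)  # [0.7, 1.6]
```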

Another relevant quote

Algorithmic progress explains roughly 45% of performance improvements in image classification, and most of this occurs through improving compute-efficiency.

1meijer1973
Algorithmic improvement has more FOOM potential. Hardware always has a lag. 
3habryka
Cool, makes sense. Sounds like I remembered the upper bound for the algorithmic efficiency estimate. Thanks for correcting!

The superscript T is not a transpose! It is the timestep T. We are raising the expression to the T-th power.

Thanks!

Our current best guess is that this includes costs other than the amortized compute of the final training run.

If no extra information surfaces we will add a note clarifying this and/or adjust our estimate.

1Edouard Harris
Gotcha, that makes sense!

Thanks Neel!

The difference between tf16 and FP32 comes to a ~15x factor IIRC. Though ML developers also seem to prioritise characteristics other than cost-effectiveness when choosing GPUs, like raw performance and interconnect, so you can't just multiply the top price-performance we showcase by this factor and expect it to match the cost-performance of the largest ML runs today.

More soon-ish.

Because there is more data available for FP32, so it's easier to study trends there.

We should release a piece soon about how the picture changes when you account for different number formats, plus considering that most runs happen with hardware that is not the most cost-efficient.

Note that Richard is not treating Knightian uncertainty as special and unquantifiable; instead, he is giving examples of how to treat it like any other uncertainty, which he explicitly quantifies and incorporates into his predictions.

I'd prefer calling Richard's notion "model error" to separate the two, but I'm also okay appropriating the term as Richard did to point to something coherent.

7Richard_Ngo
Yeah, so I have mixed feelings about this. One problem with the Knightian uncertainty label is that it implies some level of irreducibility; as Nate points out in the sequence (and James points out below), there are in fact a bunch of ways of reducing it, or dividing it into different subcategories. On the other hand: this post is mainly not about epistemology, it's mainly about communication. And from a communication perspective, Knightian uncertainty points at a big cluster of things that form a large proportion of the blocker between rationalists and non-rationalists communicating effectively about AI. E.g. as Nate points out: So if you have the opinion that Nate and many other rationalists don't know how to do these things enough, then you could either debate them about epistemology, or you could say "we have different views about how much you should do this cluster of things that Knightian uncertainty points to, let's set those aside for now and actually just talk about AI". I wish I'd had that mental move available to me in my conversations with Eliezer so that we didn't get derailed into philosophy of science; and so that I spent more time curious and less time annoyed at his overconfidence. (And all of that applies orders of magnitude more to mainstream scientists/ML researchers hearing these arguments.)

To my knowledge, we currently don’t have a way of translating statements about “loss” into statements about “real-world capabilities”.

 

Now we do!

My intuition is that it's not a great approximation in those cases, similar to how in the regular Laplace rule the empirical approximation is not great when you have e.g. N<5.

I'd need to run some calculations to confirm that intuition though.

This site claims that the string SolidGoldMagikarp was the username of a moderator involved somehow with Twitch Plays Pokémon

https://infosec.exchange/@0xabad1dea/109813506433583177

4mwatkins
Partially true. SGM was a redditor, but seems to have got tokenised for other reasons, full story here: https://twitter.com/SoC_trilogy/status/1623118034960322560 "TPPStreamerBot" is definitely a Twitch Plays Pokemon connection. Its creator has shown up in the comments here to explain what it was.
4Jsevillamol
Here is a 2012 meme about SolidGoldMagikarp https://9gag.com/gag/3389221

I still don't understand - did you mean "when T/t is close to zero"?

1dust_to_must
Oops yes, sorry!
1dust_to_must
Oops, I meant lambda! edited :) 

That's exactly right, and I think the approximation holds as long as T/t>>1.

This is quite intuitive - as the amount of data goes to infinity, the rate of events should equal the number of events so far divided by the time passed.
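(To make this concrete, here is a minimal numerical sketch. I'm assuming a constant-rate Poisson process and a uniform prior on the rate, so that after n events in time T the posterior over the rate is Gamma(n+1, T); that particular choice of prior is my assumption for illustration.)

```python
import numpy as np

# Assume a constant-rate Poisson process: n events observed in time T,
# uniform prior on the rate, so the posterior rate is Gamma(n + 1, T).
def p_no_event_exact(n: int, T: float, t: float) -> float:
    """Posterior predictive P(no events in the next t), marginalising over the rate."""
    return (T / (T + t)) ** (n + 1)

def p_no_event_approx(n: int, T: float, t: float) -> float:
    """Plug-in approximation e^{-lambda*t} with lambda = n / T."""
    return np.exp(-(n / T) * t)

n, T = 10, 100.0
# The approximation is good when T/t >> 1 and degrades as t approaches T.
for t in [1.0, 10.0, 100.0]:
    print(t, p_no_event_exact(n, T, t), p_no_event_approx(n, T, t))
```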

1dust_to_must
Thanks for the confirmation! In addition to what you say, I would also guess that e^(−λt) is a reasonable guess for P(no events in time t) when t > T, if it's reasonable to assume that events are Poisson-distributed. (but again, open to pushback here :)

If you want to join the Spanish-speaking EA community, you can do so through this link!

I agree with the sentiment that indiscriminate regulation is unlikely to have good effects.

I think the step that is missing is analysing the specific policies No-AI Art Activists are likely to advocate for, and whether it is a good idea to support them.

My current sense is that data helpful for alignment is unlikely to be public right now, and so harder copyright would not impede alignment efforts. The kind of data that I could see being useful are things like scores and direct feedback. Maybe at most things like Amazon reviews could end up being useful for to... (read more)

Great work!

Stuart Armstrong gave one more example of a heuristic argument based on the presumption of independence here.

https://www.lesswrong.com/posts/iNFZG4d9W848zsgch/the-goldbach-conjecture-is-probably-correct-so-was-fermat-s

There are a huge number of examples like that floating around in the literature; we link to some of them in the writeup. I think Terence Tao's blog is the easiest place to get an overview of these arguments; see this post in particular, but he discusses this kind of reasoning often.

I think it's easy to give probabilistic heuristic arguments for about 80 of the ~100 conjectures in the Wikipedia category "Unsolved problems in number theory".

About 30 of those (including the Goldbach conjecture) follow from the Cramér random model of the primes. Another 9 a... (read more)
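To give a flavour of this style of argument, here is my own crude toy version (not from the thread): treat a number near n as "prime" with independent probability 1/ln n and ignore local correction factors. Under that model the expected number of ways to write an even N as a sum of two primes grows like N/(2 ln²N), which goes to infinity, so heuristically one expects no large counterexamples to Goldbach.

```python
import math
from sympy import isprime

def goldbach_count(n: int) -> int:
    """Actual number of ways to write even n as p + q with p <= q, both prime."""
    return sum(1 for p in range(2, n // 2 + 1) if isprime(p) and isprime(n - p))

def crude_heuristic(n: int) -> float:
    """Cramer-style estimate ~ n / (2 ln^2 n), ignoring local (mod small prime) factors."""
    return n / (2 * math.log(n) ** 2)

# The crude estimate undercounts by a roughly constant factor (the omitted local factors),
# but both grow without bound, which is the point of the heuristic argument.
for n in (1_000, 10_000, 100_000):
    print(n, goldbach_count(n), round(crude_heuristic(n)))
```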

Here are my quick takes from skimming the post.

In short, the arguments I think are best are A1, B4, C3, C4, C5, C8, C9 and D. I don't find any of them devastating.

A1. Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

I am not sure I parse this one. I am reading it as "AI systems might be more like imitators than optimizers" from the example, which I find moderately persuasive.

A2. Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

I am not sure I understand this one either. I am readi... (read more)

Eight examples, no cherry-picking:

 

Nit: Having a wall of images makes this post unnecessarily hard to read.
I'd recommend making a 4x2 collage with the photos so they don't take up as much space.

9habryka
I edited it to be a table (my guess is this was primarily the result of images being displayed differently by default on the AI Impacts website and LessWrong).

As is often the case, I just found out that Jaynes was already discussing an issue similar to the paradox here in his seminal book.

This Wikipedia article summarizes the gist of it.

Ah sorry for the lack of clarity - let's stick to my original submission for PVE

That would be:
 

[0,1,0,1,0,0,9,0,0,1,0,0]
 

Yes, I am looking at decks that appear in the dataset, and more particularly at decks that have faced a deck similar to the rival's.

Good to know that one gets similar results using the different scoring functions.

I guess that maybe the approach does not work that well ¯\_(ツ)_/¯ 

3aphyer
Seeking clarification here: which of these decks are you currently submitting?  If you need more time to decide, let me know.

Thank you for bringing this up!

I think you might be right, since the deck is quite undiverse, and according to the rest of the answers diversity is important. That being said, I could not find the mistake in the code at a glance :/

Do you have any opinions on [1, 1, 0, 1, 0, 1, 2, 1, 1, 3, 0, 1]? According to my code, this would be the worst deck amongst the decks that played against a deck similar to the rival's.

1Measure

Marius Hobbhahn has estimated the number of parameters here. His final estimate is 3.5e6 parameters.

Anson Ho has estimated the training compute (his reasoning at the end of this answer). His final estimate is 7.8e22 FLOPs.

Below I made a visualization of the parameters vs training compute of n=108 important ML systems, so you can see how DeepMind's system (labelled GOAT in the graph) compares to other systems.

[Final calculation]
(8 TPUs)(4.20e14 FLOP/s)(0.1 utilisation rate)(32 agents)(7.3e6 s/agent) = 7.8e22 FLOPs

==========================
NOTES BELOW

[Ha

... (read more)
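(For reference, the final calculation quoted above is a straightforward product of the listed factors; a quick check:)

```python
# Reproducing the final calculation quoted above.
tpus = 8
peak_flop_per_s = 4.20e14   # per TPU
utilisation = 0.1
agents = 32
seconds_per_agent = 7.3e6

total_flop = tpus * peak_flop_per_s * utilisation * agents * seconds_per_agent
print(f"{total_flop:.1e}")  # 7.8e22 FLOP, matching the quoted figure
```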
4Daniel Kokotajlo
Thanks so much! So, for comparison, fruit flies have more synapses than these XLAND/GOAT agents have parameters! https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons

Here is my very bad approach after spending ~one hour playing around with the data

  1. Filter decks that fought against a deck similar to the rival's, using a simple measure of distance (sum of absolute differences between the deck components)
  2. Compute a 'score' of the decks. The score is defined as the sum of 1/deck_distance(deck) * (1 or -1 depending on whether the deck won or lost against the challenger) 
  3. Report the deck with the maximum score

So my submission would be: [0,1,0,1,0,0,9,0,0,1,0,0]

 

Code
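(The original code appears to have been spoilered out of this view; below is a minimal sketch of the three steps described above, not the original code. The file name, column names, the rival deck, the distance cutoff of 3, and the +1 in the weight are all assumptions, and I'm reading "deck_distance" as the distance from the opposing deck in each game to the rival's deck.)

```python
import numpy as np
import pandas as pd

# Minimal sketch of the three steps above, NOT the original (spoilered) code.
# Assumed layout: one row per game, twelve card-count columns for the player's deck
# ("my_*"), twelve for the opponent's ("opp_*"), and a boolean "player_won".
games = pd.read_csv("dnd_sci_deck_data.csv")              # hypothetical file name
rival = np.array([2, 0, 1, 1, 0, 3, 0, 1, 0, 0, 2, 2])    # placeholder for the rival's deck

my_cols = sorted(c for c in games.columns if c.startswith("my_"))
opp_cols = sorted(c for c in games.columns if c.startswith("opp_"))

# 1. Keep games whose opponent deck is close to the rival's
#    (distance = sum of absolute differences between card counts).
dist_to_rival = np.abs(games[opp_cols].to_numpy() - rival).sum(axis=1)
close = games[dist_to_rival <= 3].copy()
close["weight"] = 1.0 / (1.0 + dist_to_rival[dist_to_rival <= 3])  # +1 avoids division by zero

# 2. Score each player deck: sum over its games of weight * (+1 if it won, -1 if it lost).
close["signed_weight"] = close["weight"] * np.where(close["player_won"], 1.0, -1.0)
scores = close.groupby(my_cols)["signed_weight"].sum()

# 3. Report the deck with the maximum score.
print(scores.idxmax(), scores.max())
```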

3Measure
2aphyer
Could you try reformatting this, please? It looks like your answer hasn't been successfully spoilered out. Thank you!

Seems like you want to include A, L, P, V, E in your decks, and avoid B, S, K. Here is the correlation between the quantity of each card and whether the deck won. The ordering is ~similar when computing the inclusion winrate for each card.
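(A sketch of the calculation behind this, reusing the assumed DataFrame layout from the sketch further up; the column names remain assumptions.)

```python
# Correlation between each card's count in the player's deck and whether the deck won
# ("games" and "my_cols" as defined in the earlier sketch).
card_win_corr = games[my_cols].corrwith(games["player_won"].astype(float))
print(card_win_corr.sort_values(ascending=False))
```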

Thanks for the comment!

I am personally sympathetic to the view that AlphaGo Master and AlphaGo Zero are off-trend.

In the regression with all models, their inclusion does not change the median slope but drastically increases the noise, as you can see for yourself in the visualization by selecting the option 'big_alphago_action = remove' (see the table below for a comparison of regressing the large-model trend without vs with the big AlphaGo models).

In appendix B we study the effects of removing AlphaGo Zero and AlphaGo Master when studying record-setting models. The upp... (read more)

Following up on this: we have updated appendix F of our paper with an analysis of different choices of the threshold that separates large-scale and regular-scale systems. Results are similar independently of the threshold choice.

Thanks for engaging!

 

To use this theorem, you need both an x (your data / evidence), and a θ (your parameter).

Parameters are abstractions we use to simplify modelling. What we actually care about is the probability of unknown events given past observations.

 

You start out discussing what appears to be a combination of two forecasts

To clarify: this is not what I wanted to discuss. The expert is reporting how you should update your priors given the evidence, and remaining agnostic on what the priors should be.

 

A likelihood is

... (read more)
1JonasMoss
Okay, thanks for the clarification! Let's see if I understand your setup correctly. Suppose we have the probability measures pE and p1, where pE is the probability measure of the expert. Moreover, we have an outcome x∈{A,B,C}. In your post, you use p1(x∣z) ∝ pE(z∣x)p1(x), where z is an unknown outcome known only to the expert. To use Bayes' rule, we must make the assumption that p1(z∣x)=pE(z∣x). This assumption doesn't sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations. I'm not sure. When we're looking directly at the probability of an event x (instead of the probability of the probability of an event), things get much simpler than I thought. Let's see what happens to the likelihood when you aggregate from the expert's point of view. Letting x∈{A,B,C}, we need to calculate the expert's likelihoods pE(z∣A) and pE(z∣B∪C). In this case, pE(z∣B∪C) = pE(B∪C∣z)pE(z) / pE(B∪C) = [pE(B∣z)+pE(C∣z)]pE(z) / pE(B∪C) = [pE(z∣B)pE(B)+pE(z∣C)pE(C)] / [pE(B)+pE(C)], which is essentially your calculation, but from the expert's point of view. The likelihood pE(z∣B∪C) depends on pE(B∪C), the prior of the expert, which is unknown to you. That shouldn't come as a surprise, as he needs to use his prior in order to combine the probabilities of the events B and C. But the calculations are exactly the same from your point of view, leading to p1(z∣B∪C) = [pE(z∣B)p1(B)+pE(z∣C)p1(C)] / [p1(B)+p1(C)]. Now, suppose we want to generally ensure that pE(z∣B∪C)=p1(z∣B∪C), which is what I believe you want to do, and which seems pretty natural, at least since we're allowed to assume that pE(z∣x)=p1(z∣x) for all simple events x. To ensure this, we will probably have to require that your priors are the same as the expert's; in other words, your joint distributions are equal, or p1(z,x)=pE(z,x). Do you agree with this summary?

Great sequence - it is a nice compendium of the theories and important thought experiments.

I will probably use this as a reference in the future, and refer other people here for an introduction.

Looking forward to future entries!

1Heighn
Awesome, thanks for your comment!

I am glad Yair! Thanks for giving it a go :)

Those I know who train large models seem to be very confident we will get 100 Trillion parameter models before the end of the decade, but do not seem to think it will happen, say, in the next 2 years. 

 

FWIW if the current trend continues we will first see 1e14 parameter models in 2 to 4 years from now.

I am pretty pumped about this. Google Docs + LaTeX support is a huge game changer for me.

There's also a lot of research that didn't make your analysis, including work explicitly geared towards smaller models. What exclusion criteria did you use? I feel like if I was to perform the same analysis with a slightly different sample of papers I could come to wildly divergent conclusions.


It is not feasible to do an exhaustive analysis of all milestone models. We are necessarily missing some important ones, either because we are not aware of them, because they did not provide enough information to deduce the training compute, or because we haven't gott... (read more)

Great questions! I think it is reasonable to be suspicious of the large-scale distinction.

I do stand by it - I think the companies discontinuously increased their training budgets around 2016 for some flagship models.[1] If you mix these models with the regular trend, you might believe that the trend was doubling very fast up until 2017 and then slowed down. It is not an entirely unreasonable interpretation, but it explains the discontinuous jumps around 2016 less well. Appendix E discusses this in depth.

The way we selected the large-scale models is half ... (read more)

Following up on this: we have updated appendix F of our paper with an analysis of different choices of the threshold that separates large-scale and regular-scale systems. Results are similar independently of the threshold choice.
