Inflection.ai is a major AGI lab

Nikola Jurkovic

Update (April 2024): Due to the recent breakup of Inflection, I no longer think they're on track to be a major AGI lab.

Inflection.ai (co-founded by DeepMind co-founder Mustafa Suleyman) should be perceived as a frontier LLM lab of similar magnitude to Meta, OpenAI, DeepMind, and Anthropic, based on their compute, valuation, current model capabilities, and plans to train frontier models. Compared to the other labs, Inflection seems to put less effort into AI safety.

Thanks to Laker Newhouse for discussion and feedback!

Inflection has a lot of compute dedicated to training LLMs

They plan to scale up their cluster to 3 times the capacity used to train GPT-4.
- "We'll be building a cluster of around 22,000 H100s. This is approximately three times more compute than what was used to train all of GPT4. Speed and scale are what's going to really enable us to build a differentiated product,"
- “We believe in scale as the engine of progress in AI, and we are building one of the largest supercomputers in the world to develop and deploy the new generation of AIs.”
They can apparently train a model about as capable as GPT-2 in 11 minutes of cluster time (see Appendix).
Side point: It seems that the actual H100s are (at least partly) owned by CoreWeave (a cloud compute provider), but that Inflection is one of CoreWeave’s main clients. The specific cluster is a joint effort between Inflection and CoreWeave.
- “They called us and said, ‘Guys, we need you to build one of the most high-performance supercomputers on the planet to support our AI company,'” McBee said. “They call us and they say, ‘This is what we’re looking for, can you do it?’

Inflection has a lot of funding

Inflection is valued at $4B and has raised $1.5B, which is similar to Anthropic ($4.1B valuation, total raised $1.3B as of May 2023) and within an order of magnitude of OpenAI ($28B valuation, $11B raised as of April 2023).

Inflection is on the cutting edge of LLMs

Their flagship LLM, Inflection-1, has similar benchmark results to GPT-3.5.
They seem to be currently training a model about as capable as GPT-4. I expect them to finish training by the end of the year.
- “We will also be releasing a technical memo detailing one of our models in the same compute class as PaLM-2 and GPT-4.”

Inflection plans to train frontier LLMs

They seem to plan to train models 10x or 100x the size of GPT-4 within 18 months.
- “We are about to train models that are 10 times larger than the cutting edge GPT-4 and then 100 times larger than GPT-4. That’s what things look like over the next 18 months.”
  - (it is unclear if “we” refers to Inflection or humanity)

Inflection doesn’t seem to acknowledge existential risks or have a sizable safety team

Their safety site has zero mention of existential or catastrophic risks. Their White House memo is not very reassuring either.
Out of 19 open job listings, only 2 are on the Safety team.
If you look at their LinkedIn (which seems to list most of their current ~40 employees), zero of their employees are listed as working on AI safety at Inflection (one person has the word “safety” in their description but it’s unclear that it’s referring to their position at Inflection).
- I think that this mostly means that the Inflection Safety team members list themselves as “Technical staff” or don’t have LinkedIns. But to me it seems like they have fewer than 5 people working on safety.

Appendix: Estimating Inflection’s compute

Here are some back-of-the-envelope calculations for Inflection’s current compute from three data sources. The resulting estimates span roughly 2 orders of magnitude and center around 4e18 FLOPS.

FLOPs = plural of “floating point operation (FLOP)”

FLOPS = floating point operations per second
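As a rough check on the "2 orders of magnitude, centered around 4e18" summary, the sketch below takes the three FLOPS estimates derived in the routes that follow and computes their geometric mean and spread. Treating "centered" as a geometric mean is an assumption of this sketch, and the dictionary keys and function names are illustrative only.

```python
import math

# The three FLOPS estimates from the routes in this appendix.
estimates_flops = {
    "H100 route": 3.6e19,
    "GPT-4 route": 6.7e17,
    "11-minute benchmark route": 2.3e18,
}

def geometric_mean(values):
    """Average in log space, which suits estimates spread over orders of magnitude."""
    logs = [math.log10(v) for v in values]
    return 10 ** (sum(logs) / len(logs))

def spread_in_orders_of_magnitude(values):
    """log10 of the ratio between the largest and smallest estimate."""
    return math.log10(max(values) / min(values))

values = list(estimates_flops.values())
center = geometric_mean(values)
spread = spread_in_orders_of_magnitude(values)
print(f"center ~ {center:.1e} FLOPS, spread ~ {spread:.1f} orders of magnitude")
```

Under that assumption the center lands a little below 4e18 FLOPS and the spread is a bit under 2 orders of magnitude.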

The H100 route

From the H100 datasheet, it seems that the different H100 models, and the different numeric precisions each supports, have different FLOPS ratings. I will simplify and assume one H100 provides an effective 10,000 teraFLOPS, which is 1e16 FLOPS. Inflection.ai currently has around 3.6 thousand H100s, which puts total FLOPS at 3.6e19.
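The arithmetic for this route can be sketched as follows, spelling out the unit conversion (1 teraFLOPS = 1e12 FLOPS). The per-GPU figure is the post's simplifying assumption, not a datasheet value, and the function name is illustrative.

```python
TERA = 1e12  # 1 teraFLOPS = 1e12 FLOPS

def cluster_flops(num_gpus, teraflops_per_gpu):
    """Total cluster throughput in FLOPS, assuming every GPU contributes equally."""
    return num_gpus * teraflops_per_gpu * TERA

# Assumptions taken from the post: ~3.6 thousand H100s at an
# effective 10,000 teraFLOPS each.
per_gpu_flops = 10_000 * TERA
total_flops = cluster_flops(num_gpus=3_600, teraflops_per_gpu=10_000)

print(f"per GPU: {per_gpu_flops:.1e} FLOPS, cluster: {total_flops:.1e} FLOPS")
```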

The “train GPT-4 in 4 months when we triple our cluster” route

Inflection thinks they’ll be able to train GPT-4 with four months of cluster time once they triple their cluster size. This means they think they can train GPT-4 in one year of cluster time right now. Epoch thinks GPT-4 took 2.1e25 FLOPs to train, which puts Inflection’s current compute at 6.7e17 FLOPS.
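A minimal sketch of this route's arithmetic: four months on a tripled cluster is taken to mean one year on the current cluster, and the Epoch training-compute estimate is divided by the seconds in that year. The variable names and the 365.25-day year are choices made for this sketch.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds

GPT4_TRAINING_FLOP = 2.1e25   # Epoch's estimate, as quoted in the post
months_after_tripling = 4     # Inflection's claim for the tripled cluster
cluster_multiplier = 3        # the planned cluster is 3x the current one

# With a third of the throughput, the same job takes three times as long.
years_needed_now = months_after_tripling * cluster_multiplier / 12

implied_flops = GPT4_TRAINING_FLOP / (years_needed_now * SECONDS_PER_YEAR)
print(f"implied current compute: {implied_flops:.1e} FLOPS")
```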

The “11 minutes on the GPT-3 MLPerf benchmark” route

Inflection can train GPT-3 up to 2.69 log perplexity on the C4 dataset in 11 minutes. What does this mean? I’m not sure, as I have found it hard to find any modern model’s log perplexity scores on that dataset. GPT-3's log perplexity seems to be -1.73 on some dataset. GPT-2-1.5b’s log perplexity on another dataset seems to be around 3.3. Not sure what to make of that, but let’s assume Inflection can train GPT-2 in 11 minutes on their cluster. This would put their current compute at 2.3e18 FLOPS if we use the Epoch estimate of how much compute GPT-2 took to train.
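The post does not quote the Epoch figure for GPT-2, so the sketch below backs out the training compute implied by the 2.3e18 FLOPS result and an 11-minute run, then confirms the round trip. The function names are illustrative, and the "GPT-2 in 11 minutes" premise is the post's own loose assumption.

```python
def implied_training_flop(cluster_flops, minutes):
    """Total FLOP performed by a cluster running at cluster_flops for `minutes`."""
    return cluster_flops * minutes * 60

def implied_cluster_flops(training_flop, minutes):
    """Cluster throughput (FLOPS) needed to perform training_flop in `minutes`."""
    return training_flop / (minutes * 60)

# Figures from the post: 2.3e18 FLOPS, 11 minutes of cluster time.
gpt2_flop_implied = implied_training_flop(2.3e18, minutes=11)
round_trip_flops = implied_cluster_flops(gpt2_flop_implied, minutes=11)

print(f"implied GPT-2 training compute: {gpt2_flop_implied:.1e} FLOP")
```

So the estimate is consistent with a GPT-2 training compute of roughly 1.5e21 FLOP.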

I have an intuition about Suleyman: his marketing background makes him an incredibly unreliable source of actual information. He makes a lot of big predictions about future AI capabilities (his claims about hallucinations are a recent example I can think of) and engages heavily in hype drumming in his interviews and on social media. The untrustworthy aura I feel around the company extends to their products. Inflection-1's technical paper (I can't find a potential longer version) is very short compared to those for GPT-4 or PaLM-2 and consists entirely of pictures of condensed benchmark results with a few paragraphs of explanation.

I expect my views to be wrong, but for now, while Inflection definitely has the compute, I have a feeling there are a lot more limits and smoke involved, enough that I wouldn't quite put them up with OpenAI, DeepMind, and Meta in terms of impact on both the market and AGI timelines.

Edit: I originally misread the compute calculations from the post and used my mistake as evidence for my first point. Though that evidence is no longer there, it was confirming an intuition I already had and still stand by.

I'm not sure what you're referring to. The "three times the compute for GPT-4" claim places their compute at 6.7e17, which is lower than the estimates from the other two methods.

You're right, I completely misread it. I'll edit my comment with that in mind.

Relevant tweet/quote from Mustafa Suleyman, the co-founder and CEO:

Powerful AI systems are inevitable. Strict licensing and regulation is also inevitable. The key thing from here is getting the safest and most widely beneficial versions of both.

Suleyman's statements are either very specific capabilities predictions or incredibly vague statements like the one you brought up that don't really inform us much. His interviews often revolve around talking about how big and smart their future models will be while also spending time putting in a good word for their financial backers (mainly NVIDIA). I find myself frustrated at seeing this company with a lot of compute and potential impact on timelines, but whose CEO and main spokesperson seems very out-of-touch with the domain he does business in.

I have been thinking roughly similar things about adept.ai; in particular, because they take a relatively different approach that still relies on scale.

"We'll be building a cluster of around 22,000 H100s. This is approximately three times more compute than what was used to train all of GPT4.

This bothers me. It's a naive way of seeing compute. It's like confusing watts and watt-hours.

22,000 H100s provide three times the FLOP/s of the cluster used to train GPT-4, so you could train it in a third of the time, or with a third of your cluster in the same time.

I think this way of looking at compute encourages naive assumptions about what the compute can be used for. And FLOP/s is not a perfect unit for normal discourse when we're at 10¹⁵ scales.
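The rate-versus-quantity distinction can be made concrete with a small sketch. It uses normalized, hypothetical units (one "GPT-4 worth" of FLOP, and a baseline rate of 1.0 for the cluster that trained it), since the comment gives no absolute figures.

```python
def training_time(total_flop, flop_per_second):
    """Time to finish a fixed amount of work at a given rate (quantity / rate)."""
    return total_flop / flop_per_second

# Hypothetical normalized units, not measured values.
gpt4_budget = 1.0        # total FLOP: a quantity, like watt-hours
baseline_rate = 1.0      # FLOP/s of the original cluster: a rate, like watts
tripled_rate = 3 * baseline_rate

time_baseline = training_time(gpt4_budget, baseline_rate)
time_tripled = training_time(gpt4_budget, tripled_rate)          # a third of the time
time_third_of_tripled = training_time(gpt4_budget, tripled_rate / 3)  # same time

print(time_baseline, time_tripled, time_third_of_tripled)
```

A cluster with three times the FLOP/s does not hold "three times the compute" in any fixed sense. It finishes the same FLOP budget in a third of the time, or does three times the FLOP in the same time.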

10,000 teraFLOPS

Each H100 will be closer to 1,000 teraFLOPS or less. For reference, the A100 generally produces 150 teraFLOPS in real-world systems.
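To see how much this correction matters, the H100 route can be redone with 1,000 teraFLOPS per GPU as an upper bound, keeping the post's count of roughly 3.6 thousand H100s. This is a sketch under those two assumptions, and the names are illustrative.

```python
TERA = 1e12  # 1 teraFLOPS = 1e12 FLOPS
NUM_H100S = 3_600  # approximate count from the post

def cluster_flops(num_gpus, teraflops_per_gpu):
    """Total cluster throughput in FLOPS, assuming every GPU contributes equally."""
    return num_gpus * teraflops_per_gpu * TERA

original_estimate = cluster_flops(NUM_H100S, 10_000)  # the post's assumption
revised_upper_bound = cluster_flops(NUM_H100S, 1_000)  # "1,000 teraFLOPS or less"

print(f"original: {original_estimate:.1e} FLOPS")
print(f"revised upper bound: {revised_upper_bound:.1e} FLOPS")
```

With the lower per-GPU figure, the H100 route drops by a factor of ten, to at most about 3.6e18 FLOPS.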

Note that the MLPerf benchmark for GPT-3 is not on the full C4 dataset, it's on 0.4% of the C4 dataset.

See: https://twitter.com/abhi_venigalla/status/1673813863186452480?s=20

Inflection thinks they’ll be able to train GPT-4 with four months of cluster time once they triple their cluster size. This means they think they can train GPT-4 in one year of cluster time right now. Epoch thinks GPT-4 took 2.1e25 FLOPs to train, which puts Inflection’s current compute at 6.7e17 FLOPS.

Epoch references this Colab notebook for that 2.1e25 FLOP figure, which estimates 40% hardware utilization rate ("the rest of the time is overhead, idle time spent waiting for data from another calculation/memory, or recomputing results just in time to reduce memory bottlenecks"). Adjusting for that puts Inflection's current compute at 1.7e18 FLOP/s, closer to your third estimate.
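The adjustment described here can be sketched as dividing the achieved throughput by the utilization rate to recover the hardware's peak throughput. The 40% figure is the one attributed to the Colab notebook above, and the function name is illustrative.

```python
def peak_flops_from_achieved(achieved_flops, utilization):
    """Peak hardware throughput implied by an achieved rate and a utilization fraction."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return achieved_flops / utilization

achieved = 6.7e17    # FLOP/s implied by the one-year GPT-4 estimate
utilization = 0.40   # hardware utilization assumed in the notebook

adjusted = peak_flops_from_achieved(achieved, utilization)
print(f"utilization-adjusted compute: {adjusted:.1e} FLOP/s")
```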

2.1e25 FLOPs figure ... that puts Inflection's current compute at 1.7e18 FLOPs

This unit ambiguity is somewhat popular but trivial to avoid. Should be "2.1e25 FLOP" (without an 's') and "1.7e18 FLOP/s" or "1.7e18 FLOPS" (with 's' in uppercase).

Thanks, edited. Left the original quote unchanged.
