

Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?

I predict "no" because

  • As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
  • The number of parallel threads of software development will also go up, with multiple high-value projects making mutually incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).

As such, I predict an explosion of software complexity and jank in the near future.

I suspect that it's a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022 can get at least 70% on the full set of 60 with decent prompting and tooling.

By "tooling and scaffolding" I mean something along the lines of

  • Naming the lists that the model submits (e.g. "round 7 list 2")
  • A tool where the LLM can submit a named hypothesis in the form of a Python function which takes a list and returns a boolean, and which checks whether the results of that function on all submitted lists from previous rounds match the answers it's seen so far
  • Seeing the round number on every round
  • Dropping everything except the seen lists and non-falsified named hypotheses in the context each round (this is more of a practical thing to avoid using absurd volumes of tokens, but I imagine it wouldn't hurt performance too much)
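
A minimal sketch of what the hypothesis-checking tool could look like. Everything here is an assumption for illustration: the `HypothesisTracker` name, the method names, and the choice of integer lists are all hypothetical, and it is not a tested harness for the actual benchmark.

```python
from typing import Callable, Dict, List, Tuple

# A hypothesis is any function from a list to a boolean.
Hypothesis = Callable[[List[int]], bool]


class HypothesisTracker:
    """Keeps the seen lists and the named hypotheses that still fit them."""

    def __init__(self) -> None:
        # Each observation is (list name, list contents, answer received).
        self.observations: List[Tuple[str, List[int], bool]] = []
        self.hypotheses: Dict[str, Hypothesis] = {}

    def record(self, round_number: int, list_number: int,
               values: List[int], answer: bool) -> str:
        """Store a submitted list under a name like 'round 7 list 2'."""
        name = f"round {round_number} list {list_number}"
        self.observations.append((name, values, answer))
        return name

    def submit_hypothesis(self, name: str, rule: Hypothesis) -> List[str]:
        """Return the names of seen lists the rule gets wrong.

        The rule is kept only if it matches every answer seen so far.
        """
        mismatches = [
            list_name
            for list_name, values, answer in self.observations
            if bool(rule(values)) != answer
        ]
        if not mismatches:
            self.hypotheses[name] = rule
        return mismatches

    def prune(self) -> None:
        """Drop hypotheses falsified by lists recorded after submission."""
        self.hypotheses = {
            name: rule
            for name, rule in self.hypotheses.items()
            if all(bool(rule(v)) == a for _, v, a in self.observations)
        }
```

Each round, the context would then only need `observations` and the names of whatever is left in `hypotheses` after calling `prune()`.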

I'll probably play around with it a bit tomorrow.

Adapting spaced repetition to interruptions in usage: Even without parsing the user’s responses (which would make this robust to difficult audio conditions), if the reader rewinds or pauses on some answers, the app should be able to infer that the user is having some difficulty with the relevant material, and dynamically generate new content that repeats those words or grammatical forms sooner than the default.

Likewise, if the user takes a break for a few days, weeks, or months, the ratio of old to new material should automatically adjust accordingly, as forgetting is more likely, especially of relatively new material. (And of course with text to speech, an interactive app that interpreted responses from the user could and should be able to replicate LanguageZen’s ability to specifically identify (and explain) which part of a user’s response was incorrect, and why, and use this information to adjust the schedule on which material is reviewed or introduced.)

Seems like this one is mostly a matter of schlep rather than capability. The abilities you would need to make this happen are

  1. Have a highly granular curriculum for what vocabulary and what skills are required to learn the language and a plan for what order to teach them in / what spaced repetition schedule to aim for
  2. Have a granular and continuously updated model of the user's current knowledge of vocabulary, rules of grammar and acceptability, and idioms, along with any phonemes or phoneme sequences they have trouble with
  3. Given specific highly granular learning goals (e.g. "understanding when to use preterite vs imperfect when conjugating saber" in Spanish) within the curriculum and the model of the user's knowledge and abilities, produce exercises which teach / evaluate those specific skills.
  4. Determine whether the user had trouble with the exercise, and if so what the trouble was
  5. Based on the type of trouble the user had, describe what updates should be made to the model of the user's knowledge and vocabulary
  6. Correctly apply the updates from (5)
  7. Adapt to deviations from the spaced repetition plan (tbh this seems like the sort of thing you would want to do with normal code)
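
As a rough illustration of the "normal code" in step 7, here is one way a schedule could be adjusted after a break. The `Card` fields and the shrink rule are assumptions made up for this sketch, not how LanguageZen or any existing spaced repetition system works. The rule shortens an item's review interval in proportion to how overdue it is relative to that interval, so the same absolute break hits new material harder than well-established material.

```python
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    interval_days: float  # current planned gap between reviews
    due_day: float        # day number on which the next review was planned


def adapt_after_break(card: Card, today: float) -> Card:
    """Shrink the interval of an overdue card and make it due today.

    Cards that are not overdue are returned unchanged.
    """
    overdue = max(0.0, today - card.due_day)
    if overdue == 0.0:
        return card
    # Lateness measured in units of the card's own interval: a 30-day
    # break is 15 intervals late for a 2-day card, but only half an
    # interval late for a 60-day card.
    lateness = overdue / card.interval_days
    new_interval = max(1.0, card.interval_days / (1.0 + lateness))
    return Card(card.name, new_interval, today)


# A 30-day break: the new card resets to a 1-day interval, while the
# established card only drops from 60 days to 40.
new_card = adapt_after_break(Card("saber (preterite)", 2.0, 10.0), today=40.0)
old_card = adapt_after_break(Card("hola", 60.0, 10.0), today=40.0)
```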

I expect that the hardest things here will be 1, 2, and 6, and I expect them to be hard because of the volume of required work rather than the technical difficulty. But I also expect the LanguageZen folks have already tried this and could give you a more detailed view about what the hard bits are here.

Automatic customization of content through passive listening

This sounds like either a privacy nightmare or a massive battery drain. The good language models are quite compute intensive, so running them on a battery-powered phone will drain the battery very fast. Especially since this would need to hook into the "granular model of what the user knows" piece.

(and yes, I do in fact think it's plausible that the CTF benchmark saturates before the OpenAI board of directors signs off on bumping the cybersecurity scorecard item from low to medium)

So here’s a question: When we have AGI, what happens to the price of chips, electricity, and teleoperated robots?


As measured in what units?

  • The price of one individual chip of given specs, as a fraction of the net value that can be generated by using that chip to do things that ambitious human adults do: What Principle A cares about, goes up until the marginal cost and value are equal
  • The price of one individual chip of given specs, as a fraction of the entire economy: What Principle B cares about, goes down as the number of chips manufactured increases
  • The price of one individual chip of given specs, relative to some other price such as nominal US dollars, inflation-adjusted US dollars, metric tons of rice, or 2000 square foot single-family homes in Berkeley: ¯\_(ツ)_/¯, depends on lots of particulars, not sure that any high-level economic principles say anything specific here

These only contradict each other if you assume that "the value that can be generated by one ambitious adult human divided by the total size of the economy" is a roughly constant value.

Yeah, agreed - the allocation of compute per human would likely become even more skewed if AI agents (or any other tooling improvements) allow your very top people to get more value out of compute than the marginal researcher currently gets.

And notably this shifting of resources from marginal to top researchers wouldn't require achieving "true AGI" if most of the time your top researchers spend isn't spent on "true AGI"-complete tasks.

I think I misunderstood what you were saying there - I interpreted it as something like

Currently, ML-capable software developers are quite expensive relative to the cost of compute. Additionally, many small experiments provide more novel and useful insights than a few large experiments. The top practically-useful LLM costs about 1% as much per hour to run as an ML-capable software developer, and that 100x decrease in cost and the corresponding switch to many small-scale experiments would likely result in at least a 10x increase in the speed at which novel, useful insights were generated.

But on closer reading I see you said (emphasis mine)

I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you'd get as if the human employees operated 10x faster. (In other words, 10x'ing this labor input.)

So if the employees spend 50% of their time waiting on training runs which are bottlenecked on company-wide availability of compute resources, and 50% of their time writing code, 10xing their labor input (i.e. the speed at which they write code) would result in about an 80% increase in their labor output. Which, to your point, does seem plausible.
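
The 80% figure is just Amdahl's law applied to the 50/50 split. A quick check, where the function name is my own and the 50% and 10x inputs are the ones assumed above:

```python
def output_speedup(fraction_sped_up: float, factor: float) -> float:
    """Amdahl's law: overall speedup when only part of the work gets faster."""
    return 1.0 / ((1.0 - fraction_sped_up) + fraction_sped_up / factor)


# Half the time is spent writing code (sped up 10x), and the other half
# is spent waiting on training runs (not sped up at all).
speedup = output_speedup(fraction_sped_up=0.5, factor=10.0)
percent_increase = (speedup - 1.0) * 100.0  # about 82%
```

Even with an infinitely fast coder the speedup here caps out at 2x, since the waiting half never shrinks.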

Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1/4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I'm not including input prices for simplicity, but input is much cheaper than output and it's just a messy BOTEC anyway.)

If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you're spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
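
For anyone who wants to poke at the numbers, here is the same BOTEC as a script. All inputs are the ones quoted above; the variable names are mine, and the rounding of the inference cost to $400,000 follows the original estimate.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

copies = 50
tokens_per_second = 70
uptime = 1 / 4                      # fraction of the time each copy is running
dollars_per_token = 15 / 1_000_000  # output price only, as in the estimate

# Yearly inference cost for the whole swarm: roughly $414k, rounded to $400k.
inference_cost = (
    copies * tokens_per_second * uptime * SECONDS_PER_YEAR * dollars_per_token
)

# Spend the same (rounded) amount on GPUs and split it evenly between copies.
gpu_budget_per_copy = 400_000 / copies  # $8,000 per year

# Each copy waits on experiments 18 hours per day (the other 3/4 of the time).
waiting_hours_per_year = 18 * 365
gpu_dollars_per_hour = gpu_budget_per_copy / waiting_hours_per_year  # ~$1.22
```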

So I guess the crux is "would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent".

(and yes, at some point the question switches to "would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans" but at that point it's a matter of quality, not quantity)

End points are easier to infer than trajectories

Assuming that which end point you get to doesn't depend on the intermediate trajectories at least.

Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.

The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is approximately

When will AI capabilities reach or exceed all human capabilities that are load bearing in human society?

A related line of questions, though, is

  • When will AI capabilities reach a threshold where a number of agents can form a larger group that accumulates knowledge, tooling, and expertise over time?
  • Will their roles in such a group look similar to the roles that people have in human civilization?
  • Will the individual agents (if "agent" is even the right model to use) within that group have more control over the trajectory of the group as a whole than individual people have over the trajectory of human civilization?

In particular the third question seems pretty important.
