All of Hjalmar_Wijk's Comments + Replies

Maybe distracting technicality:

This seems to make the simplifying assumption that the R&D automation is applied to a large fraction of all the compute that was previously driving algorithmic progress right?

If we imagine that a company only owns 10% of the compute being used to drive algorithmic progress pre-automation (and is only responsible for say 30% of its own algorithmic progress, with the rest coming from other labs/academia/open-source), and this company is the only one automating their AI R&D, then the effect on overall progress might be r... (read more)
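(A toy version of the dilution I have in mind, with the model and numbers purely illustrative: automation multiplies only the company's own contribution to its algorithmic progress.)

```python
# Toy dilution model (illustrative assumptions, not exact figures from the comment):
# the company's algorithmic progress is 30% its own research and 70% external,
# and R&D automation multiplies only the internal share by `speedup`.

def overall_progress_multiplier(own_share: float, speedup: float) -> float:
    return own_share * speedup + (1.0 - own_share)

# A 10x speedup to the company's own research yields only ~3.7x overall progress.
print(overall_progress_multiplier(own_share=0.3, speedup=10.0))
```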

6ryan_greenblatt
Yes, I think frontier AI companies are responsible for most of the algorithmic progress. I think it's unclear how much the leading actor benefits from progress made at other, slightly-behind AI companies, and this could make progress substantially slower. (However, it's possible the leading AI company would be able to acquire the GPUs from these other companies.)

I'm curious how good current robots are compared to where they'd need to be to automate the biggest bottlenecks in further robot production. You say we start from 10,000/year, but is it plausible that all current robots are too clumsy/incapable for many key bottleneck tasks, and that getting to 10,000 sufficiently good robots produced per year might be a long path - e.g. it would take a decade+ for humans? Or are current robots close to sufficient with good enough software?

I also imagine that even taking current robot production processes, the gap between ... (read more)

6Daniel Kokotajlo
My impression is that software has been the bottleneck here. Building a hand as dextrous as the human hand is difficult but doable (and has probably already been done, though only in very expensive prototypes); having the software to actually use that hand as intelligently and deftly as a human would has not yet been done. But I'm not an expert.

Power supply is different -- humans can work all day on a few Big Macs, whereas robots will need to be charged, possibly charged frequently or even plugged in constantly. But that doesn't seem like a significant obstacle.

Re: WW2 vs. modern: yeah, idk. I don't think the modern gap between cars and humanoid robots is that big. Tesla is making Optimus, after all. Batteries, electronics, chips, electric motors, sensors... seems like the basic components are the same. And seems like the necessary tolerances are pretty similar; it's not like you need a clean room to make one but not the other, and it's not like you need hyperstrong-hyperlight exotic materials for one but not the other. In fact I can think of one very important, very expensive piece of equipment (the gigapress) that you need for cars but not for humanoid robots.

All of the above is for 'minimum viable humanoid robots', e.g. robots that can replace factory and construction workers. They might need to be plugged into the wall often, they might wear out after a year, they might need to do some kinds of manipulations 2x slower due to having fatter fingers or something. But they don't need to e.g. be capable of hiking for 48 hours in the wilderness and fording rivers all on the energy provided by a Big Mac. Nor do they need to be as strong-yet-lightweight as a human.
Hjalmar_WijkΩ330

Yeah that's right, I made too broad a claim and only meant to say it was an argument against their ability to pose a threat as rogue independent agents.

Hjalmar_Wijk*Ω792

I think in our current situation shutting down all rogue AI operations might be quite tough (though limiting their scale is certainly possible, and I agree with the critique that absent slowdowns or alignment problems or regulatory obstacles etc. it would be surprising if these rogue agents could compete with legitimate actors with many more GPUs).

Assuming the AI agents have money, there are maybe three remaining constraints they need to deal with:

  • purchasing/renting the GPUs,
  • (if not rented through a cloud provider) setting them up and maintainin
... (read more)
2ryan_greenblatt
I agree that this is a fairly strong argument that this AI agent wouldn't be able to cause problems while rogue. However, I think there is also a concern that this AI will be able to cause serious problems via the affordances it is granted through the AI lab. In particular, AIs might be given huge affordances internally with minimal controls by default. And this might pose a substantial risk even if AIs aren't quite capable enough to pull off the cybersecurity operation. (Though it seems not that bad of a risk.)
Hjalmar_Wijk*Ω790

I did some BOTECs on this and think 1 GB/s is sort of borderline, probably works but not obviously.

E.g. I assumed an MoE model of ~10 TB at fp8, with a sparsity factor of 4 and a hidden size of 32768.

With 32kB per token you could send at most 30k tokens/second over a 1GB/s interconnect. Not quite sure what a realistic utilization would be, but maybe we halve that to 15k?

If the model was split across 20 8xH100 boxes, then each box might do ~250 GFLOP/token (2 * 10T parameters / (4*20)), so each box would do at most 3.75 PFLOP/second, which might be about ~20-25% ut... (read more)
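(For anyone who wants to poke at these numbers, a minimal sketch of the BOTEC; the ~2 PFLOP/s fp8 dense peak per H100 is an assumed spec, not a number from the comment.)

```python
# Rough reproduction of the BOTEC above. The ~2e15 FLOP/s fp8 dense peak per H100
# is an assumed spec, not a number from the comment.

total_params     = 10e12         # ~10 TB model at fp8 (1 byte/param)
sparsity         = 4             # MoE: ~1/4 of params active per token
hidden_size      = 32768
bytes_per_token  = hidden_size   # fp8 activations => ~32 kB sent per token

interconnect_Bps = 1e9                                   # 1 GB/s
max_tokens_per_s = interconnect_Bps / bytes_per_token    # ~30.5k tokens/s
tokens_per_s     = 15e3                                  # after ~50% utilization haircut

n_boxes          = 20                                    # 8xH100 boxes the model is split across
active_params    = total_params / sparsity
flop_per_tok_box = 2 * active_params / n_boxes           # ~250 GFLOP per token per box

demand_per_box   = tokens_per_s * flop_per_tok_box       # ~3.75 PFLOP/s
peak_per_box     = 8 * 2e15                              # ~16 PFLOP/s fp8 dense (assumed)
print(f"~{tokens_per_s:,.0f} tokens/s, {demand_per_box / 1e15:.2f} PFLOP/s per box, "
      f"~{demand_per_box / peak_per_box:.0%} of assumed peak")
```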

Things that distinguish an "RSP" or "RSP-type commitment" for me (though as with most concepts, something could lack a few of the items below and still overall seem like an "RSP" or a "draft aiming to eventually be an RSP"):

  • Scaling commitments: Commitments are not just about deployment but about creating/containing/scaling a model in the first place (in fact for most RSPs I think the scaling/containment commitments are the focus, much more so than the deployment commitments which are usually a bit vague and to-be-determined)
  • Red lines: The commitment spel
... (read more)
8[anonymous]
This seems like a solid list. Scaling certainly seems core to the RSP concept. IMO "red lines, iterative policy updates, and evaluations & accountability" are sort of pointing at the same thing. Roughly something like "we promise not to cross X red line until we have published Y new policy and allowed the public to scrutinize it for Z amount of time."

One interesting thing here is that none of the current RSPs meet this standard. I suppose the closest is Anthropic's, where they say they won't scale to ASL-4 until they publish a new RSP (this would cover "red line" but I don't believe they commit to giving the public a chance to scrutinize it, so it would only partially meet "iterative policy updates" and wouldn't meet "evaluations and accountability.")

This seems like the meat of an ideal RSP. I don't think it's done by any of the existing voluntary scaling commitments. All of them have this flavor of "our company leadership will determine when the mitigations are sufficient, and we do not commit to telling you what our reasoning is." OpenAI's PF probably comes the closest, IMO (e.g., leadership will evaluate if the mitigations have moved the model from the stuff described in the "critical risk" category to the stuff described in the "high risk" category.)

As long as the voluntary scaling commitments end up having this flavor of "leadership will make a judgment call based on its best reasoning", it feels like the commitments lack most of the "teeth" of the kind of RSP you describe.

(So back to the original point– I think we could say that something is only an RSP if it has the "we won't cross this red line until we give you a new policy and let you scrutinize it and also tell you how we're going to reason about when our mitigations are sufficient" property, but then none of the existing commitments would qualify as RSPs. If we loosen the definition, then I think we just go back to "these are voluntary commitments that have to do with scaling & how the lab is t

Yeah, I agree that the lack of agency skills is an important part of the remaining human<>AI gap, and that it's possible that this won't be too difficult to solve (and that this could then lead to rapid further recursive improvements). I was just pointing toward evidence that there is a gap at the moment, and that current systems are poorly described as AGI.

2Daniel Kokotajlo
Yeah I wasn't disagreeing with you to be clear. Just adding.
Hjalmar_WijkΩ183013

I agree the term AGI is rough and might be more misleading than it's worth in some cases. But I do quite strongly disagree that current models are 'AGI' in the sense most people intend.

Examples of very important areas where 'average humans' plausibly do way better than current transformers:

  • Most humans succeed in making money autonomously. Even if they might not come up with a great idea to quickly 10x $100 through entrepreneurship, they are able to find and execute jobs that people are willing to pay a lot of money for. And many of these jobs are digita
... (read more)

Current AIs suck at agency skills. Put a bunch of them in AutoGPT scaffolds and give them each their own computer and access to the internet and contact info for each other and let them run autonomously for weeks and... well, I'm curious to find out what will happen; I expect it to be entertaining but not impressive or useful. Whereas, as you say, randomly sampled humans would form societies and find jobs etc.

This is the common thread behind all your examples Hjalmar. Once we teach our AIs agency (i.e. once they have lots of training-experience operating aut... (read more)

6abramdemski
With respect to METR, yeah, this feels like it falls under my argument against comparing performance against human experts when assessing whether AI is "human-level". This is not to deny the claim that these tasks may shine a light on fundamentally missing capabilities; as I said, I am not claiming that modern AI is within human range on all human capabilities, only enough that I think "human level" is a sensible label to apply.

However, the point about autonomously making money feels more hard-hitting, and has been repeated by a few other commenters. I can at least concede that this is a very sensible definition of AGI, which pretty clearly has not yet been satisfied. Possibly I should reconsider my position further.

The point about forming societies seems less clear. Productive labor in the current economy is in some ways much more complex and harder to navigate than it would be in a new society built from scratch. The Generative Agents paper gives some evidence in favor of LLM-based agents coordinating social events.

They do, as far as I can tell, commit to a fairly strong sort of "timeline" for implementing these things: before they scale to ASL-3-capable models (i.e. ones that pass their evals for autonomous capabilities or misuse potential).

2Zach Stein-Perlman
I read it differently; in particular my read is that they aren't currently implementing all of the ASL-2 security stuff (and they're not promising to do all of the ASL-3 stuff before scaling to ASL-3). Clarity from Anthropic would be nice. In "ASL-2 and ASL-3 Security Commitments," they say things like "labs should" rather than "we will." Almost none of their security practices are directly visible from the outside, but whether they have a bug bounty program is. They don't. But "Programs like bug bounties and vulnerability discovery should incentivize exposing flaws" is part of the ASL-2 security commitments. I guess when they "publish a more comprehensive list of our implemented ASL-2 security measures" we'll know more.

ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

I work at ARC evals, and mostly the answer is that this was sort of a trial run.

For some context, ARC evals spun up as a project last fall, and collaborations with labs are very much in their infancy. We tried to be very open about limitations in the system card, and I think the takeaways from our evaluations so far should mostly be that "It seems urgent to figure out the techniques and processes for evaluating dangerous capabilities of models; several leading scaling labs (including OpenAI and Anthropic) have been working with ARC on early attempts to do t... (read more)

2Jurgen Gravestein
So, basically, OpenAI has just deployed an extremely potent model, and the party evaluating its potentially dangerous capabilities is saying we should be more critical of the eval party + labs going forward? This makes it sound like an experiment waiting for it all to go wrong.

I might be missing something, but are they not just giving the number of parameters (in millions) on a log10 scale? Scaling laws are usually plotted against log-parameters, and I suppose they felt it was cleaner to subtract the constant log(10^6) from all the results (e.g. taking log(1300) instead of log(1.3B)).

The B they put at the end is a bit weird though.
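(A quick sanity check of that reading, using the 1.3B example above:)

```python
import math

# If the axis is log10(parameters in millions), a 1.3B-parameter model plots at ~3.11:
print(math.log10(1300))        # log10 of 1300 million parameters -> 3.1139...
print(math.log10(1.3e9) - 6)   # same value: log10(params) minus log10(10^6)
```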

Hjalmar_WijkΩ7128

As someone who has been feeling increasingly skeptical of working in academia I really appreciate this post and discussion on it for challenging some of my thinking here. 

I do want to respond especially to this part though, which seems cruxy to me:

Furthermore, it is a mistake to simply focus on efforts on whatever timelines seem most likely; one should also consider tractability and neglectedness of strategies that target different timelines. It seems plausible that we are just screwed on short timelines, and somewhat longer timelines are more tractab

... (read more)
3JanB
I had independently thought that this is one of the main parts where I disagree with the post, and wanted to write up a very similar comment to yours. Highly relevant link: https://www.fhi.ox.ac.uk/wp-content/uploads/Allocating-risk-mitigation.pdf My best guess would have been maybe 3-5x per decade, but 10x doesn't seem crazy.

Strongly agree with this, I think this seems very important.

These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above - I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this from what I can see? I think I'm likely missing your point.
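(A crude sketch of the distinction I have in mind, with every detail invented purely for illustration: a fixed sensor-to-action mapping versus a computation that stores a model and a goal and searches over actions.)

```python
# Purely illustrative toy, not a claim about real systems.

# "Behavioral" optimizer: a fixed functional mapping from sensor readings to actions;
# nothing inside stores a world-model or a goal.
def reactive_policy(temp: float, light: float, salinity: float) -> str:
    if temp > 0.8 or salinity > 0.9:
        return "tumble"
    return "swim_toward_light" if light > 0.5 else "swim_straight"

# "Concrete" optimizer: stores a (toy) world-model and a goal, and actively searches
# over actions for the one the model predicts will best further the goal.
ACTIONS = ["tumble", "swim_toward_light", "swim_straight"]

def predicted_nutrients(state, action):          # stored world-model (toy)
    temp, light, salinity = state
    gain = {"tumble": 0.0, "swim_toward_light": light, "swim_straight": 0.2}[action]
    return gain - 0.5 * salinity                 # goal: maximize predicted nutrients

def searching_policy(state):
    return max(ACTIONS, key=lambda a: predicted_nutrients(state, a))

state = (0.3, 0.9, 0.1)
print(reactive_policy(*state), searching_policy(state))
```

The two can produce identical behavior; the difference is in the internal computation.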

2romeostevensit
Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has functional mappings to some movement pattern. You could say this system has a 3-dimensional map and appears to search over it.

Theron Pummer has written about this precise thing in his paper on Spectrum Arguments, where he touches on this argument for "transitivity=>comparability" (here notably used as an argument against transitivity rather than an argument for comparability) and its relation to 'Sorites arguments' such as the one about sand heaps.

Personally I think the spectrum arguments are fairly convincing for making me believe in comparability, but I think there's a wide range of possible positions here and it's not entirely obvious which are actually inconsistent. Pummer

... (read more)

Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.

I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting it be corrigible for instrumental reasons, I think largely because it seems very elegant and that when it works many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and even possibly i

... (read more)
7abramdemski
The 'type of corrigibility' you are referring to there isn't corrigibility at all; rather, it's alignment. Indeed, the term corrigibility was coined to contrast with this, motivated by the fragility of this to getting the pointing right. I tend to agree. I'm hoping that thinking about myopia and related issues could help me understand more natural notions of corrigibility.

I really like this model of computation and how naturally it deals with counterfactuals; I'm surprised it isn't talked about more often.

This raises the issue of abstraction - the core problem of embedded agency.

I'd like to understand this claim better - are you saying that the core problem of embedded agency is relating high-level agent models (represented as causal diagrams) to low-level physics models (also represented as causal diagrams)?

8johnswentworth
I'm saying that the core problem of embedded agency is relating a high-level, abstract map to the low-level territory it represents. How can we characterize the map-territory relationship based on our knowledge of the map-generating process, and what properties does the map-generating process need to have in order to produce "accurate" & useful maps? How do queries on the map correspond to queries on the territory? Exactly what information is kept and what information is thrown out when the map is smaller than the territory? Good answers to these questions would likely solve the usual reflection/diagonalization problems, and also explain when and why world-models are needed for effective goal-seeking behavior.

When I think about how to formalize these sorts of questions in a way useful for embedded agency, the minimum requirements are something like:

  • need to represent physics of the underlying world
  • need to represent the cause-and-effect process which generates a map from a territory
  • need to run counterfactual queries on the map (e.g. in order to do planning)
  • need to represent sufficiently general computations to make agenty things possible

... and causal diagrams with symmetry seem like the natural class to capture all that.
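(A minimal toy of the third bullet, running a counterfactual query on a map, with the model and variables made up purely for illustration; this is not the causal-diagrams-with-symmetry formalism itself.)

```python
# Toy deterministic structural causal model (my own construction for illustration,
# not the post's formalism): each node is computed from its parents, and a
# counterfactual query clamps a node via do() and recomputes its descendants.

MODEL = {
    "season":    lambda v: "dry",                                # exogenous
    "sprinkler": lambda v: 1.0 if v["season"] == "dry" else 0.0,
    "wet_grass": lambda v: 1.0 if v["sprinkler"] else 0.0,
}
ORDER = ["season", "sprinkler", "wet_grass"]                     # topological order

def run(model, interventions=None):
    """Forward pass over the diagram; `interventions` clamps nodes (a do-operation)."""
    interventions = interventions or {}
    values = {}
    for node in ORDER:
        if node in interventions:
            values[node] = interventions[node]   # ignore the node's usual mechanism
        else:
            values[node] = model[node](values)
    return values

print(run(MODEL))                                    # factual: sprinkler on, grass wet
print(run(MODEL, interventions={"sprinkler": 0.0}))  # do(sprinkler=0): grass stays dry
```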

I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).

I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding 'agentic' models from this framework is:

  1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
  2. If we believe then things which generalize very competently are l
... (read more)
3[anonymous]
In light of this exchange, it seems like it would be interesting to analyze how much arguments for problematic properties of superintelligent utility-maximizing agents (like instrumental convergence) actually generalize to more general well-generalizing systems.
6romeostevensit
(black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary, in the sense that you can think of a thermostat as optimizing the environment or of the environment as optimizing the thermostat. Collapsing sensor/control duality for at least some types of feedback circuits.