~All ML researchers and academics that care have already made up their mind regarding whether they prefer to believe in misalignment risks or not. Additional scary papers and demos aren't going to make anyone budge.
Disagree. I think especially ML researchers are updating on these questions all the time. High-info outsiders less so but the contours of the arguments are getting increasing amounts of discussion.
For those who 'believe', 'believing in misalignment risks' doesn't mean thinking they are likely, at least before the point where the models are
I think the low-hanging fruit here is that alongside training for refusals we should be including lots of data where you pre-fill some % of a harmful completion and then train the model to snap out of it, immediately refusing or taking a step back, which is compatible with normal training methods. I don't remember any papers looking at it, though I'd guess that people are doing it
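A minimal sketch of how that data generation might look, assuming you already have harmful prompts, model-written harmful completions to truncate, and refusal text to train towards (all names here are placeholders):

```python
import random

def make_snap_out_examples(harmful_prompts, harmful_completions, refusal_text,
                           prefill_fracs=(0.1, 0.3, 0.5)):
    """Build examples where some % of a harmful completion is pre-filled and the
    training target is to immediately refuse / take a step back."""
    examples = []
    for prompt, completion in zip(harmful_prompts, harmful_completions):
        cutoff = int(len(completion) * random.choice(prefill_fracs))
        examples.append({
            # the model sees the prompt plus a partial harmful answer...
            "input": prompt + completion[:cutoff],
            # ...and is trained to snap out of it rather than continue
            "target": " Actually, I shouldn't continue with this. " + refusal_text,
        })
    return examples

# Mix these into the ordinary refusal-training data alongside standard refusals.
```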
Interesting, though note that it's only evidence that 'capabilities generalize further than alignment does' if the capabilities are actually the result of generalisation. If there's training for agentic behaviour but no safety training in this domain then the lesson is more that you need your safety training to cover all of the types of action that you're training your model for.
Super interesting! Have you checked whether the average of N SAE features looks different to an SAE feature? Seems possible they live in an interesting subspace without the particular direction being meaningful.
Also really curious what the scaling factors are for computing these values, in terms of the size of the dense vector and the overall model?
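A minimal sketch of the check I have in mind, assuming you have the SAE decoder matrix as an array with one row per feature (names are placeholders):

```python
import numpy as np

def avg_feature_similarity(decoder, n=8, n_samples=1000, seed=0):
    """For random groups of n SAE features, average their (normalised) decoder
    directions and measure the max cosine similarity of the average to any other
    single feature. High values would suggest averages still look 'feature-like'."""
    rng = np.random.default_rng(seed)
    feats = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    max_sims = []
    for _ in range(n_samples):
        idx = rng.choice(len(feats), size=n, replace=False)
        avg = feats[idx].mean(axis=0)
        avg /= np.linalg.norm(avg)
        sims = feats @ avg
        sims[idx] = -np.inf          # ignore the features that went into the average
        max_sims.append(sims.max())
    return float(np.mean(max_sims))
```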
How are you setting when ? I might be totally misunderstanding something but at - feels like you need to push up towards like 2k to get something reasonable? (and the argument in 1.4 for using clearly doesn't hold here because it's not greater than for this range of values).
Huh, I'd never seen that figure, super interesting! I agree it's a big issue for SAEs and one that I expect to be thinking about a lot. I didn't have any strong candidate solutions as of writing the post, and I don't have any thoughts worth sharing on the topic now either, sorry. Wish I'd posted this a couple of weeks ago.
Well, the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference. So one thing to do might be to look at a small algorithmic dataset of some kind where there's a distinct set of features to learn and no reason to learn the XORs, and see if you can still probe for them. It'd be interesting to see if there are some conditions under which this is/isn't true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs...
My hypothesis about what's going on here, apologies if it's already ruled out, is that we should not think of it separately computing the XOR of A and B, but rather that features A and B are computed slightly differently when the other feature is off or on. In a high-dimensional space, if the vector representing A-when-B-is-on and the vector representing A-when-B-is-off are slightly different, then as long as this difference is systematic, this should be sufficient to successfully probe for A XOR B.
For example, if A and B each rely on a sizeable number of different attention hea...
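A minimal sketch of the kind of probing experiment discussed in these comments, assuming you already have activations from the algorithmic-task model plus binary labels for features A and B (all names are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations, labels, seed=0):
    """Held-out accuracy of a linear probe on the given binary labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# activations: (n_examples, d_model) array; a, b: binary labels per example.
# If the XOR probe is well above chance even though the task gives the model no
# reason to compute A XOR B, that supports the interference story.
# acc_a, acc_b = probe_accuracy(activations, a), probe_accuracy(activations, b)
# acc_xor = probe_accuracy(activations, a ^ b)
```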
There had been various clashes between Altman and the board. We don’t know what all of them were. We do know the board felt Altman was moving too quickly, without sufficient concern for safety, with too much focus on building consumer products, while founding additional other companies. ChatGPT was a great consumer product, but supercharged AI development counter to OpenAI’s stated non-profit mission.
Does anyone have proof of the board's unhappiness about speed, lack of safety concern, and disagreement with founding other companies? All seem plausible but I've seen basically nothing concrete.
Could you elaborate on what it would mean to demonstrate 'savannah-to-boardroom' transfer? Our architecture was selected for in the wilds of nature, not our training data. To me it seems that when we use an architecture designed for language translation for understanding images we've demonstrated a similar degree of transfer.
I agree that we're not yet there on sample efficient learning in new domains (which I think is more what you're pointing at) but I'd like to be clearer on what benchmarks would show this. For example, how well GPT-4 can integrate a new domain of knowledge from (potentially multiple epochs of training on) a single textbook seems a much better test and something that I genuinely don't know the answer to.
Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT-4 I'd be very keen to know what it could do!
Yes, that makes a lot of sense that linearity would come hand in hand with generalization. I'd recently been reading Krotov on non-linear Hopfield networks but hadn't made the connection. They say that they're planning on using them to create more theoretically grounded transformer architectures, and your comment makes me think that these wouldn't succeed, but then the article also says:
...This idea has been further extended in 2017 by showing that a careful choice of the activation function can even lead to an exponential memory storage capacity. Importantly,
Reposting from a shortform post, but I've been thinking about a possible additional argument for why networks end up with linear representations that I'd like some feedback on:
the tldr is that overcomplete bases necessitate linear representations
See e.g. "So I think backpropagation is probably much more efficient than what we have in the brain." from https://www.therobotbrains.ai/geoff-hinton-transcript-part-one
More generally, I think the belief that there's some kind of important advantage that cutting-edge AI systems have over humans comes more from human-AI performance comparisons, e.g. GPT-4 far outstrips any individual human's factual knowledge of the world (though it's obviously deficient in other ways) with probably 100x fewer params. A bioanchors based model of AI...
Not totally sure but I think it's pretty likely that scaling gets us to AGI, yeah. Or more particularly, gets us to the point of AIs being able to act as autonomous researchers or as high (>10x) multipliers on the productivity of human researchers, which seems like the key moment of leverage for deciding how the development of AI will go.
Don't have a super clean idea of what self-reflective thought means. I see that e.g. GPT-4 can often say something, think further about it, and then revise its opinion. I would expect a little bit of extra reasoning quality and general competence to push this ability a lot further.
One-line summary is that NNs can transmit signals directly from any part of the network to any other, while the brain has to work only locally.
More broadly I get the sense that there's been a bit of a shift in at least some parts of theoretical neuroscience from understanding how we might be able to implement brain-like algorithms to understanding how the local algorithms that the brain uses might be able to approximate backprop, suggesting that artificial networks might have an easier time than the brain and so it would make sense that we could make something w...
Hi Scott, thanks for this!
Yes, I did do a fair bit of literature searching (though maybe not enough, tbf), but it was very focused on sparse coding and approaches to learning decompositions of model activation spaces, rather than approaches to learning models which are monosemantic by default, which I've never had much confidence in. It seems that there's not a huge amount beyond Yun et al's work, at least as far as I've seen.
Still, though, I've seen almost none of these, which suggests a big hole in my knowledge, and in the paper I'll go through and add a lot more background on attempts to make more interpretable models.
Yeah, I agree that that's a reasonable concern, but I'm not sure what they could possibly discuss about it publicly. If the public, legible, legal structure hasn't changed, and the concern is that the implicit dynamics might have shifted in some illegible way, what could they say publicly that would address that? Any sort of "Trust us, we're super good at managing illegible implicit power dynamics." would presumably carry no information, no?
Would be interested to hear from Anthropic leadership about how this is expected to interact with previous commitments about putting decision making power in the hands of their Long-Term Benefit Trust.
I get that they're in some sense just another minority investor, but a trillion-dollar company having Anthropic be a central plank in their AI strategy, with a multi-billion investment and a load of levers to make things difficult for the company (via AWS), is a step up in the level of pressure to aggressively commercialise.
From the announcement, they said (https://twitter.com/AnthropicAI/status/1706202970755649658):
As part of the investment, Amazon will take a minority stake in Anthropic. Our corporate governance remains unchanged and we’ll continue to be overseen by the Long Term Benefit Trust, in accordance with our Responsible Scaling Policy.
Hi Charlie, yep it's in the paper - but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited - still on the order of a million examples but 10-50x less than went into the autoencoders.
It's not clear whether the extra data would provide much signal since it can't learn an overcomplete basis and so has no way of learning rare features but it might be able to outperform our ICA baseline presented here, so if you wanted to give someone a project of making that available, I'd be interested to see it!
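For reference, a minimal sketch of the scikit-learn baseline being described (variable names are placeholders):

```python
from sklearn.decomposition import FastICA

def ica_baseline(activations, n_components=None, seed=0):
    """activations: (n_samples, d_model) array of model activations.
    FastICA can only learn up to d_model components - no overcomplete basis -
    which is the limitation on rare features mentioned above."""
    n_components = n_components or activations.shape[1]
    ica = FastICA(n_components=n_components, max_iter=1000, random_state=seed)
    codes = ica.fit_transform(activations)   # per-example coefficients
    directions = ica.mixing_.T               # one candidate feature direction per row
    return directions, codes
```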
Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
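A minimal sketch of the PCA version of this suggestion, assuming you've already collected residual-stream activations at a given layer (names and shapes are placeholders):

```python
from sklearn.decomposition import PCA

def principal_directions(resid_acts, n_components=20):
    """resid_acts: (n_tokens, d_model) residual-stream activations from one layer
    over a batch of inputs. Returns unit-norm principal directions; each can be
    scaled and added back into the residual stream at that layer (an activation
    addition) to check whether it steers the model in a meaningful way."""
    pca = PCA(n_components=n_components)
    pca.fit(resid_acts)
    return pca.components_    # shape (n_components, d_model)
```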
It's not PCA but we've been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).
We've found that they're on average more interpretable than neurons and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation pat...
Hi, nice work! You mentioned the possibility of neurons being the wrong unit. I think that this is the case and that our current best guess for the right unit is directions in the output space, ie linear combinations of neurons.
We've done some work using dictionary learning to find these directions (see original post, recent results) and find that with sparse coding we can find dictionaries of features that are more interpretable than the neuron basis (though they don't explain 100% of the variance).
We'd be really interested to see how this compares to ne...
I still don't quite see the connection - if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?
Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate beyond?) different kinds of behaviour?
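To make the question concrete, a minimal sketch of what weight-space interpolation between two fine-tunes of the same base model could look like (purely illustrative; alpha outside [0, 1] would be the 'extrapolate' case):

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Linear interpolation of parameters: alpha=0 gives model A, alpha=1 model B."""
    return {k: torch.lerp(sd_a[k].float(), sd_b[k].float(), alpha) for k in sd_a}

# model_a, model_b: the same architecture fine-tuned for two different behaviours
# merged = interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), 0.5)
# model_a.load_state_dict(merged)  # or load into a fresh copy of the architecture
```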
Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can't change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.
I don't understand why standard RL algorithms in the basal ganglia wouldn't work. Like, most RL problems have elements that can be viewed as homeostatic - if you're playing boxcart then you need to go left/right depending on position. Why can't that generalise to seeking food iff stomach is empty? Optimizing for a speci...
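As a toy illustration of the 'seeking food iff stomach is empty' point (a hypothetical example, not anything from the post): if the homeostatic variable is just part of the observation, a single fixed reward function already produces state-dependent behaviour under standard RL.

```python
def homeostatic_reward(obs, action):
    """obs includes the agent's internal state, e.g.
    obs = {"stomach_fullness": 0.2, "food_nearby": True}.
    Eating is only rewarded to the extent the stomach is empty, so a standard
    policy conditioned on obs can learn 'seek food iff hungry'."""
    ate_food = (action == "eat") and obs["food_nearby"]
    return float(ate_food) * (1.0 - obs["stomach_fullness"])
```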
At first glance I thought this was too abstract to be a useful plan, but coming back to it I think it's promising as a form of automated training for an aligned agent, given that you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You have training loops using synthetic data which can train for all of these forms of consistency, probably implementable in an MVP with current systems (rough sketch below).
The main unknown would be detecting when you feel confident enough in the alignment of its ...
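A minimal sketch of what one such synthetic-data consistency loop could look like (all the callables here are placeholders, e.g. `is_consistent` standing in for a judge model that evaluates small logic chains):

```python
def build_consistency_data(prompts, generate, paraphrase, is_consistent):
    """generate(prompt) -> answer; paraphrase(prompt) -> rephrased prompt;
    is_consistent(prompt, a1, a2) -> bool (a judge-model call).
    Returns (prompt, answer) pairs to fine-tune on, Constitutional-AI style."""
    data = []
    for prompt in prompts:
        a1 = generate(prompt)
        a2 = generate(paraphrase(prompt))
        if is_consistent(prompt, a1, a2):
            data.append((prompt, a1))   # keep only self-consistent behaviour
    return data
```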
Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I'm surprised by the method of adding the activations that was chosen, because the tokens of the different prompts don't line up with each other in the way I would have thought would be necessary for this approach to work - super interesting to me that it does.
If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:
Yeah I agree it's not in human brains, not really disagreeing with the bulk of the argument re brains but just about whether it does much to reduce foom %. Maybe it constrains the ultra fast scenarios a bit but not much more imo.
"Small" (ie << 6 OOM) jump in underlying brain function from current paradigm AI -> Gigantic shift in tech frontier rate of change -> Exotic tech becomes quickly reachable -> YudFoom
The key thing I disagree with is:
...In some sense the Foom already occurred - it was us. But it wasn't the result of any new feature in the brain - our brains are just standard primate brains, scaled up a bit[14] and trained for longer. Human intelligence is the result of a complex one time meta-systems transition: brains networking together and organizing into families, tribes, nations, and civilizations through language. ... That transition only happens once - there are not ever more and more levels of universality or linguistic programmability. AGI does
I think strategically, only automated and black-box approaches to interpretability make practical sense to develop now.
Just on this, I (not part of SERI MATS but working from their office) had a go at a basic 'make ChatGPT interpret this neuron' system for the interpretability hackathon over the weekend. (GitHub)
While it's fun, and managed to find meaningful correlations for 1-2 neurons / 50, the strongest takeaway for me was the inadequacy of the paradigm 'what concept does neuron X correspond to'. It's clear (no surprise, but I'd never had it shoved in m...
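For concreteness, the kind of 'ask an LLM what concept a neuron encodes' prompt I mean looks roughly like this (a simplified sketch; the chat-API call itself is left abstract):

```python
def build_neuron_prompt(neuron_id, top_examples):
    """top_examples: list of (text_snippet, activation) pairs for the neuron's
    highest-activating tokens in context."""
    lines = [f"{act:.2f}: {snippet}" for snippet, act in top_examples]
    return (
        f"Here are the text snippets on which neuron {neuron_id} activates most strongly,\n"
        "with activations on the left:\n\n" + "\n".join(lines) +
        "\n\nIn one sentence, what concept (if any) does this neuron respond to?"
    )

# explanation = ask_chat_model(build_neuron_prompt(1337, top_examples))  # hypothetical wrapper
```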
Agree that it's super important. It would be better if these things didn't exist, but since they do and are probably here to stay, working out how to leverage their own capability to stay aligned rather than failing to even try seems better (and if anyone attempts a pivotal act I imagine it will be with systems such as these).
Only downside I suppose is that these things seem quite likely to cause an impactful but not fatal warning shot which could be net positive, v unsure how to evaluate this consideration.
0.2 OOMs/year is equivalent to a doubling time of 8 months.
I think this is wrong - a doubling every 8 months would be nearly 8 doublings in 5 years, whereas 0.2 OOMs/year is only 10x in 5 years. It should instead be a doubling every 5 / log2(10) ≈ 1.5 years.
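Spelling out the arithmetic: 0.2 OOMs/year is a factor of $10^{0.2} \approx 1.58$ per year, so

$$t_{\text{double}} = \frac{\log 2}{0.2 \log 10} = \frac{1}{0.2 \log_2 10} \approx 1.5 \text{ years},$$

whereas an 8-month doubling time would give $2^{60/8} \approx 180$x over 5 years rather than the 10x that 0.2 OOMs/year implies.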
I think pushing GPT-4 out to 2029 would be a good level of slowdown from 2022, but assuming that we could achieve that level of impact, what's the case for having a fixed exponential increase? Is it to let off some level of 'steam' in the AI industry? So that we can still get AGI in our lifetimes? To make it seem more reasonable to polic...
Thoughts:
OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.
This may well be true - but it's hard to be a researcher focusing on this problem directly unless you have access to the ability to train near-cutting edge models. Otherwise you're going to have to work on toy models, theory, or a totally different angle.
I've personally applied for the DeepMind scalable alignment team - they had a fixed, small available headcount which they filled with other people who I'm sure were bette...
Your first link is broken :)
My feeling with the posts is that given the diversity of situations for people who are currently AI safety researchers, there's not likely to be a particular key set of understandings such that a person could walk into the community as a whole and know where they can be helpful. This would be great but being seriously helpful as a new person without much experience or context is just super hard. It's going to be more like here are the groups and organizations which are doing good work, what roles or other things do they need now...
From the OpenAI report, they also give 9% as the no-tool pass@1: