LESSWRONG
LW

All of jsd's Comments + Replies

Win/continue/lose scenarios and execute/replace/audit protocols

jsd5moΩ990

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black box adversarial examples setting.

4Buck5mo

Thanks, this is closely related!

Fabien's Shortform

jsd9mo45

Well that was timely

Scaling of AI training runs will slow down after GPT-5

jsd1y96

Amazon recently bought a 960MW nuclear-powered datacenter.

I think this doesn't contradict your claim that "The largest seems to consume 150 MW" because the 960MW datacenter hasn't been built (or there is already a datacenter there but it doesn't consume that much energy for now)?

The Best Tacit Knowledge Videos on Every Subject

jsd1y30

Related: Film Study for Research

The Best Tacit Knowledge Videos on Every Subject

jsd1y20

Domain: Mathematics

Link: vEnhance

Person: Evan Chen

Background: math PhD student, math olympiad coach

Why: Livestreams himself thinking about olympiad problems

5Neel Nanda1y

Oh nice, I didn't know Evan had a YouTube channel. He's one of the most renowned olympiad coaches and seems highly competent

1Parker Conley1y

Added, thanks! (x2)

The Best Tacit Knowledge Videos on Every Subject

jsd1y20

Domain: Mathematics

Link: Thinking about math problems in real time

Person: Tim Gowers

Background: Fields medallist

Why: Livestreams himself thinking about math problems

1Parker Conley1y

Added, thanks!

Scenario Forecasting Workshop: Materials and Learnings

jsd1y*10

From the Rough Notes section of Ajeya's shared scenario:

Meta and Microsoft ordered 150K GPUs each, big H100 backlog. According to Lennart's BOTECs, 50,000 H100s would train a model the size of Gemini in around a month (assuming 50% utilization)

Just to check my understanding, here's my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites a... (read more)

3elifland1y

FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.

Some heuristics I use for deciding how much I trust scientific results

jsd1y10

Thanks, I think this is a useful post, I also use these heuristics.

I recommend Andrew Gelman’s blog as a source of other heuristics. For example, the Piranha problem and some of the entries in his handy statistical lexicon.

jsd's Shortform

jsd1y10

Mostly I care about this because if there's a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we'll be in roughly this situation for "near-human-level" systems.

I don't think I've seen any research about cross-instance similarity

I think mode-collapse (update) is sort of an example.

How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about "all humans", vs "repres

... (read more)

jsd's Shortform

jsd1y10

I'm not even sure what it would mean for a non-instantiated model without input to do anything.

For goal-directedness, I'd interpret it as "all instances are goal-directed and share the same goal".

As an example, I wish Without specific countermeasures had made the distinction more explicit.

More generally, when discussing whether a model is scheming, I think it's useful to keep in mind worlds where some instances of the model scheme while others don't.

2Dagon1y

I don't think I've seen any research about cross-instance similarity, or even measuring the impact of instance-differences (including context and prompts) on strategic/goal-oriented actions. It's an interesting question, but IMO not as interesting as "if instances are created/selected for their ability to make and execute long-term plans, how do those instances behave". How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about "all humans", vs "representative instances"?

jsd's Shortform

jsd1y10

When talking about AI risk from LLM-like models, when using the word "AI" please make it clear whether you are referring to:

A model
An instance of a model, given a prompt

For example, there's a big difference between claiming that a model is goal-directed and claiming that a particular instance of a model given a prompt is goal-directed.

I think this distinction is obvious and important but too rarely made explicit.

2Dagon1y

Can you give a few examples where it's both confusing and important? Almost all concrete experiments and examples I've seen are the latter (an instance with a context and prompt(s)), because that's really the point of interaction and existence for LLMs. I'm not even sure what it would mean for a non-instantiated model without input to do anything.

How do you feel about LessWrong these days? [Open feedback thread]

Answer by jsdDec 06, 2023*90

Here are the Latest Posts I see on my front page and how I feel about them (if I read them, what I remember, liked or disliked, if I didn't read them, my expectations and prejudices)

Shallow review of live agendas in alignment & safety: I think this is a pretty good overview, I've heard that people in the field find these useful. I haven't gotten much out of it yet, but I will probably refer to it or point others to it in the future. (I made a few very small contributions to the post)
Social Dark Matter: I read this a week or so ago. I think I remember t

jsd1y30

According to SemiAnalysis in July:

OpenAI regularly hits a batch size of 4k+ on their inference clusters, which means even with optimal load balancing between experts, the experts only have batch sizes of ~500. This requires very large amounts of usage to achieve.

Our understanding is that OpenAI runs inference on a cluster of 128 GPUs. They have multiple of these clusters in multiple datacenters and geographies. The inference is done at 8-way tensor parallelism and 16-way pipeline parallelism. Each node of 8 GPUs has only ~130B parameters, or less tha

jsd1y31

I'm grateful for this post: it gives simple concrete advice that I intend to follow, and that I hadn't thought of. Thanks.

TurnTrout's shortform feed

jsd1y10

For onlookers, I strongly recommend Gabriel Peyré and Marco Cuturi's online book Computational Optimal Transport. I also think this is a case where considering discrete distributions helps build intuition.

AI Timelines

jsd1y63

As previously discussed a couple times on this website

For context, Daniel wrote Is this a good way to bet on short timelines? (which I didn't know about when writing this comment) 3 years ago.

HT Alex Lawsen for the link.

AI Timelines

jsd1y*20

@Daniel Kokotajlo what odds would you give me for global energy consumption growing 100x by the end of 2028? I'd be happy to bet low hundreds of USD on the "no" side.

ETA: to be more concrete I'd put $100 on the "no" side at 10:1 odds but I'm interested if you have a more aggressive offer.

3Daniel Kokotajlo1y

As previously discussed a couple times on this website, it's not rational for me to make bets on my beliefs about these things. Because I either won't be around to collect if I win, or won't value the money nearly as much. And because I can get better odds on the public market simply by taking out a loan.

Thoughts on open source AI

jsd1yΩ253

If they are right then this protocol boils down to “evaluate, then open source.” I think there are advantages to having a policy which specializes to what AI safety folks want if AI safety folks are correct about the future and specializes to what open source folks want if open source folks are correct about the future.

In practice, arguing that your evaluations show open-sourcing is safe may involve a bunch of paperwork and maybe lawyer fees. If so, this would be a big barrier for small teams, so I expect open-source advocates not to be happy with such a trajectory.

1Shankar Sivarajan1y

Isn't that the point of this exercise?

On Frequentism and Bayesian Dogma

jsd1y10

I'd be interested in @Radford Neal's take on this dialogue (context).

6Radford Neal1y

OK. My views now are not far from those of some time ago, expressed at https://glizen.com/radfordneal/res-bayes-ex.html With regard to machine learning, for many problems of small to moderate size, some Bayesian methods, such as those based on neural networks or mixture models that I've worked on, are not just theoretically attractive, but also practically superior to the alternatives. This is not the case for large-scale image or language models, for which any close approximation to true Bayesian inference is very difficult computationally. However, I think Bayesian considerations have nevertheless provided more insight than frequentism in this context. My results from 30 years ago showing that infinitely-wide neural networks with appropriate priors work well without overfitting have been a better guide to what works than the rather absurd discussions by some frequentist statisticians of that time about how one should test whether a network with three hidden units is sufficient, or whether instead the data justifies adding a fourth hidden unit. Though as commented above, recent large-scale models are really more a success of empirical trial-and-error than of any statistical theory. One can also look at Vapnik's frequentist theory of structural risk minimization from around the same time period. This was widely seen as justifying use of support vector machines (though as far as I can tell, there is no actual formal justification), which were once quite popular for practical applications. But SVMs are not so popular now, being perhaps superceded by the mathematically-related Bayesian method of Gaussian process regression, whose use in ML was inspired by my work on infinitely-wide neural networks. (Other methods like boosted decision trees may also be more popular now.) One reason that thinking about Bayesian methods can be fruitful is that they involve a feedback process: 1. Think about what model is appropriate for your problem, and what prior for its parame

unRLHF - Efficiently undoing LLM safeguards

jsd1yΩ220

I'd be curious about how much more costly this attack is on LMs Pretrained with Human Preferences (including when that method is only applied to "a small fraction of pretraining tokens" as in PaLM 2).

Thomas Kwa's MIRI research experience

jsd1y50

I don't have the energy to contribute actual thoughts, but here are a few links that may be relevant to this conversation:

Sequencing is the new microscope, by Laura Deming
On whether neuroscience is primarily data, rather than theory, bottlenecked:
- Could a neuroscientist understand a microprocessor?, by Eric Jonas
- This footnote on computational neuroscience, by Jascha Sohl-Dickstein

Response to Quintin Pope's Evolution Provides No Evidence For the Sharp Left Turn

jsd1y102

Quinton’s

His name is Quintin (EDIT: now fixed)

6Zvi1y

Man, I was thinking to myself the whole time 'do NOT mess up this name, it's Quintin.' Glad it got fixed.

Answer by jsdSep 27, 202320

You may want to check out Benchmarks for Detecting Measurement Tampering:

Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with measurements that are used for computing rewards, they possess important information that the overseer doesn't have (namely, that the measurements have been tampered with). Conversely, if we can robustly elicit an AI's knowledge of whether the measurements have been tampered with, then we could train the AI to avoid measurement tampering. In

... (read more)

1Oliver Daniels2y

Thanks! Hadn't seen that

Notes on Teaching in Prison

jsd2y11

Thank you Ruby. Two other posts I like that I think fit this category are A Brief Introduction to Container Logistics and What it's like to dissect a cadaver.

Notes on Teaching in Prison

jsd2y100

How did you end up doing this work? Did you deliberately seek it out?

I went to a French engineering school which is also a military school. During the first year (which corresponds to junior year in US undergrad), each student typically spends around six months in an armed forces regiment after basic training.

Students get some amount of choice of where to spend these six months among a list of options, and there are also some opportunities outside of the military: these include working as a teaching assistant in some high schools, working for s... (read more)

Notes on Teaching in Prison

jsd2y41

How is it logistically possible for the guards to go on strike?
Who was doing all the routine work of operating cell doors, cameras, and other security facilities?

This is a good question. In France, prison guards are not allowed to strike (like most police, military, and judges). At the time, the penitentiary administration asked for sanctions against guards who took part in the strike, but I think most were not applied because there was a shortage of guards.

In practice, guards were replaced by gendarmes, and work was reduced to the basics... (read more)

When To Stop

jsd2y20

You may enjoy: https://arxiv.org/abs/2207.08799

Utilitarianism Meets Egalitarianism

jsd2y20

I was a little bit confused about Egalitarianism not requiring (1). As an egalitarian, you may not need a full distribution over who you could be, but you do need the support of this distribution, to know what you are minimizing over?

Please don't throw your mind away

jsd2y100

Thanks for this. I’ve been thinking about what to do, as well as where and with whom to live over the next few years. This post highlights important things missing from default plans.

It makes me more excited about having independence, space to think, and a close circle of trusted friends (vs being managed / managing, anxious about urgent todos, and part of a scene).

I’ve spent more time thinking about math completely unrelated to my work after reading this post.

The theoretical justifications are more subtle, and seem closer to true, than previous justificat... (read more)

3TsviBT2y

Good to hear. Definitely. A second-order hope I have is to make more space available for people to be more head-down intense about urgent thinking. The idea being: Alice wants to be heads-down intense about urgent thinking. She does so. Then Bob sees Alice, and feels pressured to also be intense in that way. When Bob tries to be intense in that way, he throws away his mind, and that's not good for him. He doesn't fully understand what's going wrong, but he knows that what's going wrong has something to do with him being pressured to be intense. He correctly identifies that the pressure is partly caused by Alice (whether or not it's Alice who really ought to change her behavior). Not understanding the situation in detail, Bob only has blunt actions available; and not having an explicit justification for pushing back against some pressure he feels, he doesn't push back explicitly, out in the open, but instead puts some of his force toward implicitly pressuring Alice to not be so intense. If Bob were more able to defend important things from implicit pressure that he feels, he's less pushed to pressure Alice to not be intense. And so Alice is more freed to be intense, as is suitable for her. (This is fairly theoretical, but would explain some of my experiences.)

On sincerity

jsd2y10

Thanks for this comment, I found it useful.

What did you want to write at the end of the penultimate paragraph?

2paulfchristiano2y

Thanks for pointing that out (I cut the hanging sentence)

Six (and a half) intuitions for KL divergence

jsd2y90

Thanks for this post! Relatedly, Simon DeDeo had a thread on different ways the KL-divergence pops up in many fields:

Kullback-Leibler divergence has an enormous number of interpretations and uses: psychological, epistemic, thermodynamic, statistical, computational, geometrical... I am pretty sure I could teach an entire graduate seminar on it.
Psychological: an excellent predictor of where attention is directed. http://ilab.usc.edu/surprise/
Epistemic: a normative measure of where you ought to direct your experimental efforts (maximize expected model-breakin

... (read more)

1CallumMcDougall2y

This is awesome, I love it! Thanks for sharing (-:

The case for becoming a black-box investigator of language models

jsd3y10

Some of the most interesting black box investigations I've found are Riley Goodside's.

jsd's Shortform

jsd3y20

A few ways that StyleGAN is interesting for alignment and interpretability work:

It was much easier to interpret than previous generative models, without trading off image quality.
It seems like an even better example of "capturing natural abstractions" than GAN Dissection, which Wentworth mentions in Alignment By Default.
- First, because it's easier to map abstractions to StyleSpace directions than to go through the procedure in GAN Dissection.
- Second, the architecture has 2 separate ways of generating diverse data: changing the style vectors, or adding

... (read more)

jsd's Shortform

jsd3y40

I've been thinking about these two quotes from AXRP a lot lately:

From Richard Ngo's interview:

Richard Ngo: Probably the main answer is just the thing I was saying before about how we want to be clear about where the work is being done in a specific alignment proposal. And it seems important to think about having something that doesn’t just shuffle the optimization pressure around, but really gives us some deeper reason to think that the problem is being solved. One example is when it comes to Paul Christiano’s work on amplification, I think one core insi

... (read more)

1Michaël Trazzi3y

Great quotes. Posting podcast excerpts is underappreciated. Happy to read more of them.

[Link] A minimal viable product for alignment

jsd3y10

Your link redirects back to this page. The quote is from one of Eliezer's comments in Reply to Holden on Tool AI.

2paulfchristiano3y

Thanks, fixed.

[ASoT] Some thoughts about deceptive mesaoptimization

jsd3y60

It's an example first written about by Paul Christiano here (at the beginning of Part III).

The idea is this: suppose we want to ensure that our model has acceptable behavior even in worst-case situations. One idea would be to do adversarial training: at every step during training, train an adversary model to find inputs on which the model behaves unacceptably, and penalize the model accordingly.

If the adversary is able to uncover all the worst-case inputs, this penalization ensures we end up with a model that behaves acceptably on all inputs.

RSA-2048... (read more)

1TLW3y

Right. And it's not even sufficient to set up an adversary that goes 'the model behaves badly if you satisfy X condition', because there's no guarantee that X condition is actually satisfiable. You could ask the adversary to come up with an existence proof... but then the agent could do e.g. 'do X if you see something with SHA512 hash of all zeroes', which is strongly conjectured to exist but isn't proven.

2Quintin Pope3y

Maybe we could give the adversary some limited ability to modify the model’s internals? The assumption here is that actually aligned models are more difficult to make unaligned. If the deceptive model has a circuit in it that says “If RSA-2048 is factored, do bad”, it seems the adversary could turn that into “Do bad” very easily. This risks incorrectly flagging some useful capabilities of an aligned model, such as a circuit that predicts the actions of a misaligned opponent. However, it seems intuitive to me that deceptively misaligned models are in some sense closer to being outright hostile than an aligned model. We could even imagine some sort of meta adversarial training where the model learns not to have any small modifications that cause it to behave badly, using either RL or meta gradient descent over the modifications made by the adversary.

The Best Software For Every Need

jsd3y40

Software: streamlit.io

Need: making small webapps to display or visualize results

Other programs I've tried: R shiny, ipywidgets

I find streamlit extremely simple to use, it interoperates well with other libraries (eg pandas or matplotlib), the webapps render well and are easy to share, either temporarily through ngrok, or with https://share.streamlit.io/.

Some thoughts on why adversarial training might be useful

jsd3y30

Another way adversarial training might be useful, that's related to (1), is that it may make interpretability easier. Given that it weeds out some non-robust features, the features that remain (and the associated feature visualizations) tend to be clearer, cf e.g. Adversarial Robustness as a Prior for Learned Representations. One example of people using this is Leveraging Sparse Linear Layers for Debuggable Deep Networks (blog post, AN summary).

The above examples are from vision networks - I'd be curious about similar phenomena when adversarially training ... (read more)

Emergent modularity and safety

jsd3yΩ340

Relevant related work : NNs are surprisingly modular
https://arxiv.org/abs/2003.04881v2?ref=mlnew

I believe Richard linked to Clusterability in Neural Networks, which has superseded Pruned Neural Networks are Surprisingly Modular.

The same authors also recently published Detecting Modularity in Deep Neural Networks.

3DanielFilan3y

It's true! Altho I think of putting something up on arXiv as a somewhat lower bar than 'publication' - that paper has a bit of work left.

Emergent modularity and safety

jsd3y50

On one hand, Olah et al.’s (2020) investigations find circuits which implement human-comprehensible functions.

At a higher level, they also find that different branches (when the modularity is enforced already by the architecture) tend to contain different features.

18 possible meanings of "I Like Red"

jsd4y70

Another meaning could be: I want to raise the salience of the issue ‘Red vs Not Red’, I want to convey that ‘Red vs Not Red’ is an underrated axis. I think this is also an example of level 4?