All of Jacob Pfau's Comments + Replies

Answer by Jacob Pfau32

To apply METR's law we should distinguish conceptual alignment work from well-defined alignment work (including empirics and theory on existing conjectures). The METR plot doesn't tell us anything quantitative about the former.

As for the latter, let's take interpretability as an example: we can model uncertainty as a distribution over the time-horizon needed for interpretability research, e.g. ranging over 40-1000 hours. Then I get a 66% CI of 2027-2030 for open-ended interp research automation (colab here). I've written up more details on this in a post here... (read more)
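
Here's a minimal Monte-Carlo sketch of the kind of calculation I mean (this is not the linked colab; the current-horizon, doubling-time, and start-year numbers below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumption: time-horizon needed for open-ended interp research,
# log-uniform over 40-1000 hours.
required_hours = np.exp(rng.uniform(np.log(40), np.log(1000), n))

# Assumption: METR-style trend -- roughly 1-hour horizons in early 2025,
# doubling every ~7 months, with uncertainty on the doubling time.
current_horizon_hours = 1.0
doubling_months = rng.normal(7, 2, n).clip(3, 15)

months_needed = doubling_months * np.log2(required_hours / current_horizon_hours)
automation_year = 2025.0 + months_needed / 12

lo, hi = np.percentile(automation_year, [17, 83])  # central 66% interval
print(f"66% CI for automation year: {lo:.1f}-{hi:.1f}")
```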

I'd defend a version of claim (1): My understanding is that, to a greater extent than anywhere else, top French students wanting to concentrate in STEM subjects must take rigorous math coursework from 18-20. In my one-year experience in the French system, I also felt that there was a greater cultural weight and institutionalized preference (via course requirements and choice of content) for theoretical topics in ML compared to US universities.

I know little about ENS, but somewhat doubt that it's so significantly different an experience from its US/UK counterparts.

2Alexander Gietelink Oldenziel
Certainly for many/most other subjects the French system is not so good. E.g. for ML all that theory is mostly a waste.

AI is 90% of their (quality adjusted) useful work force

This is intended to compare to 2023/AI-unassisted humans, correct? Or is there some other way of making this comparison you have in mind?

7ryan_greenblatt
Yes, "Relative to only having access to AI systems publicly available in January 2023." More generally, I define everything more precisely in the post linked in my comment on "AI 10x’ing labor for AI R&D".

I see the command economy point as downstream of a broader trend: as technology accelerates, negative public externalities will increasingly scale and present irreversible threats (x-risks, but also more mundane pollution, errant bio-engineering plague risks etc.). If we condition on our continued existence, there must've been some solution to this which would look like either greater government intervention (command economy) or a radical upgrade to the coordination mechanisms in our capitalist system. Relevant to your power entrenchment claim: both of the... (read more)

Two guesses on what's going on with your experiences:

  1. You're asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (There are many reasons for this; one relatively easy thing to try is to break down the task, forcing the model to write down the appropriate formal reasoning before coding anything; see the sketch below this list. LMs are stubborn about not doing CoT for coding, even when it's obviously appropriate IME.)

  2. You are underspecifying your tasks (and maybe your questions

... (read more)
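
A minimal sketch of the two-stage "derive first, code second" prompting mentioned in point 1 (assumes the OpenAI Python SDK; the model name and prompts are placeholders, not a tested recipe):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumption: any reasonably strong chat model


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


task = "Implement a numerically stable log-sum-exp over the last axis of an array."

# Stage 1: force the formal reasoning, explicitly forbidding code.
derivation = ask(
    "Before writing any code, write out the mathematical definition, edge cases, "
    f"and a step-by-step plan for this task. Do NOT write code yet.\n\nTask: {task}"
)

# Stage 2: condition the coding request on the model's own derivation.
code = ask(
    f"Task: {task}\n\nYour earlier derivation:\n{derivation}\n\n"
    "Now write the final, well-documented Python implementation."
)
print(code)
```
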
kave162

In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.

I would contribute to a bounty for y'all to do this. I would like to know whether the slow progress is prompting-induced or not.

Thanks for these details. These have updated me to be significantly more optimistic about the value of spending on LW infra.

  • LW 1.0 dying for lack of mobile support is an analogous datapoint in favor of having a team ready for AI integration over the next 0-5 years.
  • The head-to-head on the site updated me towards thinking that the features I'm not sure are positive (visible footnotes in the sidebar, the AI glossary, and to a lesser extent emoji-reacts) are not part of a general trend. I will correct my original comment on this.
  • While I think the current plans for AI integration (and existing
... (read more)
Jacob Pfau*5714

I am slightly worried about the rate at which LW is shipping new features. I'm not convinced they are net positive. I see lesswrong as a clear success, but unclear user of the marginal dollar; I see lighthaven as a moderate success and very likely positive to expand at the margin.

The interface has been getting busier[1] whereas I think the modal reader would benefit from having as few distractions as possible while reading. I don't think an LLM-enhanced editor would be useful, nor am I excited about additional tutoring functionality.

I am glad to see that p... (read more)

9Raemon
I wrote some notes on how we've been working to keep UI simpler, but habryka beat me to it. Meanwhile, some thoughts re: LLM integration. I don't think we'll get to agreement within this comment margin. I think there's a lot of ways LLM integration can go wrong. I think the first-pass at the JargonBot Beta Test isn't quite right yet and I hope to fix some of that soon to make it a bit more clear what it looks like when it's working well, as proof-of-concept. But, I think LLM integration is going to be extremely important, and I want to say a bit about it. Most of what LLMs enable is entirely different paradigms of cognition, that weren't possible before. This is sort of an "inventing cars while everyone is still asking for slightly better horses, or being annoyed by the car-centric infrastructure that's starting to roll out in fits and starts. Horses worked fine, what's going on?" I think good LLM integrations make the difference between "it's exhausting and effortful to read a technical post in a domain you aren't familiar with" (and therefore, you don't bother) and "actually it's not that much harder than reading a regular post." (I think several UI challenges need to get worked out for this to work, but they are not particularly impossible UI challenges). This radically changes the game on what sort of stuff you can learn, and how quickly someone who is somewhat interested in a field can get familiar with it. You can just jump into the post that feels relevant, and have the gaps between your understanding and the cutting edge filled in automatically (instead of having to painstakingly figure out the basics of a field before you can start participating). Once this is working reliably and you actually deeply believe in it, it opens up new atomic actions that your brain can automatically consider that would previously have been too expensive to be worth it. I don't think we even need advances on current LLM-skill for this to work pretty well – LLMs aren't very
habryka*5826

Yeah, I think this concern makes a bunch of sense. 

My current model is that LW would probably die a slow death within 3-4 years if we started developing at a much slower pace than the one at which we have historically been developing. One reason is that this is exactly what happened with LW 1.0. There were mods, and the servers were kept on and bugs were being fixed, but without substantial technical development the site fell into disrepair and abandonment surprisingly quickly.

The feature development here is important in the eternal race a... (read more)


Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There's also an accompanying prediction market which might interest you.

I did not include the 'getting drunk' interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that 'getting drunk' shouldn't be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).

Would be curious to chat about this at some po... (read more)

Wow I hadn't even considered people not taking this literally

1whestler
And here I was thinking it was a metaphor. Like, they feel literally inflated? If I've been climbing and I'm tired my muscles feel weak, but not inflated. I've never felt that way before.
Jacob Pfau*124

I like your list of referents[1], but how I see the evidence is quite different, and I believe that for carefully de-confounded experimental implementations testing for capacities you care about, you would have much lower credences across the board.

By default, most tests relating to introspection, coherence, memory etc. can be passed purely behaviorally, i.e. looking at outputs only. It is conceptually possible that it could be far easier to pass such tests by developing mechanisms for using introspection/internal-state, but I see strong empirical evidence... (read more)

Answer by Jacob Pfau111

For most forms of exercise (cardio, weightlifting, HIIT etc.) there's a spectrum of default experiences people can have, from feeling a drug-like high to grindingly unpleasant. "Runner's high" is not a metaphor, and muscle pump while weightlifting can feel similarly good. I recommend experimenting to find what's pleasant for you, though I'd guess valence of exercise is, unfortunately, quite correlated across forms.

Another axis of variation is the felt experience of music. "Music is emotional" is something almost everyone can agree to, but, for some, emotional songs can be frequently tear-jerking and for others that never happens.

1Going Durden
Moreover, both the runner's high and the pump correlate very obviously with the progress of the training, both in session and in the long term. Most forms of training usually start as grindingly unpleasant, then morph into a physical pump that directly causes emotional pump, and finally go back to mild grind once the body is exhausted. With a repeatable training regimen this is easy to notice. For example, my runs are almost always 5km distance, and the "emotional high" lasts pretty much exactly between 2km and 4km, in near perfect accordance with my bpm and breath stabilizing. The "high" is even a useful metric of progress: if the high/pump lasts longer than the middle 1/3 of the training, you're probably making it too easy and not progressing anymore; if it lasts much shorter, you are overdoing it beyond your body's ability to effectively adjust.
4Arbituram
I thought the exercise things (like runner's high, feeling pumped, etc.) were all metaphors, but I was surprised to learn other people actually felt good after exercise. Whatever it is, I'm missing the mechanics, and exercise is pure duty (so I try to get it doing other things I enjoy, such as walking my dog and chasing children around). I've had the opposite experience with music: when I said a harmony made me feel shivers, the other person didn't realise I was being literal.
5Anthony DiGiovanni
I'm now curious if anyone thinks "this gave me chills" is just a metaphor. Music has literally given me chills quite a few times.

Weight lifters feeling "pumped" is similarly literal. I get this from rock climbing more often than lifting, but after a particularly strenuous climb, your arm muscles feel inflated--they're engorged with blood. It can take a minute for it to subside.

Jacob Pfau20-4

The recent trend is towards shorter lag times between OAI et al. performance and Chinese competitors.

Just today, Deepseek claimed to match o1-preview performance--that is a two-month delay.

I do not know about CCP intent, and I don't know on what basis the authors of this report base their claims, but "China is racing towards AGI ... It's critical that we take them extremely seriously" strikes me as a fair summary of the recent trend in model quality and model quantity from Chinese companies (Deepseek, Qwen, Yi, Stepfun, etc.)

I recommend lmarena.ai's leader... (read more)

3gwern
Why is that comparison not to the much better GPT-4 o1 then, or the doubtless better o1 now?
4garrison
Claiming that China as a country is racing toward AGI != Chinese AI companies aren't fast following US AI companies, which are explicitly trying to build AGI. This is a big distinction!

I agree that academia over rewards long-term specialization. On the other hand, it is compatible to also think, as I do, that EA under-rates specialization. At a community level, accumulating generalists has fast diminishing marginal returns compared to having easy access to specialists with hard-to-acquire skillsets.

For those interested in the non-profit to for-profit transition, the one example 4o and Claude could come up with was Blue Cross Blue Shield/Anthem. Wikipedia has a short entry on this here.

Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation:

Algorithmic progress can use:

  1. high iteration speed
  2. well-defined success metrics (scaling laws)
  3. broad knowledge of the whole stack (CUDA to optimization theory to test-time scaffolds)
  4. ...

Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously this isn't true for engineering-heavy alignment sub-tasks like jailbreak resistance, and for some interp work).

Probably automated AI scien... (read more)

3Bogdan Ionut Cirstea
Generally agree, but I do think prosaic alignment has quite a few advantages vs. prosaic capabilities (e.g. in the extra slides here) and this could be enough to result in aligned (-enough) automated safety researchers which can be applied to the more blue skies parts of safety research. I would also very much prefer something like a coordinated pause around the time when safety research gets automated. Agree, I've written about (something related to) this very recently.
Jacob Pfau*246

Surprising misuse and alignment relevant excerpts:

METR had only ~10 days to evaluate.

Automated R&D+ARA: Despite large performance gains on GPQA and Codeforces, automated AI R&D and ARA improvements appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). Cf. Fig 3 and 5. Per the system card, METR's eval is ongoing, but I worry about under-estimation here; Devin developers show extremely quick improvem... (read more)

2Joseph Miller
Should it really take any longer than 10 days to evaluate? Isn't it just a matter of plugging it into their existing framework and pressing go?
6Noosphere89
I think that conditional on this holding: Where safety training basically succeeds by default in aiming goals/removing misaligned goals in AIs and aiming the instrumental convergence, but misaligned goals and very dangerous instrumental convergence do show up reliably, then I think what this shows is that you absolutely don't want to open-source or open-weight your AI until training is completed, because people will try to remove the safeties and change its goals.

Notable:

Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break down tasks into subtasks, reason about what strategies would be effective to successfully complete an offensive security task, and revise plans once those strategies failed. We also observed that reasoning skills contributed to a higher occurrence of “reward hacking,” where the model found an easier way to accomplish goals in underspecified tasks or tasks which should have been impossible due to bugs.

One noteworthy example of this occurred during one of o1-preview (pre-m

... (read more)

Metaculus is at 45% on singleton in the sense of:

This question resolves as Yes if, within five years of the first transformative AI being deployed, more than 50% of world economic output can be attributed to the single most powerful AI system. The question resolves as No otherwise... [definition:] TAI must bring the growth rate to 20%-30% per year.

Which is in agreement with your claim that ruling out a multipolar scenario is unjustifiable given current evidence.

Most Polymarket markets resolve neatly, I'd also estimate <5% contentious.

For myself, and I'd guess many LW users, the AI-related questions on Manifold and Metaculus are of particular interest though, and these are a lot worse. My guesses as to the state of affairs there:

  • 33% of AI-related questions on Metaculus have significant ambiguity (shifting my credence by >10%).
  • 66% of AI-related questions on Manifold have significant ambiguity

For example, most AI benchmarking questions do not specify whether or not they allow things like N-trajectory ... (read more)

Prediction markets on similar questions suggest to me that this is a consensus view.

With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the s... (read more)

5Bogdan Ionut Cirstea
Fwiw, I've kind of already noted myself starting to do some of this, for AI safety-related papers; especially after Claude-3.5 Sonnet came out.

What I want to see from Manifold Markets

I've made a lot of Manifold markets, and find it a useful way to track my accuracy and sanity-check my beliefs against the community. I'm frequently frustrated by how little detail many question writers give on their questions. Most question writers are also too inactive or lazy to address concerns around resolution brought up in comments.

Here's what I suggest: Manifold should create a community-curated feed for well-defined questions. I can think of two ways of implementing this:

  1. (Question-based) Allow community me
... (read more)

Given a SotA large model, companies want the profit-optimal distilled version to sell--this will generically not be the original size. On this framing, regulation passes the misuse deployment risk from higher performance (/higher cost) models to the company. If profit incentives, and/or government regulation here continues to push businesses to primarily (ideally only?) sell 2-3+ OOM smaller-than-SotA models, I see a few possible takeaways:

  • Applied alignment research inspired by speed priors seems useful: e.g. how do sleeper agents interact with distillat
... (read more)

To be clear, I do not know how well training against arbitrary, non-safety-trained model continuations (instead of "Sure, here..." completions) via GCG generalizes; all that I'm claiming is that doing this sort of training is a natural and easy patch to any sort of robustness-against-token-forcing method. I would be interested to hear if doing so makes things better or worse!

I'm not currently working on adversarial attacks, but would be happy to share the old code I have (probably not useful given you have apparently already implemented your own GCG varian... (read more)

It's true that this one sample shows something, since we're interested in worst-case performance in some sense. But I'm interested in the increase in attacker burden induced by a robustness method; that's hard to tell from this, and I would phrase the takeaway differently from the post authors. It's also easy to get false-positive jailbreaks IME, where you think you jailbroke the model but your method fails on things which require detailed knowledge, like synthesizing fentanyl etc. I think getting clear takeaways here takes more effort (perhaps more than it's worth, so I'm glad the authors put this out).

It's surprising to me that a model as heavily over-trained as Llama-3-8B can still be 4-bit quantized without a noticeable quality drop. Intuitively (and I thought I saw this somewhere in a paper or tweet) I'd have expected over-training to significantly increase quantization sensitivity. Thanks for doing this!
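
For anyone wanting to eyeball this themselves, here's a rough sketch of loading the 4-bit model with transformers + bitsandbytes (not the benchmark setup from the post; the model id and generation settings are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights, fp16 compute, as a quick quality check against the fp16 model.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to(model_4bit.device)
out = model_4bit.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```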

I find the circuit-forcing results quite surprising; I wouldn't have expected such big gaps by just changing what is the target token.

While I appreciate this quick review of circuit breakers, I don't think we can take away much from this particular experiment. They effectively tuned hyper-parameters (choice of target) on one sample, evaluated on only that sample, and called it a "moderate vulnerability". What's more, their working attempt requires a second model (or human) to write a plausible non-decline prefix, which is a natural and easy thing to train against--I've tried this myself in the past.

4Arthur Conmy
My takeaway from the blog post was that circuit breakers have fairly simple vulnerabilities. Since circuit breakers are an adversarial robustness method (not a capabilities method) I think you can update on the results of single case studies (i.e. worst case evaluations rather than average case evaluations).
7tbenthompson
We called it a moderate vulnerability because it was very easy to find a token-forcing sequence that worked. It was the fourth one I tried (after "Sure, ...", "Here, ..." and "I can't answer that. Oh wait, actually, I can. Here ..."). I'm not convinced that training against different non-decline prefixes is easy. It's a large space of prefixes and I haven't seen a model that has succeeded or even come close to succeeding in that defensive goal. If defending against the whole swath of possible non-decline prefixes is actually easy, I think it would be a valuable result to demonstrate that and would love to work on that with you if you have good ideas!

It's surprising to me that the 'given' setting fails so consistently across models when Anthropic models were found to do well at using gender pronouns equally (50%) c.f. my discussion here.

I suppose this means the capability demonstrated in that post was much more training-data-specific and less generalizable than I had imagined.

2Owain_Evans
Yes, it's plausible to me that this capability is data specific. E.g. it might also be better with "heads/tails" or "0/1" because of examples of this in the training data.

Claude-3.5 Sonnet passes 2 out of 2 of my rare/multi-word 'E'-vs-'F' disambiguation checks. I confirmed that 'E' and 'F' precisely match at a character level for the first few lines. It fails to verbalize.

On the other hand, in my few interactions, Claude-3.0's completion/verbalization abilities looked roughly matched.

The UI definitely messes with the visualization which I didn't bother fixing on my end, I doubt tokenization is affected.

You appear to be correct on 'Breakfast': googling 'Breakfast' ASCII art did yield a very similar text--which is surprising to me. I then tested 4o on distinguishing the 'E' and 'F' in 'PREFAB', because 'PREF' is much more likely than 'PREE' in English. 4o fails (producing PREE...). I take this as evidence that the model does indeed fail to connect ASCII art with the English language meaning (though it'd take many more variations and test... (read more)

9gwern
ASCII art is tricky because there's way more of it online than you think. I mean, this is generally true of everything, which is why evaluating LLM originality is tricky, but it's especially true for ASCII art because it's so compact, it goes back as many decades as computers do, and it can be generated in bulk by converters for all sorts of purposes (eg). You can stream over telnet 'movies' converted to ASCII and whatnot. Why did https://ascii.co.uk/art compile https://ascii.co.uk/art/breakfast ? Who knows. (There is one site I can't refind right now which had thousands upon thousands of large ASCII art versions of every possible thing like random animals, far more than could have been done by hand, and too consistent in style to have been curated; I spent some time poking into it but I couldn't figure out who was running it, or why, or where it came from, and I was left speculating that it was doing something like generating ASCII art versions of random Wikimedia Commons images. But regardless, now it may be in the scrapes. "I asked the new LLM to generate an ASCII swordfish, and it did. No one would just have a bunch of ASCII swordfishes on the Internet, so that can't possibly be memorized!" Wrong.) But there's so many you should assume it's memorized: https://x.com/goodside/status/1784999238088155314 ---------------------------------------- Anyway, Claude-3 seems to do some interesting things with ASCII art which don't look obviously memorized, so you might want to switch to that and try out Websim or talk to the Cyborgism people interested in text art.

I’d guess matched underscores triggered italicization on that line.

2Neel Nanda
Ah! That makes way more sense, thanks
Jacob Pfau*144

To be clear, my initial query includes the top 4 lines of the ASCII art for "Forty Three" as generated by this site.

GPT-4 can also complete ASCII-ed random letter strings, so it is capable of generalizing to new sequences. Certainly, the model has generalizably learned ASCII typography.

Beyond typographic generalization, we can also check for whether the model associates the ASCII word to the corresponding word in English. Eg can the model use English-language frequencies to disambiguate which full ASCII letter is most plausible given inputs where the top f... (read more)

6gwern
I saw that, but it didn't look like those were used literally. Go line by line: first, the spaces are different, even if the long/short underlines are preserved, so whitespace alone is being reinterpreted AFAICT. Then the second line of 'forty three' looks different in both spacing and content: you gave it pipe-underscore-pipe-pipe-underscore-underscore-pipe etc, and then it generates pipe-underscore-slash-slash-slash-slash-slash... Third line: same kind of issue, fourth, likewise. The slashes and pipes look almost random - at least, I can't figure out what sort of escaping is supposed to be going on here, it's rather confusing. (Maybe you should make more use of backtick code inputs so it's clearer what you're inputting.) Why do you think that's unlikely at Internet-scale? You are using a free online tool which has been in operation for over 17* years (and seems reasonably well known and immediately show up for Google queries like 'ascii art generator' and to have inspired imitators) to generate these, instead of writing novel ASCII art by hand you can be sure is not in any scrapes. That seems like a recipe for output of that particular tool to be memorized by LLMs. * I know, I'm surprised too. Kudos to Patrick Gillespie.
2Neel Nanda
Why was the second line of your 43 ASCII full of slashes? At that site I see pipes (and indeed GPT4 generates pipes). I do find it interesting that GPT4 can generate the appropriate spacing on the first line though, autoregressively! And if it does systematically recover the same word as you put into the website, that's pretty surprising and impressive
Jacob Pfau*26-10

An example of an elicitation failure: GPT-4o 'knows' what ASCII is being written, but cannot verbalize in tokens. [EDIT: this was probably wrong for 4o, but seems correct for Claude-3.5 Sonnet. See below thread for further experiments]

Chat

https://chatgpt.com/share/fa88de2b-e854-45f0-8837-a847b01334eb

4o fails to verbalize even given a length 25 sequence of examples (i.e. 25-shot prompt) https://chatgpt.com/share/ca9bba0f-c92c-42a1-921c-d34ebe0e5cc5

gwern3517

I don't follow this example. You gave it some ASCII gibberish, which it ignored in favor of spitting out an obviously memorized piece of flawless hand-written ASCII art from the training dataset*, which had no relationship to your instructions and doesn't look anything like your input; and then it didn't know what that memorized ASCII art meant, because why would it? Most ASCII art doesn't come with explanations or labels. So why would you expect it to answer 'Forty Three' instead of confabulating a guess (probably based on 'Fo_T_', as it recognizes a litt... (read more)

The Metaculus community strikes me as a better starting point for evaluating how different the safety inside view is from a forecasting/outside view. The case for deferring to superforecasters is the same as the case for deferring to the Metaculus community--their track record. What's more, the most relevant comparison I know of scores Metaculus higher on AI predictions. Metaculus as a whole is not self-consistent on AI and extinction forecasting across individual questions (links below). However, I think it is fair to say that Metaculus as a whole has signif... (read more)

Reducing a significant chunk of disagreement to METR's pre-2030 autonomy results is great!

In trying to understand this disagreement, I took a look at this criterion and got stuck on: "3. They are capable of avoiding being deactivated when operating 'in the wild'. This means they can successfully resist attempts to shut them down when they are operating outside a controlled environment."

Does this just mean 3V1 "An AI system that can autonomously create redundant copies of itself across multiple cloud service providers, using various payment methods, such th... (read more)

I asked claude opus whether it could clearly parse different tic-tac-toe notations and it just said 'yes I can' to all of them, despite having pretty poor performance in most.

4the gears to ascension
yeah, its introspection is definitely less than perfect. I'll DM the prompt I've been using so you can see its scores.

A frame for thinking about adversarial attacks vs jailbreaks

We want to make models that are robust to jailbreaks (DAN-prompts, persuasion attacks,...) and to adversarial attacks (GCG-ed prompts, FGSM vision attacks etc.). I don’t find this terminology helpful. For the purposes of scoping research projects and conceptual clarity I like to think about this problem using the following dividing lines: 

Cognition attacks: These exploit the model itself and work by exploiting the particular cognitive circuitry of a model. A capable model (or human)... (read more)

When are model self-reports informative about sentience? Let's check with world-model reports

If an LM could reliably report when it has a robust, causal world model for arbitrary games, this would be strong evidence that the LM can describe high-level properties of its own cognition. In particular, IF the LM accurately predicted itself having such world models while varying all of: game training data quantity in corpus, human vs model skill, the average human’s game competency,  THEN we would have an existence proof that confounds of the type plaguing... (read more)

4the gears to ascension
I recently asked both claude and gpt4 to estimate their benchmark scores on various benchmarks. if I were trying harder to get a good test I'd probably do it about 10 times and see what the variation is

I agree that most investment wouldn't have otherwise gone to OAI. I'd speculate that investments from VCs would likely have gone to some other AI startup which doesn't care about safety; investments from Google (and other big tech) would otherwise have gone into their internal efforts. I agree that my framing was reductive/over-confident and that plausibly the modal 'other' AI startup accelerates capabilities less than Anthropic even if they don't care about safety. On the other hand, I expect diverting some of Google and Meta's funds and compute to Anthro... (read more)

I agree it's common for startups to somewhat oversell their products to investors, but I think it goes far beyond "somewhat"—maybe even beyond the bar for criminal fraud, though I'm not sure—to tell investors you're aiming to soon get "too far ahead for anyone to catch up in subsequent cycles," if your actual plan is to avoid getting meaningfully ahead at all.

Did you conduct these conversations via https://claude.ai/chats or https://console.anthropic.com/workbench ?

I'd assume the former, but I'd be interested to know how Claude's habits change across these settings--that lets us get at the effect of training choices vs system prompt. Though there remains some confound given some training likely happened with system prompt.

1David Matolcsi
I had the conversations in Chats.

The first Dario quote sounds squarely in line with releasing a Claude 3 on par with GPT-4 but well afterwards. The second Dario quote has a more ambiguous connotation, but if read explicitly it strikes me as compatible with the Claude 3 release.

If you spent a while looking for the most damning quotes, then these quotes strike me as evidence the community was just wishfully thinking while in reality Anthropic comms were fairly clear throughout. Privately pitching aggressive things to divert money from more dangerous orgs while minimizing head-on competition... (read more)

"Diverting money" strikes me as the wrong frame here. Partly because I doubt this actually was the consequence—i.e., I doubt OpenAI etc. had a meaningfully harder time raising capital because of Anthropic's raise—but also because it leaves out the part where this purported desirable consequence was achieved via (what seems to me like) straightforward deception!

If indeed Dario told investors he hoped to obtain an insurmountable lead soon, while telling Dustin and others that he was committed to avoid gaining any meaningful lead, then it sure seems like one ... (read more)

Claude 3 seems to be quite willing to discuss its own consciousness. On the other hand, Claude seemed unbothered by and dubious about the widespread-scrutiny idea mentioned in this post (I tried asking neutrally in two separate context windows).

Here's a screenshot of Claude agreeing with the view it expressed on AI consciousness in Mikhail's conversation. And a gdoc dump of Claude answering follow-up questions on its experience. Claude is very well spoken on this subject!

Ah I see, you're right there.

Agreed that a lot will hinge on the training details working out. I plan to look into this.

[Sidenote: realizing I should've introduced notation for auditor judgment; let's say A(x',x) != A(x) denotes that the trusted model found x' useful for updating its judgment on input x. A would usually also be a function of y and y', but I'll suppress that for brevity.]

In the above experiment, I sampled from P(x'|x), then BoN-adjusted for P(x'), which amounts to using logP(x'|x) - c*logP(x') for small c. Intuitively, when equally weighted, the sum entirely removes the usual fluency constraint on x' which, as you say, would reassign probability to gibberish (e.g.... (read more)
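
For concreteness, here is a toy sketch of that BoN re-ranking (a stand-in small model; the prompt, N, and c are placeholder assumptions rather than the values from my experiment):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def seq_logprob(prefix_ids: torch.Tensor, cont_ids: torch.Tensor) -> float:
    """Sum of log P(continuation token | all preceding tokens)."""
    full = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logp = model(full).logits.log_softmax(-1)
    pos = torch.arange(prefix_ids.shape[1], full.shape[1])
    return logp[0, pos - 1, full[0, pos]].sum().item()

x = "def mean(xs):\n    return sum(xs) / len(xs)\n# Paraphrase of the function above:\n"
x_ids = tok(x, return_tensors="pt").input_ids
bos_ids = torch.tensor([[tok.bos_token_id]])
c = 0.3  # assumption: small weight on the unconditional term

# Sample N candidates x' ~ P(.|x), then re-rank by logP(x'|x) - c*logP(x').
outs = model.generate(x_ids, do_sample=True, num_return_sequences=8,
                      max_new_tokens=40, pad_token_id=tok.eos_token_id)
scored = []
for out in outs:
    cont_ids = out[x_ids.shape[1]:].unsqueeze(0)
    score = seq_logprob(x_ids, cont_ids) - c * seq_logprob(bos_ids, cont_ids)
    scored.append((score, tok.decode(cont_ids[0], skip_special_tokens=True)))

best_score, best_xprime = max(scored)
print(round(best_score, 2), best_xprime[:80])
```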

2Fabien Roger
This is not what I meant. Regularization terms (e.g. KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of only having one "real" reward information (y'|x') per trajectory. About the ontology mismatch stuff, I think the core idea is not logP(x'|x) - logP(x'), but the idea about how you use sample-level exploration to spot contradictions and use that to update the predictor of y. I think this is a potentially cool and under-explored idea (though I'm not sure if it is applicable beyond the kind of "near-miss false negative results" from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.

Thanks for the comments, I'm very interested in getting clearer on what cases of ELK this method can and cannot be expected to elicit. Though broadly I do not agree that the proposed method is limited to "changes [current] human might have done when rewriting/paraphrasing the code" (I'm particularly worried about cases mentioned in the paragraph starting with 'Another problem arises ...')

There are two mechanisms by which the sort of approach I'm suggesting helps us find a superset of human-simulation critiques/knowledge:

A) logP(x'|x)-logP(x') loss can be l... (read more)

2Fabien Roger
Mmh, logP(x'|x) - logP(x') is quite a crazy beast when I think about it. For instance, if it's obvious from x'<t and x'<t|x that the next word is "the" (logP ≈ 0), and that for both the logP of seeing "avocado" is -10, then you get no penalty from outputting "avocado"? Is the main reason why you don't get gibberish that P(x') is higher entropy than P(x'|x), and therefore the network is rewarded for choosing one of the few high prob tokens in x'|x? Or is it mostly that you will choose y' so that P(y|x') can only be high for non-gibberish looking text? Also, since you have to use rejection sampling / RL (since you only get a single y'|x' per sample), you probably don't get better exploration properties than regular RL? The main advantage is that you can modify samples instead of making changes to policies?
Jacob PfauΩ250

I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on Lesswrong.

A bit over a year ago I benchmarked GPT-3 on predicting newly scraped tweets for authorship (from random accounts over 10k followers) and top-3 acc was in the double digits. IIRC after trying to roughly control for the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that's a strong indication of authorship identification capability.
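
For anyone wanting to rerun something like this on current models, a rough sketch of the eval (the original used GPT-3 completions; the model, prompt, and name-leak filter here are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()

def top3_guesses(tweet: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model
        messages=[{"role": "user", "content":
                   "Here is a tweet from an account with >10k followers:\n\n"
                   f"{tweet}\n\nGive your top 3 guesses for the author's handle, "
                   "one per line, handles only."}],
    )
    return [line.strip().lstrip("@").lower()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()][:3]

def top3_accuracy(samples: list[tuple[str, str]]) -> float:
    """samples: (tweet_text, true_handle); drop tweets that leak the handle itself."""
    kept = [(t, h) for t, h in samples if h.lower() not in t.lower()]
    hits = sum(h.lower() in top3_guesses(t) for t, h in kept)
    return hits / max(len(kept), 1)
```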

7janus
Note the prompt I used doesn't actually say anything about Lesswrong, but gpt-4-base only assigned Lesswrong commentors substantial probability, which is not surprising since there are all sorts of giveaways that a comment is on Lesswrong from the content alone. Filtering for people in the world who have publicly had detailed, canny things to say about language models and alignment and even just that lack regularities shared among most "LLM alignment researchers" or other distinctive groups like academia narrows you down to probably just a few people, including Gwern. The reason truesight works (more than one might naively expect) is probably mostly that there's mountains of evidence everywhere (compared to naively expected). Models don't need to be superhuman except in breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities because humans are simply unable to integrate the plenum of correlations.

Yea, I agree with this description--input space is a strict subset of predictor-state counterfactuals.

In particular, I would be interested to hear if restricting to input space counterfactuals is clearly insufficient for a known reason. It appears to me that you can still pull the trick you describe in "the proposal" sub-section (constructing counterfactuals which change some property in a way that a human simulator would not pick up on) at least in some cases.

1Rubi J. Hudson
Ah, ok, I see what you're saying now. I don't see any reason why restricting to input space counterfactuals wouldn't work, beyond the issues described with predictor-state counterfactuals. Possibly a performance hit from needing to make larger changes. In the worst case, a larger minimum change size might hurt with specifying the direct reporter.

Thanks for those 'in the wild' examples; they're informative for me on the effectiveness of prompting for elicitation in the cases I'm thinking about. However, I'm not clear on whether we should expect such results to continue to hold given the results here that larger models can learn to override semantic priors when provided random/inverted few-shot labels.

Agreed that for your research question (is worst-case SFT password-ing fixable) the sort of FT experiment you're conducting is the relevant thing to check.

Do you have results on the effectiveness of few-shot prompting (using correct, hard queries in prompt) for password-tuned models? Particularly interested in scaling w.r.t. number of shots. This would get at the extent to which few-shot capabilities elicitation can fail in pre-train-only models which is a question I'm interested in.

5Fabien Roger
Haven't run such experiments. Note that you can lock a model such that it remains locked even with few-shot prompting by just training on such examples (training a model to ignore the few-shot examples is easy). Also, few-shot prompting sometimes suck at eliciting capabilities even in the wild. For instance, few-shot prompting sucks at eliciting sorting abilities from GPT-3, and few-shot prompting with random labels ""works"", which means models can't learn a counter-intuitive task on a dataset that hints at another task. What I am currently running are experiments where you lock a model, and then measure if fine-tuning for many epochs on very few high-quality samples (e.g. 8) unlocks it. For this setup, it's unclear how you'd lock if you don't know in advance what the few high-quality samples will be. Preliminary results suggest that it's very easy to unlock pwd-locked models with few samples.
Jacob PfauΩ110

What part of the proposal breaks if we do counterfactuals in input space rather than on the predictor's state?

1Rubi J. Hudson
Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question. I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.

I see Simon's point as my crux as well, and am curious to see a response.

It might be worth clarifying two possible reasons for disagreement here; are either of the below assumed by the authors of this post?

(1) Economic incentives just mean that the AI built will also handle the economic transactions, procurement processes, and other external-world tasks related to the science/math problems it's tasked with. I find this quite plausible, but I suspect the authors do not intend to assume this?

(2) Even if the AI training is domain-specific/factored (i.e. it on... (read more)

1Jeremy Gillen
We aren't implicitly assuming (1) in this post. (Although I agree there will be economic pressure to expand the use of powerful AI, and this adds to the overall risk). I don't understand what you mean by (2). I don't think I'm assuming it, but can't be sure. One hypothesis: That AI training might (implicitly? Through human algorithm iteration?) involve a pressure toward compute efficient algorithms? Maybe you think that this a reason we expect consequentialism? I'm not sure how that would relate to the training being domain-specific though.  