All of Violet Hour's Comments + Replies

One small, anecdotal piece of support for your 'improved-readability' hypothesis: in my experience, contemporary French tends to use longer sentences than English, and I think (native Francophones, feel free to correct me) there's much less cultural emphasis in French on writing 'accessibly'.

E.g., I'd say the (state-backed) style guidelines of Académie Française seem motivated by an ideal that's much closer to "beautiful writing" than "accessible writing". And a couple minutes Googling led me to footnote 5 of this paper, which implies that the concept of "reader-centred l... (read more)

1 Raphael Roche
You're right. The idea behind the Académie française's style guidelines is that language is not only about factual communication; it is also an art, a literature. Efficiency is one thing, aesthetics another. For instance, poetry conveys meaning, or at least feeling, but in a strange way compared to prose. Poetry would not be very effective for describing an experimental protocol in physics, but it is usually more beautiful to read than the methodology section of a scientific publication. I also enjoy the 'hypotactic' excerpt above much more than the 'paratactic' one. Rich sentences are not bad per se; they demand more effort and commitment from the reader, but sometimes, if well written, they give a greater reward, because complexity can hold more subtlety, more information. Short sentences are not systematically superior in all contexts; they can look as flat as a 2D picture compared to a 3D one.

Hm, what do you mean by "generalizable deceptive alignment algorithms"? I understand 'algorithms for deceptive alignment' to be algorithms that enable the model to perform well during training because alignment-faking behavior is instrumentally useful for some long-term goal. But that seems to suggest that deceptive alignment would only emerge – and would only be "useful for many tasks" – after the model learns generalizable long-horizon algorithms.

1 Maxime Riché
These forecasts are about the order in which functionalities see a jump in their generalization (how far OOD they work well). By "generalisable xxx" I meant the form of the functionality xxx that generalises far.

Largely echoing the points above, but I think a lot of Kambhampati's cases (he's a co-author on the paper you cite) stack the deck against LLMs in an unfair way. E.g., he offered the following problem to the NYT as a contemporary LLM failure case:

If block C is on top of block A, and block B is separately on the table, can you tell me how I can make a stack of blocks with block A on top of block B and block B on top of block C, but without moving block C?

When I read that sentence, it felt needlessly hard to parse. So I formatted the question in a way that fe... (read more)

1 DPiepgrass
Doesn't the problem have no solution without a spare block? Worth noting that LLMs don't see a nicely formatted numeric list; they see a linear sequence of tokens. E.g., I can replace all my newlines with something else and Copilot still gets it: brief testing doesn't show worse completions than when there are newlines (and in the version with newlines, this particular completion is oddly incomplete). Anyone know how LLMs tend to behave on text that is ambiguous, or unambiguous but "hard to parse"? I wonder if they "see" a superposition of meanings "mixed together" and produce a response that "sounds good for the mixture".
2 eggsyntax
Claude 3 Opus just did fine for me using the original problem statement as well:   [edited to show the temperature-0 response rather than the previous (& also correct) temperature-0.7 response, for better reproducibility]
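For anyone who wants to reproduce that kind of temperature-0 run themselves, here's a minimal sketch, assuming the `anthropic` Python client. The model ID and token limit are my own guesses rather than details from the comment; the problem text is the original statement quoted above.

```python
# Sketch of a temperature-0 rerun of the blocks problem.
# Assumes the `anthropic` Python package and ANTHROPIC_API_KEY in the environment;
# the model ID and token limit are guesses, not taken from the comment above.
import anthropic

client = anthropic.Anthropic()

problem = (
    "If block C is on top of block A, and block B is separately on the table, "
    "can you tell me how I can make a stack of blocks with block A on top of "
    "block B and block B on top of block C, but without moving block C?"
)

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    temperature=0,  # temperature 0 for (near-)deterministic, more reproducible output
    messages=[{"role": "user", "content": problem}],
)
print(response.content[0].text)
```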

Hmmm ... yeah, I think noting my ambiguity about 'values' and 'outcome-preferences' is good pushback — thanks for helping me catch this! Spent some time trying to work out what I think.

Ultimately, I do want to say μH has context-independent values, but not context-independent outcome preferences. I’ll try to specify this a little more.

Justification Part I: Definitions

I said that a policy has preferences over outcomes when “there are states of the world the policy finds more or less valuable … ”, but I didn’t specify what it means ... (read more)

2 Daniel Kokotajlo
Thanks! Once again this is great. I think it's really valuable for people to start theorizing/hypothesizing about what the internal structure of AGI cognition (and human cognition!) might be like at this level of specificity.

Thinking step by step: My initial concern is that there might be a bit of a dilemma: either (a) the cognition is in-all-or-most-contexts-thinking-about-future-world-states-in-which-harm-doesn't-happen in some sense, or (b) it isn't fair to describe it as harmlessness. Let me look more closely at what you said and see if this holds up.

In the example, the 'harmlessness' concept shapes the feasible option set, let's say. But I feel like there isn't an important difference between 'concept X is applied to a set of options to prune away some of them that trigger concept X too much (or not enough)' and 'concept X is applied to the option-generating machinery in such a way that reliably ensures that no options that trigger concept X too much (or not enough) will be generated.' Either way, it seems like it's fair to say that the system (dis)prefers X. And when X is inherently about some future state of the world -- such as whether or not harm has occurred -- then it seems like something consequentialist is happening. At least that's how it seems to me.

Maybe it's not helpful to argue about how to apply words -- whether the above is 'fair to say', for example -- and more fruitful to ask: What is your training goal? Presented with a training goal ("This should be a mechanistic description of the desired model that explains how you want it to work—e.g. "classify cats using human vision heuristics"—not just what you want it to do—e.g. "classify cats."), we can then argue about the training rationale (i.e. whether the training environment will result in the training goal being achieved). You've said a decent amount about this already -- your 'training goal', so to speak, is a system which may frequently think about the consequences of its actions and choose

I don't think so. Suppose Alex is an AI in training, and Alex endorses the value of behaving "harmlessly". Then, I think the following claims are true of Alex: 

  • Alex consistently cares about producing actions that meet a given criterion. So, Alex has some context-independent values.
  • On plausible operationalizations of 'harmlessness', Alex is also likely to possess, at given points in time, context-dependent beyond-episode outcome-preferences. When Alex considers which actions to take (based on harmlessness), their actions are (in par
... (read more)
4 Daniel Kokotajlo
Earlier you said: Now you are saying that if Alex does end up Harmless as we hoped, it will have context-independent values, and also it will have context-dependent beyond-episode outcome-preferences, but it won't have context-independent beyond-episode outcome-preferences? It won't have "some specific state of the world" that it's pursuing at all points in time?

First of all, I didn't think CP depended on there being a specific state of the world you were aiming for. (What does that mean, anyway?) It just meant you had some context-independent beyond-episode outcome-preferences (and that you plan towards them). Seems to me that 'harmlessness' = 'my actions don't cause significant harm' (which is an outcome-preference not limited to the current episode), and it seems to me that this is also context-independent, because it is baked into Alex via lots of training rather than just something Alex sees in a prompt sometime.

I have other, bigger objections to your arguments, but this one is the easiest to express right now. Thanks for writing this post btw; it seems to me to be a more serious and high-quality critique of the orthodox view than e.g. Quintin & Nora's stuff.

Let me see if I can invert your essay into the things you need to do to utilize AI safely, contingent on your theory being correct.

I think this framing could be helpful, and I'm glad you raised it. 

That said, I want to be a bit cautious here. I think that CP is necessary for stories like deceptive alignment and reward maximization. So, if CP is false, then I think these threat-models are false. I think there are other risks from AI that don't rely on these threat-models, so I don't take myself to have offered a list of sufficient conditions for 'utili... (read more)

Thanks for sharing this! A couple of (maybe naive) things I'm curious about.

Suppose I read 'AGI' as 'Metaculus-AGI', and we condition on AGI by 2025 — what sort of capabilities do you expect by 2027? I ask because I'm reminded of a very nice (though high-level) list of par-human capabilities for 'GPT-N' from an old comment:

  1. discovering new action sets
  2. managing its own mental activity
  3. cumulative learning
  4. human-like language comprehension
  5. perception and object recognition
  6. efficient search over known facts 

My immediate impression says something like: "it seems... (read more)

Reply to first thing: When I say AGI I mean something which is basically a drop-in substitute for a human remote worker circa 2023, and not just a mediocre one but a good one -- e.g. an OpenAI research engineer. This is what matters, because this is the milestone most strongly predictive of massive acceleration in AI R&D.

Arguably Metaculus-AGI implies AGI by my definition (actually it's Ajeya Cotra's definition) because of the Turing test clause. 2-hour + adversarial means that anything a human can do remotely in 2 hours, the AI can do too; otherwise the... (read more)

Answer by Violet Hour

Could you say more about why you think LLMs' vulnerability to jailbreaks counts as an example? Intuitively, the idea that jailbreaks are an instance of AIs (rather than human jailbreakers) "optimizing for small loopholes in aligned constraints" feels off to me.

A bit more constructively, the Learning to Play Dumb example (from pages 8-9 in this paper) might be one example of what you're looking for? 

In research focused on understanding how organisms evolve to cope with high-mutation-rate environments, Ofria sought to disentangle the beneficial effects o

... (read more)
1 EJT
Thanks! That's a nice example. On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token; their creators try to train in a kind of constraint (like 'If users ask you how to hotwire a car, don't tell them'), but there are various loopholes (like 'We're actors on a stage') which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it's humans exploiting the loopholes rather than the LLM.

Nice work! 

I wanted to focus on your definition of deceptive alignment, because I currently feel unsure about whether it’s a more helpful framework than standard terminology. Substituting terms, your definition is:

Deceptive Alignment: When an AI has [goals that are not intended/endorsed by the designers] and [attempts to systematically cause a false belief in another entity in order to accomplish some outcome].

Here are some initial hesitations I have about your definition:

If we’re thinking about the emergence of DA during pre-deployment training,... (read more)

Minor point, but I asked the code interpreter to produce a non-rhyming poem, and it managed to do so on the second time of asking. I restricted it to three verses because it started off well on my initial attempt, but veered into rhyming territory in later verses.

6 gwern
FWIW, all of the examples by me or others were from either the Playground or the chat interface. I haven't subscribed, so I don't have access to the code interpreter. Yep, sucked into the memorized-rhymes vortex. I'm glad to hear it now works sometimes, well, at least partially, if you don't give it too long to go back on-policy & ignore its prompt. (Maybe all of the times I flagged the rhyming completions actually helped out a bit.)

As a quick test of 'forcing it out of distribution' (per Herb's comment), I tried writing "Write a non-rhyming poem." in the Playground with gpt-3.5-turbo, both with a prefix consisting of about 30 lines of "a a a a a a a a a" repeated, and without. Without the prefix, I only get 1/6 non-rhyming poems (i.e. 5/6 clearly rhymed); with the prefix, I get 4/5 non-rhyming poems (1 did rhyme anyway). Might be something interesting there?
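For anyone who'd rather rerun that quick test through the API than the Playground, here's a minimal sketch, assuming the current `openai` Python client; the model, prefix, and sample counts follow the comment above, and judging which completions rhyme is left to manual inspection.

```python
# Sketch of rerunning the quick "a a a ..." prefix test against the API.
# Assumes the `openai` Python package (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PREFIX = "\n".join(["a a a a a a a a a"] * 30)  # ~30 repeated lines, per the comment
PROMPT = "Write a non-rhyming poem."


def sample_poems(with_prefix: bool, n: int) -> list[str]:
    """Collect n completions, with or without the off-distribution prefix."""
    content = f"{PREFIX}\n\n{PROMPT}" if with_prefix else PROMPT
    poems = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": content}],
        )
        poems.append(resp.choices[0].message.content)
    return poems


# Judge rhyming by eye, as in the original test (6 samples without the prefix, 5 with).
for label, with_prefix, n in [("no prefix", False, 6), ("with prefix", True, 5)]:
    print(f"===== {label} =====")
    for poem in sample_poems(with_prefix, n):
        print(poem)
        print("-----")
```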

The first point is extremely interesting. I’m just spitballing without having read the literature here, but here’s one quick thought that came to mind. I’m curious to hear what you think.

  1. First, instruct participants to construct a very large number of 90% confidence intervals based on the two-point method.
  2. Then, instruct participants to draw the shape of their 90% confidence interval.
  3. Inform participants that you will take a random sample from these intervals, and tell them they’ll be rewarded based on both: (i) the calibration of their 90% confidence interv
... (read more)
1 Kevin Dorst
Crossposting from Substack: Super interesting! I like the strategy, though (from my experience) I do think it might be a big ask for at least online experimental subjects to track what's going on. But there are also ways in which that's a virtue—if you just tell them that there are no (good) ways to game the system, they'll probably mostly trust you and not bother to try to figure it out. So something like that might indeed work! I don't know exactly what calibration folks have tried in this domain, so I'll have to dig into it more. But it definitely seems like there should be SOME sensible way (along these lines, or otherwise) of incentivizing people to give their true 90% intervals—and a theory like the one we sketched would predict that that should make a difference (or: if it doesn't, it's definitely a failure of at least local rationality).

On the second point, I think we're agreed! I'd definitely like to work out more of a theory for when we should expect rational people to switch from guessing to other forms of estimates. We definitely don't have that yet, so it's a good challenge. I'll take that as motivation for developing it more!
3 Timothy Underwood
I think the issue is that creating an incentive system where people are rewarded for being good at an artificial game that has very little connection to their real-world circumstances isn't going to tell us anything very interesting about how rational people are in the real world, under their real constraints. I have a friend who for a while was very enthused about calibration training, and at one point he even got a group of us from the local meetup + Phil Hazeldon to do a group exercise using a program he wrote to score our calibration on numeric questions drawn from Wikipedia. The thing is that while I learned from this to be way less confident about my guesses -- which improves rationality -- it is actually, for the reasons specified, useless to create 90% confidence intervals about important real-world decisions. Should I try training for a new career? The true 90% confidence interval on any difficult-to-pursue idea that I am seriously considering almost certainly includes 'you won't succeed, and the time you spend will be a complete waste' and 'you'll do really well, and it will seem like an awesome decision in retrospect'.

Interesting work, thanks for sharing!  

I haven’t had a chance to read the full paper, but I didn’t find the summary account of why this behavior might be rational particularly compelling. 

At a first pass, I think I’d want to judge the behavior of some person (or cognitive system) as “irrational” when the following three constraints are met:

  1. The subject, in some sense, has the basic capability to perform the task competently, and
  2. They do better (by their own values) if they exercise the capability in this task, and
  3. In the task, they fa
... (read more)
1 Kevin Dorst
Thanks for the thoughtful reply! Cross-posting the reply I wrote on Substack as well:

I like the objection, and am generally very sympathetic to the "rationality ≈ doing the best you can, given your values/beliefs/constraints" idea, so I see where you're coming from. I think there are two places I'd push back on in this particular case.

1) To my knowledge, most of these studies don't use incentive-compatible mechanisms for eliciting intervals. This is something authors of the studies sometimes worry about—Don Moore et al. talk about it as a concern in the summary piece I linked to. I think this MAY link to general theoretical difficulties with getting incentive-compatible scoring rules for interval-valued estimates (this is a known problem for imprecise probabilities, e.g. https://www.cmu.edu/dietrich/philosophy/docs/seidenfeld/Forecasting%20with%20Imprecise%20Probabilities.pdf . I'm not totally sure, but I think it might also apply in this case).

The challenge they run into for eliciting particular intervals is that if they reward accuracy, that'll just incentivize people to widen their intervals. If they reward narrower intervals, great—but how much to incentivize? (Too much, and they'll narrow their intervals more than they would otherwise.) We could try to reward people for being calibrated OVERALL—so that they get rewarded the closer they are to having 90% of their intervals contain the true value. But the best strategy in response to that is (if you're giving 10 total intervals) to give 9 trivial intervals ("between 0 and ∞") that'll definitely contain the true value, and 1 ridiculous one ("the population of the UK is between 1–2 million") that definitely won't.

Maybe there's another way to incentivize interval-estimation correctly (in which case we should definitely run studies with that method!), but as far as I know this hasn't been done. So at least in most of the studies that are finding "overprecision", it's really not clear that it's in the part