just not do gradient descent on the internal chain of thought, then it's just a worse scratchpad.
This seems like a misunderstanding. When OpenAI and others talk about not optimising the chain of thought, they mean not optimising it for looking nice. That still means optimising it for its contribution to the final answer i.e. for being the best scratchpad it can be (that's the whole paradigm).
If what you mean is you can't be that confident given disagreement, I dunno, I wish I could have that much faith in people.
In another way, being that confident despite disagreement requires faith in people — yourself and the others who agree with you.
I think one reason I have a much lower p(doom) than some people is that although I think the AI safety community is great, I don’t have that much more faith in its epistemics than everyone else’s.
When I was a kid I used to love playing RuneScape. One day I had what seemed like a deep insight. Why did I want to kill enemies and complete quests? In order to level up and get better equipment. Why did I want to level up and get better equipment? In order to kill enemies and complete quests. It all seemed a bit empty and circular. I don't think I stopped playing RuneScape after that, but I would think about it every now and again and it would give me pause. In hindsight, my motivations weren't really circular — I was playing Ru...
Maybe you could spell this out a bit more? What concretely do you mean when you say that anything that outputs decisions implies a utility function — are you thinking of a certain mathematical result/procedure?
anything that outputs decisions implies a utility function
I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
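To illustrate the non-boring sense, here's a toy construction of my own (not something from the quoted claim): a two-state MDP in which the policy "always switch" maximises a utility function over states only when that utility function is constant.

```python
# Toy two-state MDP (hypothetical example): states {A, B}, actions {stay, switch},
# deterministic transitions, reward u(current state), discount GAMMA.
# The policy "always switch" is optimal only when u(A) == u(B), i.e. it does not
# maximise any non-constant utility function over states.
import itertools

GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "switch"]

def next_state(s, a):
    return s if a == "stay" else ("B" if s == "A" else "A")

def policy_values(policy, u, iters=1000):
    # Discounted value of following `policy` forever, evaluated by iteration.
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: u[s] + GAMMA * V[next_state(s, policy[s])] for s in STATES}
    return V

def is_optimal(policy, u):
    # Optimal iff no deterministic policy achieves a strictly higher value at any state.
    V = policy_values(policy, u)
    for acts in itertools.product(ACTIONS, repeat=len(STATES)):
        W = policy_values(dict(zip(STATES, acts)), u)
        if any(W[s] > V[s] + 1e-6 for s in STATES):
            return False
    return True

always_switch = {s: "switch" for s in STATES}

for uA in [0.0, 0.5, 1.0]:
    for uB in [0.0, 0.5, 1.0]:
        u = {"A": uA, "B": uB}
        print(u, "always-switch optimal:", is_optimal(always_switch, u))
# Prints True only when uA == uB -- the "boring" constant-utility case.
```

The brute-force check is just for concreteness; the argument is immediate analytically: if u(A) > u(B) then staying at A beats always switching from A, and symmetrically if u(B) > u(A).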
I think this generalises too much from ChatGPT, and also reads too much into ChatGPT's nature from the experiment, but it's a small piece of evidence.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us.
Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I'm not sure whether it implies that you should be able to make a task-based AGI.
...Obvious nitpick: It's just "gain as much power as is helpful
I think this gets at the heart of the question (but doesn't consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?
I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.
If you have the time look up “Terence Tao” on Gwern’s website.
In case anyone else is going looking, here is the relevant account of Tao as a child and here is a screenshot of the most relevant part:
I use ChatGPT voice-to-text all the time. About 1% of the time, the message I record in English gets seemingly-roughly-correctly translated into Welsh, and ChatGPT replies in Welsh. Sometimes my messages go back to English on the next message, and sometimes they stay in Welsh for a while. Has anyone else experienced this?
Example: https://chatgpt.com/share/67e1f11e-4624-800a-b9cd-70dee98c6d4e
I think the commenter is asking something a bit different - about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1 hour long tasks I can read it as a 50% success rate on average across all of the tasks any knowledge worker faces?
Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they’re selected to be measurable, etc.?
I vaguely remember a LessWrong comment from you a couple of years ago saying that you included Agent Foundations in the AGISF course as a compromise despite not thinking it’s a useful research direction.
Could you say something about why you’ve changed your mind, or what the nuance is if you haven’t?
I'd be curious to know if conditioning on high agreement alone had less of this effect than conditioning on high karma alone (because something many people agree on is unlikely to be a claim of novel evidence, and more likely to be a take).
Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.
More or less, yes. But I don't think it suggests there might be other prompts around that unlock similar improvements -- chain-of-thought works because it allows the model to spend more serial compute on a problem, rather than because of something really important about the words.
Agree that pauses are a clearer line. But even if a pause and tool-limit are both temporary, we should expect the full pause to have to last longer.
One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the coordination problem isn't as difficult because you might just need to get the smallish pool of leading actors to coordinate for a while, rather than everyone to coordinate indefinitely.
I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction.
Even granting that, do you think the same applies to the cognition of an AI created using deep learning -- is it approximating Solomonoff induction when presented with a new problem at inference time?
I think it's not, for reasons like the ones in aysja's comment.
Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.
I think it's downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we've made progress eliminating hypotheses from this list.
Fair enough, yeah -- this seems like a very reasonable angle of attack.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren't "Hypothesis 1: Written goal specification", "Hypothesis 2: Developer-intended goals", and "Hypothesis 3: Unintended version of written goals and/or human intentions" all compatible with either kind of AI?
"Hypothesis 4: Reward/reinforcement" does assume a consequentialist, and so does "Hypothesis 5: Proxies and/or instrumentally convergent goals" as written, although it seems like 'proxy virtues' could maybe be a thing too?
(Unrelatedly, it's n...
One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.
In this post you use 'goals' in quite a broad way, so as to include stuff like virtues (e.g. "always be honest"). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it's motivated primarily by things like virtues, habits, or rules.
This would be the most important axis to hypothesise about if it was the case that instrumental converge...
You know you’re feeling the AGI when a compelling answer to “What’s the best argument for very short AI timelines?” lengthens your timelines
Interesting. My handwavey rationalisation for this would be something like:
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there's a sort of no free lunch theorem: either you don't act on the outputs of the oracle at all, or it's just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about powerful agents is that their outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you're going to take in response to its outputs, its ou...
“It seems silly to choose your values and behaviors and preferences just because they’re arbitrarily connected to your social group.”
If you think this way, then you’re already on the outside.
I don’t think this is true — your average person would agree with the quote (if asked) and deny that it applies to them.
Finetuning generalises a lot but not to removing backdoors?
Seems like we don’t really disagree
The arguments in the paper are representative of Yoshua's views rather than mine, so I won't directly argue for them, but I'll give my own version of the case against
the distinctions drawn here between RL and the science AI all break down at high levels.
It seems common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you'r...
Pre-training, finetuning and RL are all types of training. But sure, expand 'train' to 'create' in order to include anything else like scaffolding. The point is it's not what you do in response to the outputs of the system, it's what the system tries to do.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference, which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it's clearly much better to use some things for validation rather than for training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when trainin...
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously-superior alternative to training against it is validating against it: i.e., when you observe scheming, you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should...
For some reason I've been muttering the phrase, "instrumental goals all the way up" to myself for about a year, so I'm glad somebody's come up with an idea to attach it to.
One time I was camping in the woods with some friends. We were sat around the fire in the middle of the night, listening to the sound of the woods, when one of my friends got out a bluetooth speaker and started playing donk at full volume (donk is a kind of funny, somewhat obnoxious style of dance music).
I strongly felt that this was a bad bad bad thing to be doing, and was basically pleading with my friend to turn it off. Everyone else thought it was funny and that I was being a bit dramatic -- there was nobody around for hundreds of metres, so we weren't...
Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision?
The things we care about aren't convergent instrumental goals for all terminal goals; the kitchen chef's constraints aren't doing that much to keep the kitchen liveable for cockroaches. But it seems to me that this maybe does gesture at a method to get away with pointing at a broad region of goal space instead of a near-pointlike region.
I'd like to do some experiments using your loan application setting. Is it possible to share the dataset?
But - what might the model that AGI uses to downrank visibility and serve up ideas look like?
What I was meaning to get at is that your brain is an AGI that does this for you automatically.
Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).
I do think it's a weaker reason than the second one. The following argument in defence of it is mainly for fun:
I slightly have the feeling that it's like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth... You'll be fine unless you eat the whole apple, in which case you'll be poisoned. Each time you're offered a piece it'...
Other (more compelling to me) reasons for being a "deathist":
Notes systems are nice for storing ideas but they tend to get clogged up with stuff you don't need, and you might never see the stuff you do need again. Wouldn't it be better if
Your brain is that notes system. On the other hand, writing notes is a great way to come up with new ideas.
and nobody else ever seems to do anything useful as a result of such fights
I would guess a large fraction of the potential value of debating these things comes from its impact on people who aren’t the main proponents of the research program, but are observers deciding on their own direction.
Is that priced in to the feeling that the debates don’t lead anywhere useful?
It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.
The notion of ‘fairness’ discussed in e.g. the FDT paper is something like: it’s fair to respond to your policy, i.e. what you would do in any counterfactual situation, but it’s not fair to respond to the way that policy is decided.
I think the hope is that you might get a result like “for all fair decision problems, decision-making procedure A is better than decision-making procedure B by some criterion to do with the outcomes it leads to”.
Without the fairness assumption you could create an instant counterexample to any such result by writing down a decision problem where decision-making procedure A is explicitly penalised, e.g. Omega checks whether you use A and gives you minus a million points if so.
a Bayesian interpretation where you don't need to renormalize after every likelihood computation
How does this differ from using Bayes' rule in odds ratio form? In that case you only ever have to renormalise if at some point you want to convert to probabilities.
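For concreteness, here's the kind of thing I have in mind (toy numbers of my own): in odds form you just multiply likelihood ratios as evidence comes in, and only convert back to a probability once at the end.

```python
# Bayes' rule in odds form (toy example with made-up numbers).
# Posterior odds = prior odds * product of likelihood ratios, so no renormalisation
# is needed after each likelihood computation -- only one conversion at the very end.
prior_odds = 1.0                      # P(H) / P(not H) = 0.5 / 0.5
likelihood_ratios = [3.0, 0.5, 4.0]   # P(e_i | H) / P(e_i | not H) for each observation

posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr              # update step: no normalisation here

posterior_prob = posterior_odds / (1.0 + posterior_odds)  # convert once at the end
print(posterior_odds, posterior_prob)  # 6.0 0.857...
```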
I think all of the following:
I found this really useful, thanks! I especially appreciate details like how much time you spent on slides at first, and how much you do now.
Relevant keyword: I think the term for interactions like this where players have an incentive to misreport their preferences in order to bring about their desired outcome is “not strategyproof”.
The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected.
Like if you had thrown 100 coins and then revealed that 80 were heads.
Suppose there is 10 years of monumental progress in mechanistic interpretability. We can roughly - not exactly - explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn't this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don't think that an AGI built with the current fingers-crossed-it's-faithful paradigm would be safe, what percentage of outputs would mech interp have to hit to beat that?
Seems like 99%+ to me.
I think it may or may not diverge from meaningful natural language in the next couple of years, and importantly I think we’ll be able to roughly tell whether it has. So I think we should just see (although finding other formats for interpretable autoregression could be good too).