just not do gradient descent on the internal chain of thought, then it's just a worse scratchpad.
This seems like a misunderstanding. When OpenAI and others talk about not optimising the chain of thought, they mean not optimising it for looking nice. That still means optimising it for its contribution to the final answer i.e. for being the best scratchpad it can be (that's the whole paradigm).
If what you mean is you can't be that confident given disagreement, I dunno, I wish I could have that much faith in people.
In another way, being that confident despite disagreement requires faith in people — yourself and the others who agree with you.
I think one reason I have a much lower p(doom) than some people is that although I think the AI safety community is great, I don’t have that much more faith in its epistemics than everyone else’s.
When I was a kid I used to love playing RuneScape. One day I had what seemed like a deep insight. Why did I want to kill enemies and complete quests? In order to level up and get better equipment. Why did I want to level up and get better equipment? In order to kill enemies and complete quests. It all seemed a bit empty and circular. I don't think I stopped playing RuneScape after that, but I would think about it every now and again and it would give me pause. In hindsight, my motivations weren't really circular — I was playing Ru...
Maybe you could spell this out a bit more? What concretely do you mean when you say that anything that outputs decisions implies a utility function — are you thinking of a certain mathematical result/procedure?
anything that outputs decisions implies a utility function
I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
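To illustrate the non-boring sense, here's a toy construction of my own (not something from the quoted claim): a two-state MDP in which the policy "always switch" maximises a utility function over states only when that utility function is constant.

```python
# Toy two-state MDP (hypothetical example): states {A, B}, actions {stay, switch},
# deterministic transitions, reward u(current state), discount GAMMA.
# The policy "always switch" is optimal only when u(A) == u(B), i.e. it does not
# maximise any non-constant utility function over states.
import itertools

GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "switch"]

def next_state(s, a):
    return s if a == "stay" else ("B" if s == "A" else "A")

def policy_values(policy, u, iters=1000):
    # Discounted value of following `policy` forever, evaluated by iteration.
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: u[s] + GAMMA * V[next_state(s, policy[s])] for s in STATES}
    return V

def is_optimal(policy, u):
    # Optimal iff no deterministic policy achieves a strictly higher value at any state.
    V = policy_values(policy, u)
    for acts in itertools.product(ACTIONS, repeat=len(STATES)):
        W = policy_values(dict(zip(STATES, acts)), u)
        if any(W[s] > V[s] + 1e-6 for s in STATES):
            return False
    return True

always_switch = {s: "switch" for s in STATES}

for uA in [0.0, 0.5, 1.0]:
    for uB in [0.0, 0.5, 1.0]:
        u = {"A": uA, "B": uB}
        print(u, "always-switch optimal:", is_optimal(always_switch, u))
# Prints True only when uA == uB -- the "boring" constant-utility case.
```

The brute-force check is just for concreteness; the argument is immediate analytically: if u(A) > u(B) then staying at A beats always switching from A, and symmetrically if u(B) > u(A).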
I think this generalises too much from ChatGPT, and also reads too much into ChatGPT's nature from the experiment, but it's a small piece of evidence.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us.
Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I'm not sure whether it implies that you should be able to make a task-based AGI.
...Obvious nitpick: It's just "gain as much power as is helpful
I think this gets at the heart of the question (but doesn't consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?
I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.
If you have the time look up “Terence Tao” on Gwern’s website.
In case anyone else is going looking, here is the relevant account of Tao as a child and here is a screenshot of the most relevant part:
I use ChatGPT voice-to-text all the time. About 1% of the time, the message I record in English gets seemingly-roughly-correctly translated into Welsh, and ChatGPT replies in Welsh. Sometimes my messages go back to English on the next message, and sometimes they stay in Welsh for a while. Has anyone else experienced this?
Example: https://chatgpt.com/share/67e1f11e-4624-800a-b9cd-70dee98c6d4e
I think the commenter is asking something a bit different - about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1 hour long tasks I can read it as a 50% success rate on average across all of the tasks any knowledge worker faces?
Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they’re selected to be measurable, etc.?
I vaguely remember a LessWrong comment from you a couple of years ago saying that you included Agent Foundations in the AGISF course as a compromise despite not thinking it’s a useful research direction.
Could you say something about why you’ve changed your mind, or what the nuance is if you haven’t?
I'd be curious to know if conditioning on high agreement alone had less of this effect than conditioning on high karma alone (because something many people agree on is unlikely to be a claim of novel evidence, and more likely to be a take).
Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.
More or less, yes. But I don't think it suggests there might be other prompts around that unlock similar improvements -- chain-of-thought works because it allows the model to spend more serial compute on a problem, rather than because of something really important about the words.
Agree that pauses are a clearer line. But even if a pause and tool-limit are both temporary, we should expect the full pause to have to last longer.
One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the coordination problem isn't as difficult because you might just need to get the smallish pool of leading actors to coordinate for a while, rather than everyone to coordinate indefinitely.
I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction.
Even granting that, do you think the same applies to the cognition of an AI created using deep learning -- is it approximating Solomonoff induction when presented with a new problem at inference time?
I think it's not, for reasons like the ones in aysja's comment.
Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.
I think it's downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we've made progress eliminating hypotheses from this list.
Fair enough, yeah -- this seems like a very reasonable angle of attack.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren't "Hypothesis 1: Written goal specification", "Hypothesis 2: Developer-intended goals", and "Hypothesis 3: Unintended version of written goals and/or human intentions" all compatible with either kind of AI?
"Hypothesis 4: Reward/reinforcement" does assume a consequentialist, and so does "Hypothesis 5: Proxies and/or instrumentally convergent goals" as written, although it seems like 'proxy virtues' could maybe be a thing too?
(Unrelatedly, it's n...
One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.
In this post you use 'goals' in quite a broad way, so as to include stuff like virtues (e.g. "always be honest"). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it's motivated primarily by things like virtues, habits, or rules.
This would be the most important axis to hypothesise about if it was the case that instrumental converge...
You know you’re feeling the AGI when a compelling answer to “What’s the best argument for very short AI timelines?” lengthens your timelines
Interesting. My handwavey rationalisation for this would be something like:
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there's a sort of no free lunch theorem: either you don't act on the outputs of the oracle at all, or it's just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about powerful agents is that their outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you're going to take in response to its outputs, its ou...
“It seems silly to choose your values and behaviors and preferences just because they’re arbitrarily connected to your social group.”
If you think this way, then you’re already on the outside.
I don’t think this is true — your average person would agree with the quote (if asked) and deny that it applies to them.
Finetuning generalises a lot but not to removing backdoors?
Seems like we don’t really disagree
The arguments in the paper are representative of Yoshua's views rather than mine, so I won't directly argue for them, but I'll give my own version of the case against
the distinctions drawn here between RL and the science AI all break down at high levels.
It seems common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you'r...
Pre-training, finetuning and RL are all types of training. But sure, expand 'train' to 'create' in order to include anything else like scaffolding. The point is it's not what you do in response to the outputs of the system, it's what the system tries to do.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference, which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it's clearly much better to use some things for validation rather than for training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when trainin...
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously-superior alternative to training against it is validating against it: i.e., when you observe scheming, you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should...
For some reason I've been muttering the phrase, "instrumental goals all the way up" to myself for about a year, so I'm glad somebody's come up with an idea to attach it to.
One time I was camping in the woods with some friends. We were sat around the fire in the middle of the night, listening to the sound of the woods, when one of my friends got out a bluetooth speaker and started playing donk at full volume (donk is a kind of funny, somewhat obnoxious style of dance music).
I strongly felt that this was a bad bad bad thing to be doing, and was basically pleading with my friend to turn it off. Everyone else thought it was funny and that I was being a bit dramatic -- there was nobody around for hundreds of metres, so we weren't...
Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision?
The things we care about aren't convergent instrumental goals for all terminal goals; the kitchen chef's constraints aren't doing that much to keep the kitchen liveable for cockroaches. But it seems to me that this maybe does gesture at a method to get away with pointing at a broad region of goal space instead of a near-pointlike region.
I'd like to do some experiments using your loan application setting. Is it possible to share the dataset?
But - what might the model that AGI uses to downrank visibility and serve up ideas look like?
What I was meaning to get at is that your brain is an AGI that does this for you automatically.
Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).
I do think it's a weaker reason than the second one. The following argument in defence of it is mainly for fun:
I slightly have the feeling that it's like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth... You'll be fine unless you eat the whole apple, in which case you'll be poisoned. Each time you're offered a piece it'...
Other (more compelling to me) reasons for being a "deathist":
Notes systems are nice for storing ideas but they tend to get clogged up with stuff you don't need, and you might never see the stuff you do need again. Wouldn't it be better if
Your brain is that notes system. On the other hand, writing notes is a great way to come up with new ideas.
and nobody else ever seems to do anything useful as a result of such fights
I would guess a large fraction of the potential value of debating these things comes from its impact on people who aren’t the main proponents of the research program, but are observers deciding on their own direction.
Is that priced in to the feeling that the debates don’t lead anywhere useful?
It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.
The notion of ‘fairness’ discussed in e.g. the FDT paper is something like: it’s fair to respond to your policy, i.e. what you would do in any counterfactual situation, but it’s not fair to respond to the way that policy is decided.
I think the hope is that you might get a result like “for all fair decision problems, decision-making procedure A is better than decision-making procedure B by some criterion to do with the outcomes it leads to”.
Without the fairness assumption you could create an instant counterexample to any such result by writing down a decision problem where decision-making procedure A is explicitly penalised, e.g. Omega checks whether you use A and gives you minus a million points if so.
a Bayesian interpretation where you don't need to renormalize after every likelihood computation
How does this differ from using Bayes' rule in odds ratio form? In that case you only ever have to renormalise if at some point you want to convert to probabilities.
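For concreteness, here's the kind of thing I have in mind (toy numbers of my own): in odds form you just multiply likelihood ratios as evidence comes in, and only convert back to a probability once at the end.

```python
# Bayes' rule in odds form (toy example with made-up numbers).
# Posterior odds = prior odds * product of likelihood ratios, so no renormalisation
# is needed after each likelihood computation -- only one conversion at the very end.
prior_odds = 1.0                      # P(H) / P(not H) = 0.5 / 0.5
likelihood_ratios = [3.0, 0.5, 4.0]   # P(e_i | H) / P(e_i | not H) for each observation

posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr              # update step: no normalisation here

posterior_prob = posterior_odds / (1.0 + posterior_odds)  # convert once at the end
print(posterior_odds, posterior_prob)  # 6.0 0.857...
```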
I think all of the following:
I found this really useful, thanks! I especially appreciate details like how much time you spent on slides at first, and how much you do now.
Relevant keyword: I think the term for interactions like this where players have an incentive to misreport their preferences in order to bring about their desired outcome is “not strategyproof”.
The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected.
Like if you had thrown 100 coins and then revealed that 80 were heads.
Suppose there is 10 years of monumental progress in mechanistic interpretability. We can roughly - not exactly - explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn't this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don't think that an AGI built with the current fingers-crossed-it's-faithful paradigm would be safe, what percentage of outputs would mech interp have to hit to beat that?
Seems like 99%+ to me.
I think it may or may not diverge from meaningful natural language in the next couple of years, and importantly I think we’ll be able to roughly tell whether it has. So I think we should just see (although finding other formats for interpretable autoregression could be good too).