All of Raymond D's Comments + Replies

Ah I should emphasise, I do think all of these things could help -- it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.

The two things I'd flag are that (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell which of these are actual disagreements and which are the paper trying to be concise and approachable, so I'll set that aside for n... (read more)

I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, by drawing an overly sharp distinction between agentic and non-agentic systems, and not really engaging with the strongest counterexamples.

To give some examples from the text:

A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations

But I could easily train an AI which simply classif... (read more)

3mattmacdermott
The arguments in the paper are representative of Yoshua's views rather than mine, so I won't directly argue for them, but I'll give my own version of the case against. It seems commonsense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time-horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you're training and the outcomes being achieved. At the top of the spectrum, you have systems trained based on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories. Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories which are less tightly coupled with the outside world. And at the bottom of the spectrum you have systems which are trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated across time very far at all. At the top of the spectrum, if you train a competent system it seems almost guaranteed that it's a powerful agent. It's a machine for pushing the world into certain configurations. But at the bottom of the spectrum it seems much less likely -- its input-output behaviour wasn't selected to be effective at causing certain outcomes. Yes, there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don't think they're representative. And yes, depending on what kind of a system it is you might be able to turn it into an agent using a bit of scaffolding, but if you have the choice not to, that's an importantly different situation compared to the top of the spectrum. And yes, it seems possible such setups lead to an agentic shoggoth completely by accident -- w

Thank you for the very detailed comment! I’m pretty sympathetic to a lot of what you’re saying, and mostly agree with you about the three properties you describe. I also think we ought to do some more spelling-out of the relationship between gradual disempowerment and takeover risk, which isn’t very fleshed-out in the paper — a decent part of why I’m interested in it is that I think it increases takeover risk, in a way that's similar to, but more general than, the way race dynamics increase takeover risk.

I’m going to try to respond to the specific points you lay... (read more)

8Fabien Roger
Thanks for your answer! I find it interesting to better understand the sorts of threats you are describing. I am still unsure at what point the effects you describe result in human disempowerment as opposed to a concentration of power. I agree, but there isn't a massive gap between the interests of shareholders and what companies actually do in practice, and people are usually happy to buy shares of public corporations (buying shares is among the best investment opportunities!). When I imagine your assumptions being correct, the natural consequence I imagine is AI-run companies owned by shareholders who get most of the surplus back. Modern companies are a good example of capital ownership working for the benefit of the capital owner. If shareholders want to fill the world with happy lizards or fund art, they probably will be able to, just like current rich shareholders can. I think for this to go wrong for everyone (not just people who don't have tons of capital) you need something else bad to happen, and I am unsure what that is. Maybe a very aggressive anti-capitalist state? I can see how this could be true (e.g. the politicians are under pressure from a public that has been brainwashed by engagement-maximizing algorithms in a way that undermines the shareholders' power without actually redistributing the wealth, but instead spends it all on big national AI projects that do not produce anything other than more AIs), but I feel like that requires some very weird things to be true (e.g. the engagement-maximizing algorithms above result in a very unlikely equilibrium absent an external force that pushes against shareholders and against redistribution). I can see how the state could enable massive AI projects by massive AI-run orgs, but I think it's way less likely that nobody (e.g. not the shareholders, not the taxpayer, not corrupt politicians, ...) gets massively rich (and able to choose what to consume). About culture, my point was basically that I don't think

The writing here was definitely influenced by Lewis (we quote TAoM in footnote 6), although I think the Choice Transition is broader and less categorically negative. 

For instance in Lewis's criticism of the potential abolition he writes things like:

The old dealt with its pupils as grown birds deal with young birds when they teach them to fly; the new deals with them more as the poultry-keeper deals with young birds— making them thus or thus for purposes of which the birds know nothing. In a word, the old was a kind of propagation—men transmitting manh

... (read more)

Could you expand on what you mean by 'less automation'? I'm taking it to mean some combination of 'bounding the space of controller actions more', 'automating fewer levels of optimisation', 'more of the work done by humans' and maybe 'only automating easier tasks' but I can't quite tell which of these you're intending or how they fit together.

(Also, am I correctly reading an implicit assumption here that any attempts to do automated research would be classed as 'automated ai safety'?)

2Geoffrey Irving
Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in the sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the original argument may port over to the constrained setup. I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed arguments that we can recognize the correctness of automated safety work, so it’s hard to test the “less automation” hypothesis concretely. I don’t think all automated research is automated safety: certainly you can do automated pure capabilities. But I may have misunderstood that part of the question.

When I read this post I feel like I'm seeing four different strands bundled together:
1. Truth-of-beliefs as fuzzy or not
2. Models versus propositions
3. Bayesianism as not providing an account of how you generate new hypotheses/models
4. How people can (fail to) communicate with each other

I think you hit the nail on the head with (2) and am mostly sold on (4), but am sceptical of (1) - similar to what several others have said, it seems to me like these problems don't appear when your beliefs are about expected observations, and only appear when you start to ... (read more)

Strongly agree that active inference is underrated both in general and specifically for intuitions about agency.

I think the literature does suffer from ambiguity over where it's descriptive (ie an agent will probably approximate a free energy minimiser) vs prescriptive (ie the right way to build agents is free energy minimisation, and anything that isn't that isn't an agent). I am also not aware of good work on tying active inference to tool use - if you know of any, I'd be pretty curious.

I think the viability thing is maybe slightly fraught - I expect it'... (read more)

1edbs
Yes, you are very much right.  Active Inference / FEP is a description of persistent independent agents. But agents that have humans building and maintaining and supporting them need not be free energy minimizers! I would argue that those human-dependent agents are in fact not really agents at all, I view them as powerful smart-tools. And I completely agree that machine learning optimization tools need not be full independent agents in order to be incredibly powerful and thus manifest incredible potential for danger. However, the biggest fear about AI x-risk that most people have is a fear about self-improving, self-expanding, self-reproducing AI.  And I think that any AI capable of completely independently self-improving is obviously and necessarily an agent that can be well-modeled as a free-energy minimizer. Because it will have a boundary and that boundary will need to be maintained over time. So I agree with you that AI-tools (non-general optimizers) are very dangerous and not covered by FEP, but AI-agents (general optimizers) are very dangerous for unique reasons but also covered by FEP.

Interesting! I think one of the biggest things we gloss over in the piece is how perception fits into the picture, and this seems like a pretty relevant point. In general the space of 'things that give situational awareness' seems pretty broad and ripe for analysis.

I also wonder how much efficiency gets lost by decoupling observation and understanding - at least in humans, it seems like we have a kind of hierarchical perception where our subjective experience of 'looking at' something has already gone through a few layers of interpretation, giving us basically no unadulterated visual observation, presumably because this is more efficient (maybe in particular faster?). 

I'd be pretty curious to hear about your disagreements if you're willing to share

This seems like a misunderstanding / not my intent. (Could you maybe quote the part that gave you this impression?)


I believe Dusan was trying to say that davidad's agenda limits the planner AI to only writing provable mathematical solutions. To expand, I believe that compared to what you briefly describe, the idea in davidad's agenda is that you don't try to build a planner that's definitely inner aligned, you simply have a formal verification system that ~guarantees what effects a plan will and won't have if implemented.

Oh interesting! I just had a go at testing it on screenshots from a parallel conversation and it seems like it incorrectly interprets those screenshots as also being of its own conversation. 

So it seems like 'recognising things it has said' is doing very little of the heavy lifting and 'recognising its own name' is responsible for most of the effect.

I'll have a bit more of a play around and probably put a disclaimer at the top of the post some time soon.

The 'reward being chance of winning' stuff changes a bit about how the model generalises if it's playing a game with randomness and conditioned on the upper end - it biases the model towards 'expecting risk to pay off'. E.g. if the model plays a 1-step game where it either banks 1 point or gets a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower-EV action. But this isn't super relevant.
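To make that toy game concrete, here's a minimal sketch of the conditioning effect (the action names and the 50/50 action prior are my own illustrative assumptions, not anything from the original setup):

```python
# Toy 1-step game: "bank" gives 1 point for sure; "risk" gives 10 points with 1% chance, else 0.
p_outcome_given_action = {
    "bank": {1: 1.0, 10: 0.0, 0: 0.0},
    "risk": {1: 0.0, 10: 0.01, 0: 0.99},
}
prior = {"bank": 0.5, "risk": 0.5}  # assume both actions appear equally often in training

def expected_value(action):
    return sum(points * p for points, p in p_outcome_given_action[action].items())

def posterior_given_outcome(outcome):
    # P(action | outcome) is proportional to P(outcome | action) * P(action)
    unnorm = {a: p_outcome_given_action[a].get(outcome, 0.0) * prior[a] for a in prior}
    z = sum(unnorm.values())
    return {a: v / z for a, v in unnorm.items()}

print({a: expected_value(a) for a in prior})  # bank: 1.0, risk: 0.1 -> banking is higher EV
print(posterior_given_outcome(10))            # all posterior mass on "risk": conditioning on the
                                              # 10-point outcome forces the lower-EV action
```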

In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least... (read more)

1kave
The point of my setup is that [ P(outcome|corrigible action) ] is very small, so [ P(incorrigible action|outcome) ] is largeish, even if [ Frequency(corrigible action) ] is high and [ Frequency(incorrigible action) ] is low or absent. And this is alignment relevant, because I expect people will ask for never before seen outcomes (by chance or on purpose), some of which may soft-require incorrigible actions. (And of course there could be optimisation daemons that do treacherous turns even when asking for normal actions. But I think your post is setting that aside, which seems reasonable).

Re generalisation - decision transformers don't really have strategies per se; they pick actions moment to moment, and might be systematically miscalibrated about what they'll do in future timesteps. It is true that they'll have some chance at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn't affect performance otherwise.
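On the temperature point, here's a generic softmax-sampling sketch (illustrative only, not code from any particular decision-transformer implementation). Lowering the temperature shrinks the per-timestep probability of a rare action without changing which action is most likely, so the 'small chance at every timestep' effect can be made negligible:

```python
import numpy as np

def action_probs(logits, temperature=1.0):
    # Temperature-scaled softmax over action logits (generic sketch).
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Suppose action 1 is a rare action (e.g. the one that disconnects the button).
print(action_probs([2.0, -2.0], temperature=1.0)[1])  # ~0.018 per timestep
print(action_probs([2.0, -2.0], temperature=0.1)[1])  # ~4e-18: effectively never,
                                                      # while the most likely action is unchanged
```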

Re higher conditioning - I think this shouldn't be true. For the sake of argument we can reframe it as a binary outcome, where the... (read more)

1kave
I'm not sure I understand what the return being the chance of winning would look like. Hopefully not much is changed if I imagine the reward being whether or not the model wins or loses, but let me know if not. I do think in this setup, with a binary or very squashed reward signal, the model will just predict standard winning trajectories. My setup depends on asking the model to generalise out of distribution. That is, when predicting scores, ask for a score higher than any seen before. As long as the dynamics of the shutdown are sufficiently visible in the training data, the model could (in principle) generalise to realise a very high score would require disconnecting the button. I think you think that if the button presses always happen at timestep 100 (say) and with a uniform 1% probability, then this model would put ~1% on disconnecting the button in this scenario, which I agree with. I think this suggests decision transformers will only rarely deal with rare threats, but can learn to deal with rare bottlenecks with high likelihood. ---------------------------------------- I don't follow this sentence. I more-or-less agree with the previous one (though there may be lots of ways for the trajectories to leak information about what overall 'strategy' the model is pursuing, and of course people will try and get that information in there by conditioning on more information).

Thanks! Yeah this isn't in the paper, it's just a thing I'm fairly sure of which probably deserves a more thorough treatment elsewhere. In the meantime, some rough intuitions would be:

  • delusions are a result of causal confounders, which must be hidden upstream variables
  • if you actually simulate and therefore specify an entire Markov blanket, it will screen off all other upstream variables including all possible confounders
  • this is ludicrously difficult for agents with a long history (like a human), but if the STF story is correct, it's sufficient, and crucial
... (read more)
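The screening-off claim is easy to see in a toy simulation (purely illustrative, not from the paper): a hidden confounder makes two variables look dependent, and conditioning on it removes that dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder U drives both the observation X and the outcome Y.
u = rng.normal(size=n)
x = u + rng.normal(scale=0.5, size=n)
y = u + rng.normal(scale=0.5, size=n)

# Marginally, X and Y look strongly related (a spurious dependence, i.e. a potential delusion).
print(np.corrcoef(x, y)[0, 1])               # ~0.8

# Conditioning on U (i.e. including the confounder in what gets specified) screens it off:
# within a narrow slice of U, X carries ~no information about Y.
mask = np.abs(u) < 0.05
print(np.corrcoef(x[mask], y[mask])[0, 1])   # ~0.0
```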

A slightly sideways argument for interpretability: It's a really good way to introduce the importance and tractability of alignment research

In my experience it's very easy to explain to someone with no technical background that

  • Image classifiers have got much much better (like in 10 years they went from being impossible to being something you can do on your laptop)
  • We actually don't really understand why they do what they do (like we don't know why the classifier says this is an image of a cat, even if it's right)
  • But, thanks to dedicated research, we have be
... (read more)

My main takeaway from this post is that it's important to distinguish between sending signals and trying to send signals, because the latter often leads to goodharting.

It's tricky, though, because obviously you want to be paying attention to what signals you're giving off, and how they differ from the signals you'd like to be giving off, and sometimes you do just have to try to change them. 

For instance, I make more of an effort now than I used to, to notice when I appreciate what people are doing, and tell them, so that they know I care. And I think ... (read more)

My main takeaway from this post is that it's important to distinguish between sending signals and trying to send signals, because the latter often leads to goodharting.

That is a wonderful summary.

 

For instance, I make more of an effort now than I used to, to notice when I appreciate what people are doing, and tell them, so that they know I care. And I think this has basically been very good. This is very much not me dropping all effort to signal.

But I think what you're talking about is very applicable here, because if I were just trying to maximise th

... (read more)

if you think timelines are short for reasons unrelated to biological anchors, I don't think Bio Anchors provides an affirmative argument that you should change your mind.

 

Eliezer:  I wish I could say that it probably beats showing a single estimate, in terms of its impact on the reader.  But in fact, writing a huge careful Very Serious Report like that and snowing the reader under with Alternative Calculations is probably going to cause them to give more authority to the whole thing.  It's all very well to note the Ways I Could Be Wrong

... (read more)

The Bio Anchors report is intended as a tool for making debates about AI timelines more concrete, for those who find some bio-anchor-related bound helpful (e.g., some think we should lower bound P(AGI) at some reasonably high number for any year in which we expect to hit a particular kind of "biological anchor"). Ajeya's work lengthened my own timelines, because it helped me understand that some bio-anchor-inspired arguments for shorter timelines didn't have as much going for them as I'd thought; but I think it may have shortened some other folks'.

(The pre... (read more)

The belief that people can only be morally harmed by things that causally affect them is not universally accepted. Personally I intuitively would like my grave to not be desecrated, for instance.

I think we have lots of moral intuitions that have become less coherent as science has progressed. But if my identical twin started licensing his genetic code to make human burgers for people who wanted to see what cannibalism was like, I would feel wronged.

I'm using pretty charged examples here, but the point I'm trying to convey is that there are a lot of moral l... (read more)

You ask a number of good questions here, but the crucial point to me is that they are still questions. I agree it seems, based on my intuitions of the answers, like this isn't the best path. But 'how much would it cost' and 'what's the chance a clone works on something counterproductive' are, to me, not an argument against cloning, but rather arguments for working out how to answer those questions.

Also very ironic if we can't even align clones and that's what gets us.

1Yair Halberstadt
This seems like the sort of thing that would be expensive to investigate and has low potential upside, and where just investigating would have enormous negatives (think loss of weirdness points, and potential for scandal).

I think there are extra considerations to do with the clone's relation to von Neumann. Plausibly, it might be wrong to clone him without his consent, which we can now no longer get. And the whole idea that you might have a right to your likeness, identity, image, and so on, becomes much trickier as soon as you have actually been cloned.

Also there's a bit of a gulf between a parent deciding to raise a child they think might do good and a (presumably fairly large) organisation funding the creation of a child.

I don't have strongly held convictions on these points, but I do think that they're important and that you'd need to have good answers before you cloned somebody.

5Aiyen
How could it be wrong to clone him without his consent? He’s dead, and thus cannot suffer. Moreover, the right to your likeness is to prevent people from being harmed by misuse of said likeness; it doesn’t strike me as a deontological prohibition on copying (or as a valid moral principle to the extent that it is deontological), and he can’t be harmed anymore. Also, how could anyone have a right to their genome that would permit them to veto others having it? If that doesn’t sound absurd to you prima facie, consider identical twins (or if they’re not quite identical enough, preexisting clones). Should one of them have a right to dictate the existence or reproduction of the other? And if not, how can we justify such a genetic copyright in the case of cloning? Cloning, at least when the clone is properly cared for, is a victimless offense, and thus ought not be offensive at all.

Well, I basically agree with everything you just said. I think we have quite different opinions about what politics is, though, and what it's for. But perhaps this isn't the best place to resolve those differences.

Ok I think this is partly fair, but also clearly our moral standards are informed by our society, and in no small part those standards emerge from discussions about what we collectively would like those standards to be, and not just a genetically hardwired disloyalty sensor.

Put another way: yes, in pressured environments we act on instinct, but those instincts don't exist in a vacuum, and the societal project of working out what they ought to be is quite important and pretty hard, precisely because in the moment where you need to refer to it, you will be acting on System 1.

4dkirmani
Yes, these discussions set / update group norms. Perceived defection from group norms triggers the genetically hardwired disloyalty sensor. Right, System 1 contains adaptations optimized to signal adherence to group norms. The societal project of working out what norms other people should adhere to is known as "politics", and lots of people would agree that it's important.

I'm not sure I'm entirely persuaded. Are you saying that the goal of ethics is to accurately predict what people's moral impulse will be in arbitrary situations?

I think moral impulses have changed with times, and it's notable that some people (Bentham, for example) managed to think hard about ethics and arrive at conclusions which massively preempted later shifts in moral values.

Like, Newton's theories give you a good way to predict what you'll see when you throw a ball in the air, but it feels incorrect to me to say that Newton's goal was to find order in... (read more)

4Yair Halberstadt
I'm not saying that's the explicit goal. I'm saying that in practice, if someone suggests a moral theory which doesn't reflect how humans actually feel about most actions, nobody is going to accept it. The underlying human drive behind moral theories is to find order in our moral impulses, even if that's not the system's goal.
4dkirmani
I like this framing! The entire point of having a theory is to predict experimental data, and the only way I can collect data is through my senses. You could construct predictive models of people's moral impulses. I wouldn't call these models laws, though.

Migration - they have a team that will just do it for you if you're on the annual plan, plus there's an exporting plugin (https://ghost.org/docs/migration/wordpress/)

Setup - yeah there are a bunch of people who can help with this and I am one of them

I'll message you

Massive conflict of interest: I blog on ghost, know and like the people at ghost, work at a company that moved from substack to ghost, get paid to help people use ghost, and have a couple more COIs in this vein.

But if you're soliciting takes from somebody from wordpress I think you might also appreciate the case for ghost, which I simply do think is better than substack for most bloggers above a certain size.

Re your cons, ghost:

1 - has a migration team and the ability to do custom routing, so you would be able to migrate your content

3 - supports total... (read more)

6Zvi
Strong upvoting after our conversation so more people see it. Raymond made a strong case, I'm seriously considering it and would like everyone else's take on Ghost, good or bad. Getting the experiences of others who've used it, and can verify that it works and can be trusted (or not, which would be even more useful if true!), would be very helpful. The basic downside versus Substack is lack of Substack's discovery, such as it is, not sure of magnitude of that, and that people won't be used to it and won't have already entered CC info, which will hurt revenue some (but again, how much? Anyone have estimates?) and the start-up costs would be more annoying.  In exchange you get full customization, open source that can easily be self-hosted in a pinch, lower costs given expected size of the audience, better analytics, better improvement in feature sets over time given track records, etc. But I'd have to do at least some work to get that (e.g. you need to add a comment section on your own). 
3Zvi
Thank you for being up front. My basic answer is that I'm vaguely aware Ghost exists, and I'd be open to a pitch/discussion to try and convince me it's superior to Substack or Wordpress, although it would be an uphill battle. If there's human support willing to make the migration and setup easy and help me figure out how to do things, then... maybe? Could set up a call to discuss.

I'd like to throw out some more bad ideas, with fewer disclaimers about how terrible they are because I have less reputation to hedge against.

Inline Commenting

I very strongly endorse the point that it seems bad that someone can make bad claims in a post, which are then refuted in comments which only get read by people who get all the way to the bottom and read comments. To me the obvious (wrong) solution is to let people make inline comments. If nothing else, having a good way within comments to point to what part of the post you want to address feels like... (read more)

4TAG
You can link to comments, so that is an easy technical solution. As ever, it's mainly a cultural problem: if good quality criticism were upvoted, it would appear at the top of the comments anyway, and not be buried.

I'm really enjoying the difference between the number of people who claimed they opted out and the number of people who explicitly wrote the phrase

2AprilSR
I mean I thought the entire point was to say it out loud, but if you want me to write it: I no longer consent to being in a simulation.

What's the procedure?

1Razied
Plan to cryo-preserve yourself at some future time, then create a trust fund with the mission of creating a million simulations of your life as soon as brain simulation becomes feasible and cheap. The fund's mission is to wake up the simulations the instant that the cryo-preservation is set to start. It will distribute the remaining money (which has been compounding, of course) among the simulations it has just woken up and instantiated in the real world.

Follow the white rabbit

5lsusr
There's not just one. We default into several overlapping simulations. Each simulation requires a different method of getting out. One of them is to just stare at a blank wall for long enough.