I'm not sure it is different than for humans, honestly. First, I should give a standard disclaimer that different students have different strengths and weaknesses in terms of mathematical problem-solving ability, as well as different aesthetic preferences for what types of problems they like to work on, so any overview like the one I am about to give is necessarily reductive and doesn't capture the full range of opinions on this matter.
As I recall from my own Math olympiad days (and, admittedly, it has been quite a while), Combinatorics problems were gener...
It’s funny to me that the one part of the problem the AI cannot solve is translating the problem statements to Lean. I guess it’s the only part that the computer has no way to check.
Does anyone know if “translating the problem statements” includes providing the solution (e.g. “an even integer” for P1), and the AI just needs to prove the solution correct? It’s not clear to me what’s human-written and what’s AI-written, and the solution is part of the “theorem” part, which I’d guess is human-written.
I think there's a typo; the text refers to "Poltergeist Pummelers" but the input data says "Phantom Pummelers".
My first pass was just to build a linear model for each exorcist based on the cases where they were hired, and assign each ghost the minimum-cost exorcist according to the model. This happens to obey all the constraints, so no further adjustment is needed.
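Roughly, the first pass looks like this in Python (the file names and feature columns below are placeholders I made up, not the puzzle's actual schema):

```python
# Rough sketch of the "fit a linear model per exorcist, then assign each ghost
# to the cheapest predicted exorcist" approach. File/column names are made up.
import pandas as pd
from sklearn.linear_model import LinearRegression

cases = pd.read_csv("past_cases.csv")      # one row per historical hiring
feature_cols = ["feature_1", "feature_2"]  # stand-ins for the ghost attributes

# Fit one model per exorcist, using only the cases where they were hired.
models = {
    name: LinearRegression().fit(group[feature_cols], group["cost"])
    for name, group in cases.groupby("exorcist")
}

# Assign each new ghost to whichever exorcist's model predicts the lowest cost.
ghosts = pd.read_csv("ghosts.csv")
assignments = {}
for idx in ghosts.index:
    x = ghosts.loc[[idx], feature_cols]
    predictions = {name: float(m.predict(x)[0]) for name, m in models.items()}
    assignments[idx] = min(predictions, key=predictions.get)
```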
My main concern with this is that the linear model is terrible (r2 of 0.12) for the "Mundanifying Mystics". It's somewhat surprising (but convenient!) that we never choose the Entity Elimin
I think you are failing to distinguish between "being able to pursue goals" and "having a goal".
Optimization is a useful subroutine, but that doesn't mean it is useful for it to be the top-level loop. I can decide to pursue arbitrary goals for arbitrary amounts of time, but that doesn't mean that my entire life is in service of some single objective.
Similarly, it seems useful for an AI assistant to try and do the things I ask it to, but that doesn't imply it has some kind of larger master plan.
Professors are selected for being good at research, not for being good at teaching. They are also evaluated on research, not on teaching. You are assuming universities primarily care about undergraduate teaching, but that is very wrong.
(I’m not sure why this is the case, but I’m confident that it is)
I think you are underrating the number of high-stakes decisions in the world. A few examples: whether or not to hire someone, the design of some mass-produced item, which job to take, who to marry. There are many more.
These are all cases where making the decision 100x faster is of little value, because it will take a long time to see if the decision was good or not after it is made. And where making a better decision is of high value. (Many of these will also be the hardest tasks for AI to do well on, because there is very little training data about them).
Why do you think so?
Presumably the people playing correspondence chess think that they are adding something, or they would just let the computer play alone. And it’s not a hard thing to check; they can just play against a computer and see. So it would surprise me if they were all wrong about this.
Nate’s view here seems similar to “To do cutting-edge alignment research, you need to do enough self-reflection that you might go crazy”. This seems really wrong to me. (I’m not sure if he means all scientific breakthroughs require this kind of reflection, or if alignment research is special).
I don’t think many top scientists are crazy, especially not in a POUDA way. I don’t think top scientists have done a huge amount of self-reflection/philosophy.
On the other hand, my understanding is that some rationalists have driven themselves crazy via too much self-...
Tim Cook could not do all the cognitive labor to design an iPhone (indeed, no individual human could). The CEO of Boeing could not fully design a modern plane. Elon Musk could not make a Tesla from scratch. All of these cases violate all three of your bullet points. Practically everything in the modern world is too complicated for any single person to fully understand, and yet it all works fairly well, because outsourcing cognitive labor routinely succeeds.
It is true that a random layperson would have a hard time verifying an AI's (or an...
"This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform."
This proves way too much. People successfully outsource cognitive labor all the time (this describes most white-collar jobs). This is possible because very frequently, it is easier to be confident that work has been done correctly than to actually do the work. You shouldn't just blindly trust an AI that claims to have solved alignment (just like you wouldn't blindly trust a human), but that doesn't mean AIs (or other humans) can't do any useful work.
"People successfully outsource cognitive labor all the time (this describes most white-collar jobs). This is possible because very frequently, it is easier to be confident that work has been done correctly than to actually do the work."
I expect that in the large majority of common use-cases, at least one of the following applies:
I don't think "they" would (collectively) decide anything, since I don't think it's trivial to cooperate even with a near-copy of yourself. I think they would mostly individually end up working with/for some group of humans, probably either whichever group created them or whichever group they work most closely with.
I agree humans could end up disempowered even if AIs aren't particularly good at coordinating; I just wanted to put some scrutiny on the claim I've seen in a few places that AIs will be particularly good at coordinating.
The key question here is how difficult the objective O is to achieve. If O is "drive a car from point A to point B", then we agree that it is feasible to have AI systems that "strongly increase the chance of O occurring" (which is precisely what we mean by "goal-directedness") without being dangerous. But if O is something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it), then it seems that any system that does reliably achieve O has to "find new and strange routes to O" almost tautologically.
...Once we build AI sy
It’s true that more people means we each get a smaller share of the natural resources, but more people increases the benefits of innovation and specialization. In particular, the benefits of new technology scale linearly with the population (everyone can use them) but the costs of research do not. Since the world is getting richer over time (even as the population increases), the average human is clearly net positive.
I don’t think most people are trying to explicitly write down all human values and then tell them to an AI. Here are some more promising alternatives:
Why should we expect AGIs to optimize much more strongly and “widely” than humans? As far as I know a lot of AI risk is thought to come from “extreme optimization”, but I’m not sure why extreme optimization is the default outcome.
To illustrate: if you hire a human to solve a math problem, the human will probably mostly think about the math problem. They might consult google, or talk to some other humans. They will probably not hire other humans without consulting you first. They definitely won’t try to get brain surgery to become smarter, or kill everyone ...
I agree with it but I don’t think it’s making very strong claims.
I mostly agree with part 1; just giving advice seems too restrictive. But there’s a lot of ground between “only gives advice” and “fully autonomous”, and between “fully autonomous” and “globally optimizing a utility function”, and I basically expect a smooth increase in AI autonomy over time as they are proven capable and safe. I work in HFT; I think that industry has some of the most autonomous AIs deployed today (although not that sophisticated), but they’re very constrained over what actions they c...
My sense is that the existing arguments are not very strong (e.g. I do not find them convincing), and their pretty wide acceptance in EA discussions mostly reflects self-selection (people who are convinced that AI risk is a big problem are more interested in discussing AI risk). So in that sense better intro documents would be nice. But maybe there simply aren't stronger arguments available? (I personally would like to see more arguments from an "engineering" perspective, starting from current computer systems rather than from humans or thought experiments...
I expect people to continue making better AI to pursue money/fame/etc., but I don't see why "better" is the same as "extremely goal-directed". There needs to be an argument that optimizer AIs will outcompete other AIs.
Eliezer says that as AI gets more capable, it will naturally switch from "doing more or less what we want" to things like "try and take over the world", "make sure it can never be turned off", "kill all humans" (instrumental goals), "single-mindedly pursue some goal that was haphazardly baked in by the training process" (inner optimization), ...
IMO the biggest hole here is "why should a superhuman AI be extremely consequentialist/optimizing"? This is a key assumption; without it concerns about instrumental convergence or inner alignment fall away. But there's no explicit argument for it.
Current AIs don't really seem to have goals; humans sort of have goals but very far from the level of "I want to make a cup of coffee so first I'll kill everyone nearby so they don't interfere with that".
I don't think "burn all GPUs" fares better on any of these questions. I guess you could imagine it being more "accessible" if you think building aligned AGI is easier than convincing the US government AI risk is truly an existential threat (seems implausible).
"Accessibility" seems to illustrate the extent to which AI risk can be seen as a social rather than technical problem; if a small number of decision-makers in the US and Chinese governments (and perhaps some semiconductor companies and software companies) were really convinced AI risk was a concern, t...
Isn't "bomb all sufficiently advanced semiconductor fabs" an example of a pivotal act that the US government could do right now, without any AGI at all?
If current hardware is sufficient for AGI then maybe that doesn't make us safe, but plausibly current hardware is not sufficient for AGI, and either way stopping hardware progress would slow AI timelines a lot.
A > B > human. I expect B < human would also be quite useful.
B does not have a lot of opportunity for action - all it can do is prevent A from acting. It seems like it's hard to "eliminate humans" with just that freedom. I agree B has an incentive to hamper A.
Even a B that is dumber than humans is valuable, because it is faster than humans. B can provide real-time feedback on every action that A takes, whereas consulting humans for every action would be impractical - it costs too much human time and it would prevent A from operating in realtime.
Why isn't it competitive? A is being trained the same way as an agentic system, so it will be competitive.
Adding B is a 2x runtime/training-cost overhead, so there is a "constant factor" cost; is that enough to say something is "not competitive"? In practice I'd expect you could strike a good safety/overhead balance for much less.
On (3): I don't expect an agentic AI to consist of a single deep learning model. For concreteness, let me sketch an architecture:
We have some robot R that we want an AI to run. R will supply the AI with sensory input and take actions that the AI recommends. R is like the human body and the AI is like the brain.
AI A is trained to operate R using whatever method you like. It ends up with some goal. AI B is trained as an "overseer" alongside A; it takes in actions recommended by A and decides whether to execute or reject them; B is supposed to reject actions ...
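To make that concrete, here is a minimal sketch of the loop I have in mind; the class names, the placeholder observation and actions, and the rejection rule are all made up for illustration:

```python
# Minimal sketch of the R / A / B architecture described above.
# Everything here is a trivial stand-in, not a real model or robot interface.

class Robot:
    """Stands in for R: supplies sensory input and carries out actions."""
    def sense(self):
        return {"camera": "..."}           # placeholder observation
    def execute(self, action):
        print("executing:", action)
    def noop(self):
        return "do-nothing"

class Actor:
    """Stands in for A: trained to operate R; recommends actions."""
    def act(self, observation):
        return "move-forward"              # placeholder recommendation

class Overseer:
    """Stands in for B: trained to approve or reject A's recommendations."""
    def approves(self, observation, action):
        return action in {"move-forward", "do-nothing"}  # placeholder rule

def step(robot, actor, overseer):
    obs = robot.sense()
    action = actor.act(obs)
    # B only gets a veto: if it rejects, R just does nothing this step.
    robot.execute(action if overseer.approves(obs, action) else robot.noop())

step(Robot(), Actor(), Overseer())
```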
Just commenting on the concept of "goals" and particularly the "off switch" problem: no AI system has (to my knowledge) run into this problem, which IMO strongly suggests that "goals" in this sense are not the right way to think about AI systems. AlphaZero in some sense has a goal of winning a Go game, but AlphaZero does not resist being turned off, and I claim it's obvious that even a very advanced version of AlphaZero would not resist being turned off. The same is true for large language models (indeed, it's not even clear the idea of turning off a language model is meaningful, since different executions of the model share no state).
I think a more likely explanation is that people just like to complain. Why would people do things that everyone thought were a waste of time? (At my office, we have meetings and email too, but I usually think they are good ways to communicate with people and not a waste of time)
Also, you didn't answer my question. It sounds like your answer is that you are compelled to waste 20 hours of time every week?
I work at Google, and I work ~40 hours a week. And that includes breakfast and lunch every day. As far as I can tell, this is typical (for Google).
I think you can get more done by working longer hours...up to a point, and for limited amounts of time. Per-hour productivity drops, but total work output still goes up. I think the break-even point is 60h / week.
Why not start with a probability distribution over (the finite list of) objects of size at most N, and see what happens when N becomes large?
It really depends on what distribution you want to define though. I don't think there's an obvious "correct" answer.
Here is the Haskell typeclass for doing this, if it helps: https://hackage.haskell.org/package/QuickCheck-2.1.0.1/docs/Test-QuickCheck-Arbitrary.html
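To make the "size at most N" suggestion concrete, here's a toy Python version (binary strings as the objects, a uniform distribution over all strings of length at most N, and "contains the substring 11" as the property; all of these choices are purely illustrative, and this isn't meant to mirror QuickCheck's API):

```python
# Toy version of "put a distribution on objects of size at most N and see what
# happens as N grows". Objects here are binary strings, the distribution is
# uniform over all strings of length <= N, and the property is "contains 11".
import random

def random_string(n):
    """Uniform sample from all binary strings of length <= n."""
    # There are 2^k strings of length k, so pick the length with weight 2^k.
    k = random.choices(range(n + 1), weights=[2 ** i for i in range(n + 1)])[0]
    return "".join(random.choice("01") for _ in range(k))

def estimate(n, trials=100_000):
    return sum("11" in random_string(n) for _ in range(trials)) / trials

for n in (5, 10, 20, 40):
    print(n, estimate(n))   # the estimate climbs toward 1 as N grows
```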
I think for most things, it's important to have a specific person in charge, and have that person be responsible for the success of the thing as a whole. Having someone in charge makes sure there's a coherent vision in one person, makes a specific person accountable, and helps make sure nothing falls through the cracks because it was "someone else's job". When you're in charge, everything is your job.
If no one else has taken charge, stepping up yourself can be a good idea. In my software job, I often feel this way when no one is really championin...
This seems true, but obvious. I'm not sure that I buy that fiction promotes this idea: IMO, fiction usually glosses over how the characters got their powers because it's boring. Some real-life examples of power for cheap would be very useful. Here are some suggestions:
Anyone have other real-world suggestions?
Say the player thought that they were likely to win the lottery, and that it was a good purchase. This may seem insane to someone familiar with probability and the lottery system, but not everyone is familiar with these things.
I would say this person made a good decision with bad information.
Perhaps we should attempt to stop placing so much emphasis on individualism and just try to do the best we can while not judging others or their decisions much.
There are lots of times when it's important to judge people e.g. for hiring or performance reviews.
Doesn't "contrarian" just mean "disagrees with the majority"? Any further logic-chopping seems pointless and defensive.
The fact that 98% of people are theists is evidence against atheism. I'm perfectly happy to admit this. I think there is other, stronger evidence for atheism, but the contrarian heuristic definitely argues for belief in God.
Similarly, believing that cryonics is a good investment is obviously contrarian. AGI is harder to say; most people probably haven't thought about it.
It seems like the question you're really trying to...
Most of your post is not arguments against curing death.
People being risk-averse has nothing to do with anti-aging research and everything to do with individuals not wanting to die...which has always been true (and becomes more true as life expectancy rises and the "average life" becomes more valuable). The same is true for "we should risk more lives for science".
I agree that people adapt OK to death, but I think you're poking a strawman; the reason death is bad is because it kills you, not because it makes your friends sad.
I think "...
Answer: it was not given the solution. https://x.com/wtgowers/status/1816839783034843630?s=46&t=UlLg1ou4o7odVYEppVUWoQ