All of harfe's Comments + Replies

harfe8mo12

Mid 2027 seems too late to me for such a candidate to start the official campaign.

For the 2020 presidential election, many democratic candidates announced their campaign in early 2019, and Yang already in 2017. Debates happened already in June 2019. As a likely unknown candidate, you probably need a longer run time to accumulate a bit of fame.

3plex8mo

oh yup, sorry, I meant mid 2026, like ~6 months before the primary proper starts. But could be earlier.

harfe8mo120

Also Musk's regulatory plan is polling well

What plan are you referring to? Is this something AI safety specific?

1MichaelDickens8mo

I don't know what the regulatory plan is, I was just referring to this poll, which I didn't read in full, I just read the title. Reading it now, it's not so much a plan as a vision, and it's not so much "Musk's vision" as it is a viewpoint (that the poll claims is associated with Musk) in favor of regulating the risks of AI. Which is very different from JD Vance's position; Vance's position is closer to the one that does not poll well.

harfe8mo10

I wouldn't say so, I don't think his campaign has made UBI advocacy more difficult.

But an AI notkilleveryoneism campaign seems more risky. It could end up making the worries look silly, for example.

harfe8mo50

Their platform would be whatever version and framing of AI notkilleveryoneism the candidates personally endorse, plus maybe some other smaller things. They should be open that they consider the potential human disempowerment or extinction to be the main problem of our time.

As for the concrete policy proposals, I am not sure. The focus could be on international treaties, or banning or heavy regulation of AI models who were trained with more than a trillion quadrillion (10^27) operations. (not sure I understand the intent behind your question).

5Mitchell_Porter8mo

I guess I'm expressing doubt about the viability of wise or cautious AI strategies, given our new e/acc world order, in which everyone who can, is sprinting uninhibitedly towards superintelligence. My own best hope at this point is that someone will actually solve the "civilizational superalignment" problem of CEV, i.e. learning how to imbue autonomous AI with the full set of values (whatever they are) required to "govern" a transhuman civilization in a way that follows from the best in humanity, etc. - and that this solution will be taken into account by whoever actually wins the race to superintelligence. This is a strategy that can be pursued within the technical communities that are actually designing and building frontier AI. But the public discussion of AI is currently hopeless, it's deeply unrealistic about the most obvious consequences of creating new intelligences that are smarter, faster, and more knowledgeable than any human, namely, that they are going to replace human beings as masters of the Earth, and eventually that they will simply replace the entire human race.

Debate, Oracles, and Obfuscated Arguments

harfe8mo8113

A potentially impactful thing: someone competent runs as a candidate for the 2028 election on an AI notkilleveryoneism^[1] platform. Maybe even two people should run, one for the democratic primary, and one in the republican primary. While getting the nomination is rather unlikely, there could be lots of benefits even if you fail to gain the nomination (like other presidential candidates becoming sympathetic to AI notkilleveryoneism, or more popularity of AI notkilleveryoneism in the population, etc.)

On the other hand, attempting a presidential run can easi... (read more)

9plex8mo

Yeah, this seems worth a shot. If we do this, we should do our own pre-primary in like mid 2027 to select who to run in each party, so that we don't split the vote and also so that we select the best candidate. Someone I know was involved in a DIY pre-primary in the UK which unseated an extremely safe politician, and we'd get a bunch of extra press while doing this.

8Mateusz Bagiński8mo

Did Yang's campaign backfire in some way?

2Mitchell_Porter8mo

What would their actual platform be?

harfe8moΩ221

This can easily be done in the cryptographic example above: B can sample a new number $y = p^{'} \cdot q^{'}$ , and then present $y$ to a fresh copy of A that has not seen the transcript for $x$ so far.

I don't understand how this is supposed to help. I guess the point is to somehow catch a fresh copy of A in a lie about a problem that is different from the original problem, and conclude that A is the dishonest debater?

But couldn't A just answer "I don't know"?

Even if it is a fresh copy, it would notice that it does not know the secret factors, so it could display different ... (read more)

2Geoffrey Irving8mo

You'd need some coupling argument to know that the problems have related difficulty, so that if A is constantly saying "I don't know" to other similar problems it counts as evidence that A can't reliably know the answer to this one. But to be clear, we don't know how to make this particular protocol go through, since we don't know how to formalise that kind of similarity assumption in a plausibly useful way. We do know a different protocol with better properties (coming soon).

What do coherence arguments actually prove about agentic behavior?

harfe8mo92

Some of these are very easy to prove; here's my favorite example. An agent has a fixed utility function and performs Pareto-optimally on that utility function across multiple worlds (so "utility in each world" is the set of objectives). Then there's a normal vector (or family of normal vectors) to the Pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector's components will all be nonnegative (because Pareto surface), and the vector is defined only up to normalizati

... (read more)

johnswentworth8mo100

If you already accept the concept of expected utility maximization, then you could also use mixed strategies to get the convexity-like assumption (but that is not useful if the point is to motivate using probabilities and expected utility maximization).

That is indeed what I had in mind when I said we'd need another couple sentences to argue that the agent maximizes expected utility under the distribution. It is less circular than it might seem at first glance, because two importantly different kinds of probabilities are involved: uncertainty over the environment (which is what we're deriving), and uncertainty over the agent's own actions arising from mixed strategies.

Vanessa Kosoy's Shortform

harfe8moΩ9132

I think there are some subtleties with the (non-infra) bayesian VNM version, which come down to the difference between "extreme point" and "exposed point" of $D$ . If a point is an extreme point that is not an exposed point, then it cannot be the unique expected utility maximizer under a utility function (but it can be a non-unique maximizer).

For extreme points it might still work with uniqueness, if, instead of a VNM-decision-maker, we require a slightly weaker decision maker whose preferences satisfy the VNM axioms except continuity.

8Vanessa Kosoy8mo

Another excellent catch, kudos. I've really been sloppy with this shortform. I corrected it to say that we can approximate the system arbitrarily well by VNM decision-makers. Although, I think it's also possible to argue that a system that selects a non-exposed point is not quite maximally influential, because it's selecting somethings that's very close to delegating some decision power to chance. Also, maybe this cannot happen when D is the inverse limit of finite sets? (As is the case in sequential decision making with finite action/observation spaces). I'm not sure.

Vanessa Kosoy's Shortform

harfe9moΩ8122

For any $Φ, Ψ \in^D$ , if $Θ^{*} = Φ \lor Ψ$ then either $Φ \subseteq Ψ$ or $Ψ \subseteq Φ$ .

I think this condition might be too weak and the conjecture is not true under this definition.

If $Φ_{1} \subseteq Φ_{2}$ , then we have $E_{y \sim ξ} {min}_{μ \in Φ_{2}} E_{x \sim μ} u (x, y) \leq E_{y \sim ξ} {min}_{μ \in Φ_{1}} E_{x \sim μ} u (x, y)$ (because a minimum over a larger set is smaller). Thus, $Φ_{2}$ can only be the unique argmax if $Φ_{1} = Φ_{2}$ .

Consider the example $^D={[0,x]:x∈[0,1]}$ . Then $^D$ is closed. And $Θ^{*} = [0, 1]$ satisfies $Θ^{*} = Φ \lor Ψ ⟹ Φ \subseteq Ψ \lor Ψ \subseteq Φ$ . But per the above it cannot be a unique maximizer.

Maybe the issue can be fixed if we strengthen the condition so that $Φ^{*}$ has to be also minimal with res... (read more)

4Vanessa Kosoy9mo

You're absolutely right, good job! I fixed the OP.

quila's Shortform

harfe10mo10

For a provably aligned (or probably aligned) system you need a formal specification of alignment. Do you have something in mind for that? This could be a major difficulty. But maybe you only want to "prove" inner alignment and assume that you already have an outer-alignment-goal-function, in which case defining alignment is probably easier.

1[anonymous]10mo

correct, i'm imagining these being solved separately

D0TheMath's Shortform

harfe11mo72

insofar as the simplest & best internal logical-induction market traders have strong beliefs on the subject, they may very well be picking up on something metaphysically fundamental. Its simply the simplest explanation consistent with the facts.

Theorem 4.6.2 in logical induction says that the "probability" of independent statements does not converge to $1$ or $0$ , but to something in-between. So even if a mathematician says that some independent statement feels true (eg some objects are "really out there"), logical induction will tell him to feel uncert... (read more)

johnswentworth's Shortform

harfe11mo70

A related comment from lukeprog (who works at OP) was posted on the EA Forum. It includes:

However, at present, it remains the case that most of the individuals in the current field of AI governance and policy (whether we fund them or not) are personally left-of-center and have more left-of-center policy networks. Therefore, we think AI policy work that engages conservative audiences is especially urgent and neglected, and we regularly recommend right-of-center funding opportunities in this category to several funders.

5habryka11mo

I think the comment more confirms than disconfirms John's comment (though I still think it's too broad for other reasons). OP "funding" something historically has basically always meant recommending a grant to GV. Luke's language to me suggests that indeed the right of center grants are no longer referred to GV (based on a vague vibe of how he refers to funders in plural). OP has always made some grant recommendations to other funders (historically OP would probably describe those grants as "rejected but referred to an external funder"). As Luke says, those are usually ignored, and OP's counterfactual effect on those grants is much less, and IMO it would be inaccurate to describe those recommendations as "OP funding something". As I said in the comment I quote in the thread, most OP staff would like to fund things right of center, but GV does not seem to want to, as such the only choice OP has is to refer them to other funders (which sometimes works, but mostly doesn't). As another piece of evidence, when OP defunded all the orgs that GV didn't want to fund anymore, the communication emails that OP sent said that "Open Philanthropy is exiting funding area X" or "exiting organization X". By the same use of language, yes, it seems like OP has exited funding right-of-center policy work. (I think it would make sense to taboo "OP funding X" in future conversations to avoid confusion, but also, I think historically it was very meaningfully the case that getting funded by GV is much better described as "getting funded by OP" given that you would never talk to anyone at GV and the opinions of anyone at GV would basically have no influence on you getting funded. Things are different now, and in a meaningful sense OP isn't funding anyone anymore, they are just recommending grants to others, and it matters more what those others think then what OP staff thinks)

Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

harfe11mo40

it's for the sake of maximizing long-term expected value.

Kelly betting does not maximize long-term expected value in all situations. For example, if some bets are offered only once (or even a finite amount), then you can get better long-term expected utility by sometimes accepting bets with a potential "0"-Utility outcome.

harfe11mo40

This is maybe not the central point, but I note that your definition of "alignment" doesn't precisely capture what I understand "alignment" or a good outcome from AI to be:

‘AGI’ continuing to exist

AGI could be very catastrophic even when it stops existing a year later.

eventually

If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.

ranges that existing humans could survive under

I don't know whether that covers "humans can survive on mars with a space-suit", but even then, if humans evolve/change to handle... (read more)

1Remmelt11mo

Thanks! These are thoughtful points. See some clarifications below: You're right. I'm not even covering all the other bad stuff that could happen in the short-term, that we might still be able to prevent, like AGI triggering global nuclear war. What I'm referring to is unpreventable convergence on extinction. Agreed that could be a good outcome if it could be attainable. In practice, the convergence reasoning is about total human extinction happening within 500 years after 'AGI' has been introduced into the environment (with very very little probability remainder above that). In theory of course, to converge toward 100% chance, you are reasoning about going across a timeline of potentially infinite span. Yes, it does cover that. Whatever technological means we could think of shielding ourselves, or 'AGI' could come up with to create as (temporary) barriers against the human-toxic landscape it creates, still would not be enough. Unfortunately, this is not workable. The mismatch between the (expanding) set of conditions needed for maintaining/increasing configurations of the AGI artificial hardware and for our human organic wetware is too great. Also, if you try entirely changing our underlying substrate to the artificial substrate, you've basically removed the human and are left with 'AGI'. The lossy scans of human brains ported onto hardware would no longer feel as 'humans' can feel, and will be further changed/selected for to fit with their artificial substrate. This is because what humans and feel and express as emotions is grounded in the distributed and locally context-dependent functioning of organic molecules (eg. hormones) in our body.

harfe11mo30

it is the case that most algorithms (as a subset in the hyperspace of all possible algorithms) are already in their maximally most simplified form. Even tiny changes to an algorithm could convert it from 'simplifiable' to 'non-simplifiable'.

This seems wrong to me: For any given algorithm you can find many equivalent but non-simplified algorithms with the same behavior, by adding a statement to the algorithm that does not affect the rest of the algorithm (e.g. adding a line such as foobar1234 = 123 in the middle of a python program)). In fact, I would c... (read more)

Lucius Bushnaq's Shortform

harfe1y12

This is not a formal definition.

Your English sentence has no apparent connection to mathematical objects, which would be necessary for a rigorous and formal definition.

1Remmelt1y

Yes, I agree formalisation is needed. See comment by flandry39 in this thread on how one might go about doing so. Worth considering is that there are actually two aspects that make it hard to define the term ‘alignment’ such to allow for sufficiently rigorous reasoning: 1. It must allow for logically valid reasoning (therefore requiring formalisation). 2. It must allow for empirically sound reasoning (ie. the premises correspond with how the world works). In my reply above, I did not help you much with (1.). Though even while still using the English language, I managed to restate a vague notion of alignment in more precise terms. Notice how it does help to define the correspondences with how the world works (2.): * “That ‘AGI’ continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.” The reason why 2. is important is that just formalisation is not enough. Just describing and/or deriving logical relations between mathematical objects does not say something about the physical world. Somewhere in your fully communicated definition there also needs to be a description of how the mathematical objects correspond with real-world phenonema. Often, mathematicians do this by talking to collaborators about what symbols mean while they scribble the symbols out on eg. a whiteboard. But whatever way you do it, you need to communicate how the definition corresponds to things happening in the real world, in order to show that it is a rigorous definition. Otherwise, others could still critique you that the formally precise definition is not rigorous, because it does not adequately (or explicitly) represent the real-world problem.

2flandry391y

Simplified Claim: that an AGI is 'not-aligned' *if* its continued existence for sure eventually results in changes to all of this planets habitable zones that are so far outside the ranges any existing mammals could survive in, that the human race itself (along with most of the other planetary life) is prematurely forced to go extinct. Can this definition of 'non-alignment' be formalized sufficiently well so that a claim 'It is impossible to align AGI with human interests' can be well supported, with reasonable reasons, logic, argument, etc? The term 'exist' as in "assert X exists in domain Y" as being either true or false is a formal notion. Similar can be done for the the term 'change' (as from "modified"), which would itself be connected to whatever is the formalized from of "generalized learning algorithm". The notion of 'AGI' as 1; some sort of generalized learning algorithm that 2; learns about the domain in which it is itself situated 3; sufficiently well so as to 4; account for and maintain/update itself (its substrate, its own code, etc) in that domain -- these/they are all also fully formalizable concepts. Note that there is no need to consider at all whether or not the AGI (some specific instance of some generalized learning algorithm) is "self aware" or "understands" anything about itself or the domain it is in -- the notion of "learning" can merely mean that its internal state changes in such a way that the ways in which it processes received inputs into outputs are such that the outputs are somehow "better" (more responsive, more correct, more adaptive, etc) with respect to some basis, in some domain, where that basis could itself even be tacit (not obviously expressed in any formal form). The notions of 'inputs', 'outputs', 'changes', 'compute', and hence 'learn', etc, are all, in this way, also formalizeable, even if the notions of "understand", and "aware of" and "self" are not. Notice that this formalization of 'learning', etc, occurs inde

harfe1y30

I think you are broadly right.

So we're automatically giving $x_{1}$ ca. $2^{1000}$ higher probability – even before applying the length penalty $\frac{1}{2^{| p |}}$ .

But note that under the Solomonoff prior, you will get another $2^{- 2000 - | G |}$ penalty for these programs with DEADCODE. So with this consideration, the weight changes from $2^{- 1000}$ (for normal $p_{1}$ ) to $2^{- 1000} (1 + 2^{- | G |})$ (normal $p_{1}$ plus $2^{1000}$ DEADCODE versions of $p_{1}$ ), which is not a huge change.

For your case of "uniform probability until $10^{90}$ " I think you are right about exponential decay.

4Lucius Bushnaq1y

Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don't explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance. Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?

harfe1y20

That point is basically already in the post:

large language models can help document and teach endangered languages, providing learning tools for younger generations and facilitating the transmission of knowledge. However, this potential will only be realized if we prioritize the integration of all languages into AI training data.

Announcing the $200k EA Community Choice

harfe1y12

I have doubts that the claim about "theoretically optimal" apply to this case.

Now, you have not provided a precise notion of optimality, so the below example might not apply if you come up with another notion of optimality or assume that voters collude with each other, or use a certain decision theory, or make other assumptions... Also there are some complications because the optimal strategy for each player depends on the strategy of the other players. A typical choice in these cases is to look at Nash-equilibria.

Consider three charities A,B,C and two pla... (read more)

It's time for a self-reproducing machine

harfe1y40

There is also Project Quine, which is a newer attempt to build a self-replicating 3D printer

New Blog Post Against AI Doom

harfe1y34

This was already referenced here: https://www.lesswrong.com/posts/MW6tivBkwSe9amdCw/ai-existential-risk-probabilities-are-too-unreliable-to

I think it would be better to comment there instead of here.

5Oleg Trott1y

That post was completely ignored here: 0 comments and 0 upvotes during the first 24 hours. I don't know if it's the timing or the content. On HN, which is where I saw it, it was ranked #1 briefly, as I recall. But then it got "flagged", apparently.

1Noah Birnbaum1y

Good point!

Ilya Sutskever created a new AGI startup

harfe1y80

One thing I find positive about SSI is their intent to not have products before superintelligence (note that I am not arguing here that the whole endeavor is net-positive). Not building intermediate products lessens the impact on race dynamics. I think it would be preferable if all the other AGI labs had a similar policy (funnily, while typing this comment, I got a notification about Claude 3.5 Sonnet... ). The policy not to have any product can also give them cover to focus on safety research that is relevant for superintelligence, instead of doing some s... (read more)

3orthonormal1y

Counterpoint: other labs might become more paranoid that SSI is ahead of them. I think your point is probably more correct than the counterpoint, but it's worth mentioning.

Ilya Sutskever created a new AGI startup

harfe1y41

It does not appear paywalled to me. The link that @mesaoptimizer posted is an archive, not the original bloomberg.com article.

New intro textbook on AIXI

harfe1y30

I haven't watched it yet, but there is also a recent technical discussion/podcast episode about AIXI and relatedd topics with Marcus Hutter: https://www.youtube.com/watch?v=7TgOwMW_rnk

(Geometrically) Maximal Lottery-Lotteries Exist

harfe1y31

It suffices to show that the Smith lotteries that the above result establishes are the only lotteries that can be part of maximal lottery-lotteries are also subject to the partition-of-unity condition.

I fail to understand this sentence. Here are some questions about this sentence:

what are Smith lotteries? Ctrl+f only finds lottery-Smith lottery-lotteries, do you mean these? Or do you mean lotteries that are smith?
which result do you mean by "above result"?
What does it mean for a lottery to be part of maximal lottery-lotteries?
does "also subj

... (read more)

3Lorxus1y

To avoid confusion: this post and my reply to it were also on a past version of this post; that version lacked any investigation of dominance criterion desiderata for lottery-lotteries.

2Lorxus1y

Lotteries over the Smith set. That definitely wasn't clear - I'll fix that. * Proposition: (Lottery-lotteries are strongly characterized by their selectivity of partitions of unity) This one. "You can tell whether a lottery-lottery is maximal or not by how good the partitions of unity it admits are." Sorry, didn't really know a good way to link to myself internally and I forgot to number the various statements. Just that some maximal lottery-lottery gives it nonzero probability. Oh no! I thought I caught all the typos! That should be "also subject to the partition-of-unity condition", that is, you look at all the lotteries (which we know are over the Smith set, btw) that some arbitrary maximal lottery-lottery gives any nonzero probability to, and you should expect to be able to sort them into groups by what final probability over candidates they induce; those final probabilities over candidates should themselves result in identical geometric-expected utility for the voterbase. Consider: at this point we know that a maximal lottery-lottery would not just have to be comprised of lottery-Smith lotteries, i.e., lotteries that are in the lottery-Smith set - but also that they have to be comprised entirely of lotteries over the Smith set of the candidate set. Then on top of that, we know that you can tell which lottery-lotteries are maximal by which partitions of unity they admit (that's the "above result"). Note that by "admit" we mean "some subset of the lotteries this lottery-lottery has support over corresponds to it" (this partition of unity). This is the slightly complicated part. The game I described has a mixed strategy equilibrium; this will take the form of some probability distribution over ΔC. In fact it won't just have one, it'll likely have whole families of them. Much of the time, the lotteries randomized over won't be disjoint - they'll both assign positive probability to some candidate. The key is, the voter doesn't care. As far as a

What is the best way to talk about probabilities you expect to change with evidence/experiments?

Answer by harfeApr 19, 202410

A lot of the probabilities we talk about are probabilities we expect to change with evidence. If we flip a coin, our p(heads) changes after we observe the result of the flipped coin. My p(rain today) changes after I look into the sky and see clouds. In my view, there is nothing special in that regard for your p(doom). Uncertainty is in the mind, not in reality.

However, how you expect your p(doom) to change depending on facts or observation is useful information and it can be useful to convey that information. Some options that come to mind:

describe a m

... (read more)

Yitz's Shortform

Basic Inframeasure Theory

This sounds like https://www.super-linear.org/trumanprize. It seems like it is run by Nonlinear and not FTX.

harfe2y51

I think Proposition 1 is false as stated because the resulting functional $f^{+} : M^{\pm} (X) \oplus R \to R$ is not always continuous (wrt the KR-metric). The function $f : [0, 1] \to [0, 1]$ , $x \mapsto x^{1 / 3}$ with $X = [0, 1]$ should be a counterexample. However, the non-continuous functional $f^{+}$ should still be continuous on the set of sa-measures.

Another thing: the space of measures $M^{\pm} (X)$ is claimed to be a Banach space with the KR-norm (in the notation section). Afaik this is not true, while the space is a Banach space with the TV-norm, with the KR-metric/norm it should not be complete and is merely... (read more)

The Learning-Theoretic Agenda: Status 2023

harfe2yΩ330

Regarding direction 17: There might be some potential drawbacks to ADAM. I think its possible that some very agentic programs have relatively low $g$ score. This is due to explicit optimization algorithms being low complexity.

(Disclaimer: the following argument is not a proof, and appeals to some heuristics/etc. We fix $M = M_{0}$ for these considerations too.) Consider an utility function $^U$ . Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for $^U$ ) and has an approximation parameter n (this could be AIXI-tl, plus s... (read more)

3Vanessa Kosoy2y

Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes computational complexity into account instead of K. For example, you can look at the measure C I defined here. More realistically, this measure should be based on the frugal universal prior.

How Would an Utopia-Maximizer Look Like?

harfe2y30

I think the "deontological preferences are isomorphic to utility functions" is wrong as presented.

Firts, the formula has issues with dividing by zero and not summing probabilities to one (and re-using variable $x$ as a local variable in the sum). So you probably meant something like $P (x) = \frac{e^{u (x)}}{\sum_{y \in X} e^{u (y)}} .$ Even then, I dont think this describes any isomorphism of deontological preferences to utility functions.

Utility functions are invariant when multiplied with a positive constant. This is not reflected in the formula.
utility maximizers usually take the a

... (read more)

2Thane Ruthenis2y

Well, that was one embarrassing typo. Fixed, and thanks for pointing it out. It is. Utility functions are invariant under ordering-preserving transformations. Exponentiation is order-preserving (rises monotonically), and so is multiplying by the constant of ∑x∈Xeu(x). Interpreted as a probability distribution, it assigns the same probability to both actions. In practice, you can imagine some sort of infrabayesianism-style imprecise probabilities being involved: the "preference" being indifferent between the vast majority of actions (and so providing no advice one way or another) and only expressing specific for vs. against preferences in a limited set of situations.

harfe2y20

Some of the downvotes were probably because of the unironic use of the term TESCREAL. This term mixes a bunch of different things together, which makes your writing less clear.

1[deactivated]2y

There is a strong argument that the term is bad and misleading. I will concede that.

Buck's Shortform

To open-source or to not open-source, that is (an oversimplification of) the question.

Sure, I'd be happy to read a draft

Buck's Shortform

harfe2y*Ω230

I am going to assume that in the code, when calculating p_alice_win_given_not_caught, we do not divide the $λ_{2}$ term by two (since this is not that consistent with the description. I am also assuming that $0.25 - t_{2}$ is a typo and $0.5 - t_{2}$ is meant, which would also be more consistent with other stuff). So I am going to assume assume a symmetrical version.

Here, P(Alice wins) is $e^{- λ_{1} t_{1} - λ_{2} t_{2}} - e^{- λ_{1} / 2 - λ_{2} / 2} =: f (λ_{1}, λ_{2}, t_{1}, t_{2})$ . Wlog we can assume $0 < t < 1$ (otherwise Bob will run everything or nothing in shielded mode).

We claim that $((λ_{1}^{*}, λ_{2}^{*}), (t_{1}^{*}, t_{2}^{*})) := ((μ^{*}, μ^{*}), (t / 2, t /$ ... (read more)

3Buck2y

Thanks heaps! I wanted the asymmetric version but concurrently with your comment I figured out how to convert from the asymmetric version to the symmetric version. I'll credit you when I post the writeup that this is a part of, and I might run a draft by you in case you want to point out any errors. :)

Five neglected work areas that could reduce AI risk

This article talks a lot about risks from AI. I wish the author would be more specific what kinds of risks they are thinking about. For example, it is unclear which parts are motivated by extinction risks or not. The same goes for the benefits of open-sourcing these models. (note: I haven't read the reports this article is based on, these might have been more specific)

4Justin Bullock2y

Thanks for this comment. I agree there is some ambiguity here on the types of risks that are being considered with respect to the question of open-sourcing foundation models. I believe the report favors the term "extreme risks" which is defined as "risk of significant physical harm or disruption to key societal functions." I believe they avoid the terms of "extinction risk" and "existential risk," but are implying something not too different with their choice of extreme risks. For me, I pose the question above as: What I'm looking for is something like "total risk" versus "total benefit." In other words, if we take all the risks together, just how large are they in this context? In part I'm not sure if the more extreme risks really come from open sourcing the models or simply from the development and deployment of increasingly capable foundation models. I hope this helps clarify!

Provably Safe AI

harfe2y54

Thank you for writing this review.

The strategy assumes we'll develop a good set of safety properties that we're demanding proof of.

I think this is very important. From skimming the paper it seems that unfortunately the authors do not discuss it much. I imagine that actually formally specifying safety properties is actually a rather difficult step.

To go with the example of not helping terrorists spread harmful virus: How would you even go about formulating this mathematically? This seems highly non-trivial to me. Do you need to mathematically formulate ... (read more)

harfe2y30

A common response is that “evaluation may be easier than generation”. However, this doesn't mean evaluation will be easy in absolute terms, or relative to one’s resources for doing it, or that it will depend on the same resources as generation.

I wonder to what degree this is true for the human-generated alignment ideas that are being submitted LessWrong/Alignment Forum?

For mathematical proofs, evaluation is (imo) usually easier than generation: Often, a well-written proof can be evaluated by reading it once, but often the person who wrote up the proof had to consider different approaches and discard a lot of them.

To what degree does this also hold for alignment research?

3Roman Leventov2y

There is an argument that evaluating AI models should be formalised, i.e., turned into verification: see https://arxiv.org/abs/2309.01933 (and discussion on Twitter with Yudkowsky and Davidad).

The Dick Kick'em Paradox

harfe2y94

The setup violates a fairness condition that has been talked about previously.

From https://arxiv.org/pdf/1710.05060.pdf, section 9:

We grant that it is possible to punish agents for using a speciﬁc decision proce- dure, or to design one decision problem that punishes an agent for rational behavior in a diﬀerent decision problem. In those cases, no decision theory is safe. CDT per- forms worse that FDT in the decision problem where agents are punished for using CDT, but that hardly tells us which theory is better for making decisions. [...]

Yet FDT does

... (read more)

1Augs SMSHacks2y

This is not true in cases even where mind-reading agents do not exist. Consider the desert dilemma again with Paul Ekman, except he is actually not capable of reading people's mind. Also assume your goal here is to be selfish and gain as much utility for yourself as possible. You offer him $50 in exchange for him taking you out of the desert and to the nearest village, where you will be able to draw out the money and pay him. He can't read your mind but judges that the expected value is positive given most people in this scenario would be telling the truth. CDT says that you should simply not pay him when you reach the village, but FDT has you $50 short. In this real world scenario, that doesn't include magical mind-reading agents, CDT is about $50 up from FDT. The only times FDT wins against CDT is in strange mind-reading thought experiments that won't happen in the real world.

The Learning-Theoretic Agenda: Status 2023

harfe2y14

Is the organization who offers the prize supposed to define "alignment" and "AGI" or the person who claims the prize? this is unclear to me from reading your post.

Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing! Such formal definitions would be very valuable by themselves (without any proofs). Especially if people widely agree that the definitions capture the important aspects of the problem.

1Remmelt1y

It's less hard than you think, if you use a minimal-threshold definition of alignment: That "AGI" continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.

1Jeffs2y

I envision the org that offers the prize, after broad expert input, would set the definitions and criteria. Yes, surely the definition/criteria exercise would be a hard thing...but hopefully valuable.

harfe2y90

I think the conjecture is also false in the case that utility functions map from $O^{ω}$ to $[0, 1]$ .

Let us consider the case of $A = {a_{1}, a_{2}}$ and $O = {o_{1}, o_{2}}$ . We use $U_{1} (o) = 1 - 2^{- k}$ , where $k$ is the largest integer such that $o$ starts with $o_{1}^{k}$ (and $U_{1} (o_{1}^{ω}) = 1$ ). As for $U_{2}$ , we use $U_{2} (o) = 1 - 3^{- k}$ , where $k$ is the largest integer such that $o$ starts with $o_{1}^{k}$ (and $U_{2} (o_{1}^{ω}) = 1$ ). Both $U_{1}$ and $U_{2}$ are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy $π$ that always picks action $a_{1}$ is the optimal policy for both $U_{1}$ and $U_{2}$ (see proo... (read more)

Infra-Bayesian Logic

Daniel Kokotajlo's Shortform

"inclusion map" refers to the map $i_{X}$ , not the coproduct $X ∐ Y$ . The map $i_{X}$ is a coprojection (these are sometimes called "inclusions", see https://ncatlab.org/nlab/show/coproduct).

A simple example in sets: We have two sets $X$ , $Y$ , and their disjoint union $X ⊔ Y$ . Then the inclusion map $i_{X}$ is the map that maps $x$ (as an element of $X$ ) to $x$ (as an element of $X ⊔ Y$ ).

UK PM: $125M for AI safety

What is an environmental subagent? An agent on a remote datacenter that the builders of the orginal agent don't know about?

Another thing that is not so clear to me in this description: Does the first agent consider the alignment problem of the environmental subagent? It sounds like the environmental subagents cares about paperclip-shaped molecules, but is this a thing the first agent would be ok with?

3Daniel Kokotajlo2y

I think it means it builds a new version of itself (possibly an exact copy, possibly a slimmed down version) in a place where the humans who normally have power over it don't have power or visibility. E.g. it convinces an employee to smuggle a copy out to the internet. My read on this story is: There is indeed an alignment problem between the original agent and the environmental subagent. The story doesn't specify whether the original agent considers this problem, nor whether it solves it. My own version of the story would be "Just like how the AI lab builds the original agent without having solved the alignment problem, because they are dumb + naive + optimistic + in a race with rivals, so too does the original agent launch an environmental subagent without having solved the alignment problem, for similar or possibly even very similar reasons."

harfe2y1114

This does not sound very encouraging from the perspective of AI Notkilleveryoneism. When the announcement of the foundation model task force talks about safety, I cannot find hints that they mean existential safety. Rather, it seems about safety for commercial purposes.

A lot of the money might go into building a foundation model. At least they should also announce that they will not share weights and details on how to build it, if they are serious about existential safety.

This might create an AI safety race to the top as a solution to the tragedy of the

harfe2y55

There is an additional problem where one of the two key principles for their estimates is

Avoid extreme confidence

If this principle leads you to picking probability estimates that have some distance to 1 (eg by picking at most 0.95).

If you build a fully conjunctive model, and you are not that great at extreme probabilities, then you will have a strong bias towards low overall estimates. And you can make your probability estimates even lower by introducing more (conjunctive) factors.

Article Summary: Current and Near-Term AI as a Potential Existential Risk Factor

harfe2y21

Nitpick: The title the authors picked ("Current and Near-Term AI as a Potential Existential Risk Factor") seems to better represent the content of the article than the title you picked for this LW post ("The Existential Risks of Current and Near-Term AI").

Reading the title I was expecting an argument that extinction could come extremely soon (eg by chaining GPT-4 instances together in some novel and clever way). The authors of the article talk about something very different imo.

2André Ferretti2y

Thanks for the suggestion! I updated the title to match the original wording.

Improving Mathematical Reasoning with-Process Supervision

harfe2y51

From just reading your excerpt (and not the whole paper), it is hard to determine how much alignment washing is going on here.

what is aligned chain-of-thought? What would unaligned chain-of-thought look like?
what exactly means alignment in the context of solving math problems?

But maybe these worries can be answered from reading the full paper...

Yoshua Bengio: How Rogue AIs may Arise

harfe2y92

I think overall this is a well-written blogpost. His previous blogpost already indicated that he took the arguments seriously, so this is not too much of a surprise. That previous blogpost was discussed and partially criticized on Lesswrong. As for the current blogpost, I also find it noteworthy that active LW user David Scott Krueger is in the acknowledgements.

This blogpost might even be a good introduction for AI xrisk for some people.

I hope he engages further with the issues. For example, I feel like inner misalignment is still sort of missing from the ... (read more)

5Leon Lang2y

Yoshua Bengio was on David Krueger's PhD thesis committee, according to David's CV.

4Seth Herd2y

It seems like inner misalignment is a subset of "we don't know how to make aligned AI". Maybe he could've fit that in neatly, but adding more is at odds with the function as an intro to AI risk.

TED talk by Eliezer Yudkowsky: Unleashing the Power of Artificial Intelligence