Text whose primary goal is conveying information (as opposed to emotion, experience or aesthetics) should be skimming friendly. Time is expensive, words are cheap. Skimming is a vital mode of engaging with text, either to evaluate whether it deserves a deeper read or to extract just the information you need. As a reader, you should nurture your skimming skills. As a writer, you should treat skimmers as a legitimate and important part of your target audience. Among other things, this means:
Stronger: as a writer, you should assume your modal reader is a skimmer, both because they are, and because even non-skimmers are only going to remember about as many things as the good skimmer does.
I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:
This concept is of potential interest for several reasons:
People like Andrew Critch and Paul Christiano have criticized MIRI in the past for their "pivotal act" strategy. The latter can be described as "build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order" (e.g. the notorious "melt all GPUs" example). The critics say (justifiably, IMO) that this strategy looks pretty hostile to many actors, can trigger preemptive actions against the project attempting it, and generally fosters mistrust.
Is there a good alternative? The critics tend to assume slow-takeoff multipole scenarios, which makes the comparison with their preferred solutions somewhat "apples and oranges". Suppose that we do live in a hard-takeoff singleton world: what then? One answer is "create a trustworthy, competent, multinational megaproject". Alright, but suppose you can't create a multinational megaproject, but you can build aligned AI unilaterally. What is a relatively cooperative thing you can do which would still be effective?
Here is my proposed rough sketch of such a plan[1]:
IMO it was a big mistake for MIRI to talk about pivotal acts without saying they should even attempt to follow laws.
An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.
The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.
This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...
Here's the sketch of an AIT toy model theorem that in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth's "selection theorem" concept.
Consider any environment $\mu$ of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function $r$.
Suppose that there exists a policy $\pi$ of finite Kolmogorov complexity (i.e. computable) that's optimal for $\mu$ in the slow discount limit. That is,
$$\lim_{\gamma \to 1}\left(\max_{\pi'} \mathrm{E}^{\mu}_{\pi'}[U_\gamma] - \mathrm{E}^{\mu}_{\pi}[U_\gamma]\right) = 0$$
where $U_\gamma$ is the $\gamma$-discounted sum of rewards.
Then, $\mu$ cannot be the only environment with this property. Otherwise, this property could be used to define $\mu$ using a finite number of bits, which is impossible[1]. Since $\mu$ requires infinitely many more bits to specify than $\pi$ and $r$, there have to be infinitely many environments with the same property[2]. Therefore, $\pi$ is a reinforcement learning algorithm for some infinite class of hypotheses.
Moreover, there are natural examples of this setup. For instance, let's construct $\mu$ as an infinite sequence of finite communicating infra-RDP refinements tha...
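To spell out the incompressibility step (my reconstruction, writing $K(\cdot)$ for Kolmogorov complexity): if $\mu$ were the unique environment for which the computable pair $(\pi, r)$ is slow-discount optimal, then that property would itself be a finite description of $\mu$, giving

$$K(\mu) \le K(\pi) + K(r) + O(1) < \infty,$$

which contradicts the assumption that $\mu$ has infinite Kolmogorov complexity.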
I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).
More formally, we consider (some extension of) the delegative IRL setting (i.e. there is a single set of input/output channels the control of which can be toggled between the user and the AI by the AI). Let $\pi^u_\alpha$ be the user's policy in universe $\alpha$ and $\pi^{\mathrm{AI}}$ the AI policy. Let $E$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability $1$ for any policy. Let $V_\alpha$ be the value of a state from the user's subjective POV, in universe $\alpha$. Let $\mu_\alpha$ be the environment in universe $\alpha$. Finally, let $\zeta$ be the AI's prior over universes and ...
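A rough rendering of the desideratum in this notation (a sketch of mine; the symbols above are my reconstruction, and in the full version the outer expectation should reflect the user's beliefs, per the informal statement): the AI's policy should achieve at least as much subjective value as the user's own policy,

$$\mathrm{E}_{\alpha}\!\left[\mathrm{E}^{\mu_\alpha}_{\pi^{\mathrm{AI}}}\left[V_\alpha\right]\right] \;\ge\; \mathrm{E}_{\alpha}\!\left[\mathrm{E}^{\mu_\alpha}_{\pi^u_\alpha}\left[V_\alpha\right]\right]$$

where $V_\alpha$ is evaluated at the state where the event $E$ occurs.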
This idea was inspired by a correspondence with Adam Shimi.
It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed, and what its goals are, without any additional information?
Consider a general reinforcement learning setting: we have a set of actions $\mathcal{A}$, a set of observations $\mathcal{O}$, a policy is a mapping $\pi: (\mathcal{A}\times\mathcal{O})^* \to \Delta\mathcal{A}$, a reward function is a mapping $r: (\mathcal{A}\times\mathcal{O})^* \to [0,1]$, and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)
The simplest attempt at defining "goal-directed intelligence" is requiring that the policy $\pi$ in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows $\pi$, or the prior can believe that behavior not according to $\pi$ leads to some terrible outcome.
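For concreteness, a minimal version of the construction (using the notation above; take $\pi$ deterministic for simplicity): define a reward function that pays exactly for following $\pi$,

$$r_\pi(h) := \begin{cases} 1 & \text{if every action in } h \text{ agrees with } \pi \\ 0 & \text{otherwise} \end{cases}$$

Then $\pi$ is optimal for $r_\pi$ under any prior whatsoever, so without further constraints every policy qualifies as "goal-directed".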
The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...
The recent success of AlphaProof updates me in the direction of "working on AI proof assistants is a good way to reduce AI risk". If these assistants become good enough, it will supercharge agent foundations research[1] and might make the difference between success and failure. It's especially appealing that it leverages AI capability advancement for the purpose of AI alignment in a relatively[2] safe way, so that the deeper we go into the danger zone, the greater the positive impact[3].
EDIT: To be clear, I'm not saying that working on proof assistants in e.g. DeepMind is net positive. I'm saying that a hypothetical safety-conscious project aiming to create proof assistants for agent foundations research, that neither leaks dangerous knowledge nor repurposes it for other goals, would be net positive.
Of course, agent foundations research doesn't reduce to solving formally stated mathematical problems. A lot of it is searching for the right formalizations. However, obtaining proofs is a critical arc in the loop.
There are some ways for proof assistants to feed back into capability research, but these effects seem weaker: at present capability advancement is not primarily dri
I think the main way that proof assistant research feeds into capabilities research is not through the assistants themselves, but through the transfer of the proof assistant research to creating foundation models with better reasoning capabilities. I think researching better proof assistants can shorten timelines.
A thought inspired by this thread. Maybe we should have a standard template for a code of conduct for organizations, which we will encourage all rational-sphere and EA orgs to endorse. This template would include never making people sign non-disparagement agreements (and maybe also forbidding other questionable practices that surfaced in recent scandals). Organizations would be encouraged to create their own codes based on the template and commit to them publicly (and maybe even in some legally binding manner). This flexibility means we don't need a 100% consensus about what has to be in the code, but also, if e.g. a particular org decides to remove a particular clause, that will be publicly visible and salient.
Epistemic status: Leaning heavily into inside view, throwing humility to the winds.
Imagine TAI is magically not coming (CDT-style counterfactual[1]). Then, the most notable-in-hindsight feature of modern times might be the budding of mathematical metaphysics (Solomonoff induction, AIXI, Yudkowsky's "computationalist metaphilosophy"[2], UDT, infra-Bayesianism...) Perhaps this will lead to an "epistemic revolution" comparable only with the scientific revolution in magnitude. It will revolutionize our understanding of the scientific method (probably solving the interpretation of quantum mechanics[3], maybe quantum gravity, maybe boosting the soft sciences). It will solve a whole range of philosophical questions, some of which humanity has been struggling with for centuries (free will, metaethics, consciousness, anthropics...)
But, the philosophical implications of the previous epistemic revolution were not so comforting (atheism, materialism, the cosmic insignificance of human life)[4]. Similarly, the revelations of this revolution might be terrifying[5]. In this case, it remains to be seen which will seem justified in hindsight: the Litany of Gendlin, or the Lovecraftian notion that some ...
Probably not too original but I haven't seen it clearly written anywhere.
There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.
Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.
Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than...
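A rough sketch of the subjective-time amplification loop described above (the `predict_user` function is a hypothetical placeholder for a learned imitator, not a concrete API):

```python
def amplify_by_subjective_time(predict_user, problem, num_iterations):
    """Iterate short predicted work sessions, passing a written legacy forward.

    `predict_user(problem, legacy)` stands in for a model that predicts what the
    user(s) would output after a *short* period of thinking, given the previous
    iteration's written notes. This is a sketch, not a spec of any actual system.
    """
    legacy = ""  # no notes exist before the first "day"
    for _ in range(num_iterations):
        # Each iteration simulates one short, memory-less work session
        # ("groundhog day"): the humans only see the written legacy.
        legacy = predict_user(problem, legacy)
    return legacy
```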
I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!
Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
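A toy simulation of Alice's predicament (everything here, the hash-based decision rule included, is an illustrative stand-in rather than a model of any particular Bayesian agent):

```python
def alice_choose_side(bet_id):
    """Stand-in for a deterministic Bayes-optimal policy: Alice's choice is a
    fixed, predictable function of the bet she is offered."""
    return hash(("alice", bet_id)) % 2

def play(num_bets):
    """Omega predicts Alice (by simply running her decision rule) and arranges
    each bet so that the side she will pick is the losing one."""
    alice_money = 0
    for bet_id in range(num_bets):
        losing_side = alice_choose_side(bet_id)  # Omega's perfect prediction
        alice_side = alice_choose_side(bet_id)   # Alice's actual (identical) choice
        # Alice accepts: at least one side of any bet has non-negative expected
        # utility from her perspective, and Bayesianism gives her no reason to refuse.
        alice_money += -1 if alice_side == losing_side else +1
    return alice_money

print(play(10))  # -10: Alice loses every single bet
```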
A possible counterargument is, we don't need to depart far from Bayesianis
...Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).
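As a toy illustration of the kind of convergence we'd like to establish in general, here is fictitious play in matching pennies: the players' empirical action frequencies converge to the mixed Nash equilibrium (1/2, 1/2). (This particular convergence is a classical result for zero-sum games, not one of the open problems referred to above.)

```python
import numpy as np

# Matching pennies: payoff matrix for the row player (column player gets the negation).
PAYOFF = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])

def fictitious_play(steps=100_000):
    """Each player best-responds to the empirical frequencies of the opponent's past actions."""
    row_counts = np.ones(2)  # uniform pseudo-counts to start
    col_counts = np.ones(2)
    for _ in range(steps):
        row_action = np.argmax(PAYOFF @ (col_counts / col_counts.sum()))
        col_action = np.argmax(-(row_counts / row_counts.sum()) @ PAYOFF)
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

print(fictitious_play())  # both frequency vectors approach (0.5, 0.5)
```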
The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ
...Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" should we expect the agent to acquire.
Does mathematics have finite information content?
First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?
Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication", which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm $A$ plus an infinite sequence of bits $\rho$, s.t. $\rho$ is random in some formal sense (e.g. Martin-Löf) and $A$ can decide the output of any finite computation if it's also given access to $\rho$?
The answer to th...
Epistemic status: no claims to novelty, just (possibly) useful terminology.
[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]
I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).
Class II: Systems that only ever receive synthetic data that has nothing to do with the real world
Examples:
I find it interesting to build simple toy models of the human utility function. In particular, I was thinking about the aggregation of value associated with other people. In utilitarianism this question is known as "population ethics" and is infamously plagued with paradoxes. However, I believe that this is the result of trying to be impartial. Humans are very partial, and this allows coherent ways of aggregation. Here is my toy model:
Let Alice be our viewpoint human. Consider all social interactions Alice has, categorized by some types or properties, and assign a numerical weight to each type of interaction. Let $w_{ij}(t)$ be the weight of the interaction person $i$ had with person $j$ at time $t$ (if there was no interaction at this time then $w_{ij}(t)=0$). Then, we can define Alice's affinity to Bob as
$$a_{AB} := \sum_{t \le t_{\mathrm{now}}} e^{-(t_{\mathrm{now}}-t)/\tau}\, w_{AB}(t)$$
Here $\tau$ is some constant. Ofc $e^{-(t_{\mathrm{now}}-t)/\tau}$ can be replaced by many other functions.
Now, we can then define the social distance of Alice to Bob as
$$d_{AB} := a_{AB}^{-\alpha}$$
Here $\alpha$ is some constant, and the power law was chosen rather arbitrarily; there are many functions of $a_{AB}$ that can work. Dead people should probab...
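A small computational sketch of the toy model, using the exponentially discounted affinity above (the interaction-log format and the numbers are made up for illustration; the reconstruction of the formula itself is mine):

```python
import math

def affinity(interactions, person_a, person_b, now, tau=365.0):
    """Sum of interaction weights between two people, exponentially discounted by age.

    `interactions` is a list of (time, person_i, person_j, weight) tuples; this
    format and the example weights below are illustrative assumptions.
    """
    return sum(w * math.exp(-(now - t) / tau)
               for (t, i, j, w) in interactions
               if {i, j} == {person_a, person_b} and t <= now)

log = [(0, "Alice", "Bob", 2.0), (300, "Alice", "Bob", 1.0), (350, "Alice", "Carol", 3.0)]
print(affinity(log, "Alice", "Bob", now=365))  # older interactions count for less
```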
Some thoughts about embedded agency.
From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.
First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology
...Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)
In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know of is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.
Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B
...Master post for alignment protocols.
Other relevant shortforms:
Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky's Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong; instead, the right thing is quasi-Bayesianism. This leads me to ask, what are the implications of quasi-Bayesianism for human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism such as the proper use of inside and outside view? Is there a rigorous justification for the intuition that we should have more Knightian uncertainty about questions with less empirical evidence? Does any of it influence various effective altruism calculations in surprising ways? What common LessWrongian wisdom does it undermine, if any?
Now that it was mentioned in ACX, I really hope the pear ring will become standard in the rationalist/EA community. So, please signal boost it: bootstrapping is everything, obvious network value effects are obvious. Also, it would be nice if they make a poly version sometime (for now, I will make do with wearing it next to my wedding ring ;), and a way to specify sexual orientation (btw, I'm bi, just saying...)
This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.
It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long-term planning limit, $\gamma \to 1$) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati
...A major impediment in applying RL theory to any realistic scenario is that even the control problem[1] is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:
Epistemic status: most elements are not new, but the synthesis seems useful.
Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).
Arguably the biggest concern with naive debate[1] is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterargument" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti...
In Hanson’s futarchy, the utility function of the state is determined by voting but the actual policy is determined by a prediction market. But, voting incentivizes misrepresenting your values to get a larger share of the pie. So, shouldn’t it be something like the VCG mechanism instead?
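For reference, the standard VCG payment rule (restated here, not part of the original proposal): with reported valuations $\hat v_i$ and chosen outcome $x^* = \arg\max_x \sum_i \hat v_i(x)$, agent $i$ pays

$$p_i = \max_x \sum_{j \neq i} \hat v_j(x) \;-\; \sum_{j \neq i} \hat v_j(x^*),$$

i.e. the externality it imposes on everyone else. This is what makes truthful reporting a dominant strategy, in contrast to voting over the utility function.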
Here's an idea about how to formally specify society-wide optimization, given that we know the utility function of each individual. In particular, it might be useful for multi-user AI alignment.
A standard tool for this kind of problem is Nash bargaining. The main problem with it is that it's unclear how to choose the BATNA (disagreement point). Here's why some simple proposals don't work:
Until now I believed that a straightforward bounded version of the Solomonoff prior cannot be the frugal universal prior because Bayesian inference under such a prior is NP-hard. One reason it is NP-hard is the existence of pseudorandom generators. Indeed, Bayesian inference under such a prior distinguishes between a pseudorandom and a truly random sequence, whereas a polynomial-time algorithm cannot distinguish between them. It also seems plausible that, in some sense, this is the only obstacle: it was established that if one-way functions don't exist (wh...
Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!
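A crude toy illustration of where the uncertainty concentrates (this is a tiny hand-picked hypothesis class standing in for the full Solomonoff mixture, so the numbers are only suggestive):

```python
from fractions import Fraction

# Toy hypothesis class: "all zeros", plus, for each k, "zeros up to index 2**k,
# then all ones". Smaller k means a shorter description, hence a larger prior,
# so highly compressible indices like 2**k are where rival hypotheses still
# carry noticeable weight.

def prior(k):
    return Fraction(1, 2 ** (k + 2))  # shorter description => larger prior weight

def prob_next_bit_is_one(n, max_k=40):
    """Posterior probability that bit n is 1, after observing n zeros."""
    surviving = [(k, prior(k)) for k in range(max_k) if 2 ** k >= n]  # not yet falsified
    total = Fraction(1, 2) + sum(p for _, p in surviving)             # 1/2 for "all zeros"
    ones_here = sum(p for k, p in surviving if 2 ** k == n)           # hypotheses predicting 1 *now*
    return float(ones_here / total)

for n in [3, 4, 5, 1023, 1024, 1025]:
    print(n, prob_next_bit_is_one(n))  # uncertainty spikes at the compressible indices 4 and 1024
```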
This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...
Epistemic status: moderately confident, based on indirect evidence
I realized that it is very hard, if not impossible, to publish an academic work that takes more than one conceptual inferential step away from the current paradigm. Especially when the inferential steps happen in different fields of knowledge.
You cannot publish a paper where you use computational learning theory to solve metaphysics, and then use the new metaphysics to solve the interpretation of quantum mechanics. A physics publication will not understand the first part, or even understand how it
...One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long-term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.
In general, a prior that contains traps will be unlearnable, mea
...In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular at MIRI (see also discussion):
I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max
...I just read Daniel Boettger's "Triple Tragedy And Thankful Theory". There he argues that the thrival vs. survival dichotomy (or at least its implications on communication) can be understood as time-efficiency vs. space-efficiency in algorithms. However, it seems to me that a better parallel is bandwidth-efficiency vs. latency-efficiency in communication protocols. Thrival-oriented systems want to be as efficient as possible in the long-term, so they optimize for bandwidth: enabling the transmission of as much information as possible over any given long per...
One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history corresponds
...In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.
In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating
...The sketch of a proposed solution to the hard problem of consciousness: An entity is conscious if and only if (i) it is an intelligent agent (i.e. a sufficiently general reinforcement learning system) and (ii) its values depend on the presence and/or state of other conscious entities. Yes, this definition is self-referential, but hopefully some fixed point theorem applies. There may be multiple fixed points, corresponding to "mutually alien types of consciousness".
Why is this the correct definition? Because it describes precisely the type of agent who would care about the hard problem of consciousness.
There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.
We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space $\mathcal{X}_D$, label space $\mathcal{Y}_D$, distribution $\mu_D$ and loss function $\ell_D$. Similarly, domain E is represented by inst...
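One way the desideratum might be operationalized (a sketch of mine, not necessarily the intended formalization; write $L_D(h)$ for the expected $\ell_D$-loss of hypothesis $h$ on $\mu_D$, and similarly $L_E$): ask for an $h$ with small loss on D, such that the best E-predictor obtainable by post-processing $h$ is barely better than the best E-predictor from some baseline class $\mathcal{G}'$ that has no access to $h$,

$$L_D(h) \le \epsilon \qquad\text{and}\qquad \min_{g \in \mathcal{G}} L_E(g \circ h) \;\ge\; \min_{g' \in \mathcal{G}'} L_E(g') - \delta.$$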
Here is a way to construct many learnable undogmatic ontologies, including ones with finite state spaces.
A deterministic partial environment (DPE) over action set $\mathcal{A}$ and observation set $\mathcal{O}$ is a pair $(D, \phi)$ where $D \subseteq (\mathcal{A}\times\mathcal{O})^*$ and $\phi: D \to \mathcal{O}$ s.t.
DPEs are equipped with a natural partial order. Namely, $(D_1, \phi_1) \preceq (D_2, \phi_2)$ when $D_1 \subseteq D_2$ and $\phi_2|_{D_1} = \phi_1$.
Let  ...
A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.
Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).
Currently I only have speculations about the solution. But, I have a few desiderata for it:
De...
It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements of a set. The formalization of an "ontology" should probably be a POMDP or a suitable
...