All of Abhimanyu Pallavi Sudhir's Comments + Replies

An article (or writing in general) is probably best structured as a Directed Acyclic Graph, rather than linearly. At each point in the article, there may be multiple possible lines to pursue, or "sidenotes".

I say "directed acyclic graph" rather than "tree", because it may be natural as thinking of paths as joining back at some point, especially if certain threads are optional.

One may also construct an "And-Or tree" to allow multiple versions of the article preferred by conflicting writers, which may then be voted on with some mechanism. These votes can be us... (read more)

3CstineSublime
Why best structured? What quality or cause of reader-comprehension do you think non-linearity in this particular forking format maximizes? Also aren't most articles written with a singular or central proposition in mind (Gian Carlo Rota said that every lecture should say one thing, Quintilian advised all speeches to have one 'basis'), for which all paragraphs essentially converge on that as a conclusion?

"What do you gain from smalltalk?" "I learned not to threaten to nuke countries."

Lmao, amazing.

we'll elide all of the subtle difficulties involved in actually getting RL to work in practice

I haven't properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings of fewer than n tokens, P(s'|s,a) = append(s,a), and reward is given at states containing a stop token, based on some ground-truth verifier like unit tests or formal verification.
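A minimal sketch of that MDP (the names, stop-token convention, and verifier interface here are illustrative stand-ins, not anything from the post):

```python
from dataclasses import dataclass
from typing import Callable, Tuple

STOP_TOKEN = "<|stop|>"  # assumed terminal marker

@dataclass
class StringMDP:
    max_action_tokens: int             # n: cap on tokens per generated action (enforcement omitted here)
    verifier: Callable[[str], float]   # ground truth, e.g. unit tests / formal checker

    def step(self, state: str, action: str) -> Tuple[str, float, bool]:
        next_state = state + action                  # P(s'|s,a): deterministic append
        done = STOP_TOKEN in action                  # reward only at terminal states
        reward = self.verifier(next_state) if done else 0.0
        return next_state, reward, done

# e.g. reward 1.0 iff the completed string contains a (toy) "proof"
mdp = StringMDP(max_action_tokens=128, verifier=lambda s: float("QED" in s))
print(mdp.step("theorem: ...", " proof sketch ... QED" + STOP_TOKEN))
```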

3Jesse Hoogland
You might enjoy this new blogpost from HuggingFace, which goes into more detail.
4Jesse Hoogland
I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don't think the RL setup is actually that straightforward.  If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching. 

The third virtue of rationality, lightness, is wrong. In fact: the more you value information to change your mind on some question, the more obstinate you should be about changing your mind on that question. Lightness implies disinterest in the question.

Imagine your mind as a logarithmic market-maker which assigns some initial subsidy b to any new question Q. This subsidy parameter captures your marginal value for information on Q. But it also measures how hard it is to change your mind — the cost of moving your probability from p to q is ... (read more)
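(For reference, the standard LMSR facts, with b the subsidy/liquidity parameter; this is textbook material, stated here only to make the analogy concrete. Buying "yes" to push the market probability from p up to q costs

$$b \ln\frac{1-p}{1-q},$$

requires $b\left(\ln\frac{q}{1-q} - \ln\frac{p}{1-p}\right)$ shares, and yields an expected profit of $b\, D_{\mathrm{KL}}(\mathrm{Bern}(q)\,\|\,\mathrm{Bern}(p))$ if q is the true probability.)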

1ChristianKl
Being obstinate makes you more prone to motivated cognition. 

I don't understand. The hard problem of alignment/CEV/etc. is that it's not obvious how to scale intelligence while "maintaining" utility function/preferences, and this still applies for human intelligence amplification.

I suppose this is fine if the only improvement you can expect beyond human-level intelligence is "processing speed", but I would expect superhuman AI to be more intelligent in a variety of ways.

8TsviBT
Yeah, there's a value-drift column in the table of made-up numbers. Values matter and are always at stake, and are relatively more at stake here; and we should think about how to do these things in a way that avoids core value drift. You have major advantages when creating humans but tweaked somehow, compared to creating de novo AGI.
* The main thing is that you're starting with a human. You start with all the stuff that determines human values--a childhood, basal ganglia giving their opinions about stuff, a stomach, a human body with human sensations, hardware empathy, etc. Then you're tweaking things--but not that much. (Except for brain emulation, which is why it gets the highest value-drift rating.)
* Another thing is that there's a strong built-in limit on the strength of one human: skull size. (Also other hardware limits: one pair of eyes and hands, one voicebox, probably 1 or 1.5 threads of attention, etc.) One human just can't do that much--at least not without interfacing with many other humans. (This doesn't apply to brain emulation, and potentially applies less for some brain-brain connectivity enhancements.)
* Another key hardware limit is that there's a limit on how much you can reprogram your thinking just by introspection and thinking. You can definitely reprogram the high-level protocols you follow, e.g. heuristics like "investigate border cases"; you can maybe influence lower-level processes such as concept-formation by, e.g., getting really good at making new words; but you maybe can't, IDK, tell your brain to allocate microcolumns to analyzing commonalities between the top 1000 best current candidate microcolumns for doing some task; and you definitely can't reprogram neuronal behavior (except through the extremely blunt-force method of drugs).
* A third thing is that there's a more plausible way to actually throttle the rate of intelligence increase, compared to AI. With AI, there's a huge compute overhang, and you have no idea what dia

Something that seems like it should be well-known, but I have not seen an explicit reference for:

Goodhart’s law can, in principle, be overcome via adversarial training (or, more generally, learning in multi-agent systems)

—aka “The enemy is smart.”

Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.

This doesn’t really require the other agents to act in a way that continuously “improves” the training objective either, it just requires them to be able to constantly throw adversarial... (read more)
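A toy illustration of the claim (entirely my own construction: the hidden rule, the memorizing learner, and the adversary are made-up stand-ins for a real task, model, and red team):

```python
import random

def hidden_rule(x: int) -> bool:
    return x % 7 == 0  # the ground truth the learner is "supposed" to capture

class MemorizingLearner:
    """Goodharts a static objective by memorizing which inputs passed."""
    def __init__(self) -> None:
        self.positives: set[int] = set()

    def predict(self, x: int) -> bool:
        return x in self.positives

    def update(self, labeled: list[tuple[int, bool]]) -> None:
        self.positives |= {x for x, y in labeled if y}

def accuracy(learner: MemorizingLearner, tests: list[int]) -> float:
    if not tests:
        return 1.0  # nothing left for the adversary to exploit
    return sum(learner.predict(x) == hidden_rule(x) for x in tests) / len(tests)

def adversarial_tests(learner: MemorizingLearner, pool: range, n: int) -> list[int]:
    # An adaptive adversary hunts for inputs where the learner currently fails,
    # so the effective objective keeps moving and can't be Goodharted statically.
    failures = [x for x in pool if learner.predict(x) != hidden_rule(x)]
    return random.sample(failures, min(n, len(failures)))

static_tests = random.sample(range(1000), 50)
learner = MemorizingLearner()
learner.update([(x, hidden_rule(x)) for x in static_tests])

print("accuracy on the static proxy:", accuracy(learner, static_tests))          # 1.0
print("accuracy on adversarial cases:",
      accuracy(learner, adversarial_tests(learner, range(1000), 50)))            # ~0.0
```

Training on the adversary's cases, rather than the frozen proxy, is what pushes the learner toward the actual rule.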

3Nathan Helm-Burger
This then sets up something like a Generative Adversarial Network. The trouble is, such a setup is inherently unstable. Without careful guidance, one of the two adversaries will tend to dominate. In predator/prey relationships in nature a stable relationship can come about if the predators starve and reproduce less when they eat too many of the prey. If, however, this effect isn't strong enough (maybe the predators have several prey species), the result is that the prey species can go extinct. Also, the prey species is helped in multi-prey scenarios by becoming less common, and thus less likely to be found and killed by predators and less vulnerable to species-specific disease. Obviously, these specific considerations don't apply in a literal sense. I'm trying to point out the general concept that you need counterbalancing factors for an adversarial relationship to stay stable.
5johnswentworth
This just moves the proxy-being-Goodharted-against from some hardcoded ruleset to a (presumably human) evaluator or selector of adversarial examples.

I think that only applies to particular reward functions, such as those in multi-agent/co-operative environments (agents can include humans, as in RLHF) or in actually interactive proving environments?

Yes, I also realized that "ideas" being a thing is due to bounded rationality -- specifically they are the outputs of AI search. "Proofs" are weirder though, and I haven't seen them distinguished very often. I wonder if this is a reasonable analogy to make:

  • Ideas: search
  • Answers: inference
  • Proofs: alignment
2Amalthea
Ideas come from unsupervised training, answers from supervised training and proofs from RL on a specified reward function.

There is a cliché that there are two types of mathematicians: "theory developers" and "problem solvers". Similarly, there are Dyson’s "birds and frogs", and Robin Hanson's division of the production of knowledge into "framing" and "filling".

It seems to me there are actually three sorts of information in the world:

  • "Ideas": math/science theories and models, inventions, business ideas, solutions to open-ended problems
  • "Answers": math theorems, experimental observations, results of computations
  • "Proofs": math proofs, arguments, evidence, digital signatures, certifications, r
... (read more)
9robo
Humans are computationally bounded, Bayes is not. In an ideal Bayesian perspective:
* Your prior must include all possible theories a priori. Before you opened your eyes as a baby, you put some probability on being in a universe with Quantum Field Theory with SU(3)×SU(2)×U(1) gauge symmetry and updated from there.
* You update with unbounded computation. There's no such thing as proofs, since all proofs are tautological.
Humans are computationally bounded and can't think this way. (riffing)
"Ideas" find paradigms for modeling the universe that may be profitable to track under limited computation. Maybe you could understand fluid behavior better if you kept track of temperature, or understand biology better if you kept track of vital force. With a Bayesian-lite perspective, they kinda give you a prior and places to look where your beliefs are "malleable".
"Proofs" (and evidence) are the justifications for answers. With a Bayesian-lite perspective, they kinda give you conditional probabilities.
"Answers" are useful because they can become precomputed, reified, cached beliefs with high credence and inertia that you can treat as approximately atomic. In a tabletop physics experiment, you can ignore how your apparatus will gravitationally move the earth (and the details of the composition of the earth). Similarly, you can ignore how the tabletop physics experiment will move your belief about the conservation of energy (and the details of why your credences about the conservation of energy are what they are).

Just realized in logarithmic market scoring the net number of stocks is basically just log-odds, lol:
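(Concretely, under standard LMSR conventions with liquidity parameter b and outstanding yes/no share counts, the price is

$$p = \frac{e^{q_{\mathrm{yes}}/b}}{e^{q_{\mathrm{yes}}/b} + e^{q_{\mathrm{no}}/b}}, \qquad\text{so}\qquad q_{\mathrm{yes}} - q_{\mathrm{no}} = b \ln\frac{p}{1-p},$$

i.e. the net share count is the log-odds scaled by b.)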

Your claims about markets seem just wrong to me. Markets generally do what their consumers want, and their failures are largely the result of transaction costs. Some of these transaction costs have to do with information asymmetry (which needs to be solved), but many others that show up in the real world (related to standard problems like negative externalities etc.) can just be removed by construction in virtual markets.

Markets are fundamentally driven by the pursuit of defined rewards or currencies, so in such a system, how do we ensure that the curren

... (read more)

I don't think that AI alignment people doing "enemy of enemy is friend" logic with AI luddites (i.e. people worried about Privacy/Racism/Artists/Misinformation/Jobs/Whatever) is useful.

Alignment research is a luxury good for labs, which means it would be the first thing axed (hyperbolically speaking) if you imposed generic hurdles/costs on their revenue, or if you made them spend on mitigating P/R/A/M/J/W problems.

This "crowding-out" effect is already happening to a very large extent: there are vastly more researchers and capital being devoted to P/R/A/M/J... (read more)

4ShardPhoenix
Sounds plausible, but do you have any numeric evidence for this?

Why aren't adversarial inputs used more widely for captchas?

  • Different models have different adversarial examples?
  • There are only a limited number of known adversarial examples for a given model (discovering them takes time), and they can easily just be manually enumerated?

I have no idea what to make of the random stray downvotes

3Viliam
I agree, some explanation would be welcome. I didn't vote either way, because I do not understand the article, but I am also not confident enough to blame it on you, so I abstain from voting. I suspect the answer is something like: the explanation is not very clear, or possibly wrong.

The simplest way to explain "the reward function isn't the utility function" is: humans evolved to have utility functions because it was instrumentally useful for the reward function / evolution selected agents with utility functions.

(yeah I know maybe we don't even have utility functions; that's not the point)

Concretely: it was useful for humans to have feelings and desires, because that way evolution doesn't have to spoonfeed us every last detail of how we should act, instead it gives us heuristics like "food smells good, I want".

Evolution couldn't just ... (read more)

The use of "Differential Progress" ("does this advance safety more or capabilities more?") by the AI safety community to evaluate the value of research is ill-motivated.

Most capabilities advancements are not very counterfactual ("some similar advancement would have happened anyway"), whereas safety research is. In other words: differential progress measures absolute rather than comparative advantage / disregards the impact of supply on value / measures value as the y-intercept of the demand curve rather than the intersection of the demand and supply curves... (read more)

2Seth Herd
I think there are two important insights here.

One is that counterfactual differential progress is the right metric for weighing whether ideas or work should be published. This seems obviously true but not obvious, so well worth stating, and frequently.

The second important idea is that doing detailed work on alignment requires talking about specific AGI designs. This also seems obviously true, but I think it goes unnoticed and unappreciated a lot of the time. How an AGI arrives at decisions, beliefs, and values is going to be dependent on its specific architectures.

Putting these two concepts together makes the publication decision much more difficult. Should we cripple alignment work in the interest of having more time before AGI? One pat answer I see is "discuss those ideas privately, not publicly". But in practice, this severely limits the number of eyes on each idea, making it vastly more likely that good ideas in alignment aren't spread or worked on quickly.

I don't have any good solutions here, but want to note that this issue seems critically important for alignment work.

I've personally been roadblocked in substantial ways by this dilemma. My background means I have a relatively large amount of knowledge and theories about how the human mind works. I have specific ideas about several possible routes from current AI to x-risk AGI. Each of these routes also has associated alignment plans. But I can't discuss those plans in detail without discussing the AGI designs in detail. They sound vague and unconvincing without the design forms they fit into. This is a sharp limitation on how much progress I can make on these ideas. I have a handful of people who can and will engage in detail in private, limited and vague engagement in public where the ideas must remain vague, and largely I am working on my own. Private feedback indicates that these AGI designs and alignment schemes might well be viable and relevant, although of course a debunking is always one conversation away. T

I think EY once mentioned it in the context of self-awareness or free will or something, and called it something like "complete epistemological panic".

2Gunnar_Zarncke
I assume you refer to  Interesting! Can you find the reference? I'd like to see what the "panic" is about. I think it is a good exercise. It makes clear that all models are wrong, not just those at the top, and they have to prove their value by being useful, i.e., make useful predictions.

Abstraction is like economies of scale

1kave
But like watermelon, it is harder to get value out of as you get a lot of it without chunking

One thing I'm surprised by is how everyone learns the canonical way to handwrite certain math characters, despite learning most things from printed or electronic material. E.g. writing ℝ as "IR" rather than how it's rendered.

I know I learned the canonical way because of Khan Academy, but I don't think "guy handwriting on a blackboard-like thing" is THAT disproportionately common among educational resources?

1bideup
I learned maths mostly by teachers at school writing on a whiteboard, university lecturers writing on a blackboard or projector, and to a lesser extent friends writing on pieces of paper. There was a tiny supplement of textbook-reading at school and a large supplement of printed-notes-reading at university. I would guess only a tiny fraction learn exclusively via typed materials. If you have any kind of teacher, how could you? Nobody shows you how to rearrange an equation by live-typing LaTeX.
2cousin_it
Also check out "personalized pagerank", where the rating shown to each user is "rooted" in what kind of content this user has upvoted in the past. It's a neat solution to many problems.

I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.
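A minimal sketch of that recalculation (the update rule, normalization scheme, and toy data here are my own arbitrary choices, and convergence isn't guaranteed for every vote graph):

```python
from collections import defaultdict

def fixed_point_karma(votes, users, iterations=100):
    """votes: list of (voter, author, +1 or -1); returns karma per user."""
    karma = {u: 1.0 for u in users}                  # start everyone at 1
    for _ in range(iterations):
        new = defaultdict(float)
        for voter, author, sign in votes:
            new[author] += sign * karma[voter]       # weight each vote by the voter's karma
        total = sum(abs(v) for v in new.values()) or 1.0
        # renormalize so total (absolute) karma stays fixed -- no hyperinflation
        karma = {u: len(users) * new.get(u, 0.0) / total for u in users}
    return karma

# Toy example: a cycle of upvotes settles immediately into a fixed point.
users = ["alice", "bob", "carol"]
votes = [("alice", "bob", +1), ("bob", "carol", +1), ("carol", "alice", +1)]
print(fixed_point_karma(votes, users))  # {'alice': 1.0, 'bob': 1.0, 'carol': 1.0}
```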

I feel like this is vaguely somehow related to:

3Dagon
So, https://en.wikipedia.org/wiki/PageRank ?

it's of extremely high immediate value -- it solves IP rights entirely.

It's the barbed wire for IP rights

quick thoughts on LLM psychology

LLMs cannot be directly anthropomorphized. Though something like “a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, and receives inputs that are added to the LLM context” is much more agent-like.

Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.

These feelings add up to a “utility function”, something ... (read more)

[This comment is no longer endorsed by its author]

conditionalization is not the probabilistic version of implies

| P | Q | Q∣P | P → Q |
| --- | --- | --- | --- |
| T | T | T | T |
| T | F | F | F |
| F | T | N/A | T |
| F | F | N/A | T |

Resolution logic for conditionalization: Q if P, else None

Resolution logic for implies: Q if P, else True
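A minimal sketch of the two resolution rules in Python (None standing in for N/A):

```python
from typing import Optional

def conditional(p: bool, q: bool) -> Optional[bool]:
    """Q|P: only defined when the condition P holds."""
    return q if p else None

def implies(p: bool, q: bool) -> bool:
    """Material implication: vacuously true when P is false."""
    return q if p else True

# Reproduces the table above: the rows differ exactly when P is false.
for p in (True, False):
    for q in (True, False):
        print(p, q, conditional(p, q), implies(p, q))
```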

I think that the philosophical questions you're describing actually evaporate and turn out to be meaningless once you think enough about them, because they have a very anthropic flavour.

I don't think that's exactly true. But why do you think that follows from what I wrote?

1mruwnik
I find that if I keep recursing deep enough, after a while I get to a point where I try to work out why I believe that I can believe that logic works. At which point I bounce off a wall, seeing as I'm trying to logically come up with a reason for it. Solipsism is similar - how do you know that you're not a brain in a vat? Or in general Descartes' demon. From my (admittedly most likely confused) understanding, this would be another example of self-reference, albeit in a roundabout way.
2the gears to ascension
HUH. iiiiinteresting...

It's really not, that's the point I made about semantics.

Eh that's kind-of right, my original comment there was dumb.

You overstate your case. The universe contains a finite amount of incompressible information, which is strictly less than the information contained in . That self-reference applies to the universe is obvious, because the universe contains computer programs.

The point is that the universe is certainly a computer program, and that incompleteness applies to all computer programs (to all things with only finite incompressible information). In any case, I explained Gödel with an explicitly empirical example, so I'm not sure what your point is.

0Shmi
That's about as much of an argument as saying that the universe is contained in the decimal expansion of Pi, therefore Pi has all the information one needs.

I agree, and one could think of this in terms of markets: a market cannot capture all information about the world, because it is part of the world.

But I disagree that this is fundamentally unrelated -- here too the issue is that it would need to represent states of the world corresponding to what belief it expresses. Ultimately mathematics is supposed to represent the real world.

0Shmi
Well, I think a better way to put it is that mathematics is sometimes a part of some models of the world. The relationship is world -> inputs -> models <-> math. Whether the part of mathematics that deals with self-reference and soundness and completeness of formal systems corresponds to an accurate and useful model of the world is not at all obvious. So, yeah, some parts of mathematics lossily represent some parts of the world. But it is a pretty weak statement.

No, it doesn't. There is no 1/4 chance of anything once you've found yourself in Room A1.

You do acknowledge that the payout for the agent in room B (if it exists) from your actions is the same as the payout for you from your own actions, which if the coin came up tails is $3, yes?

I don't understand what you are saying. If you find yourself in Room A1, you simply eliminate the last two possibilities so the total payout of Tails becomes 6.

If you find yourself in Room A1, you do find yourself in a world where you are allowed to bet. It doesn't make sense to consider the counterfactual, because you already have gotten new information.

That's not important at all. The agents in rooms A1 and A2 themselves would do better to choose tails than to choose heads. They really are being harmed by the information.

1Dagon
It's totally important. The knowledge that you get paid for guessing T in the cases you're never asked the question is extremely relevant here. It changes the EV from 1/3 * 3 = 1 to 1/3 * 3 + 1/4 * 3 = 1.75.

I see, that is indeed the same principle (and also simpler/we don't need to worry about whether we "control" symmetric situations).

2Charlie Steiner
Yeah I'm still not sure how to think about this sort of thing short of going full UDT and saying something like "well, imagine this whole situation was a game - what would be the globally winning strategy?"

I don't think this is right. A superrational agent exploits the symmetry between A1 and A2, correct? So it must reason that an identical agent in A2 will reason the same way as it does, and if it bets heads, so will the other agent. That's the point of bringing up EDT.

1player_03
Oh right, I see where you're coming from. When I said "you can't control their vote" I was missing the point, because as far as superrational agents are concerned, they do control each other's votes. And in that case, it sure seems like they'll go for the $2, earning less money overall.

It occurs to me that if team 4 didn't exist, but teams 1-3 were still equally likely, then "heads" actually would be the better option. If everyone guesses "heads," two teams are right, and they take home $4. If everyone guesses "tails," team 3 takes home $3 and that's it. On average, this maximizes winnings.

Except this isn't the same situation at all. With group 4 eliminated from the get-go, the remaining teams can do even better than $4 or $3. Teammates in room A2 know for a fact that the coin landed heads, and they automatically earn $1. Teammates in room A1 are no longer responsible for their teammates' decisions, so they go for the $3. Thus teams 1 and 2 both take home $1 while team 3 takes home $3, for a total of $5.

Maybe that's the difference. Even if you know for a fact that you aren't on team 4, you also aren't in a world where team 4 was eliminated from the start. The team still needs to factor into your calculations... somehow. Maybe it means your teammate isn't really making the same decision you are? But it's perfectly symmetrical information. Maybe you don't get to eliminate team 4 unless your teammate does? But the proof is right in front of you. Maybe the information isn't symmetrical because your teammate could be in room B?

I don't know. I feel like there's an answer in here somewhere, but I've spent several hours on this post and I have other things to do today.

Wait, but can't the AI also choose to adopt the strategy "build another computer with a larger largest computable number"?

4Donald Hobson
If the computer has a finite amount of memory and can't build more, this puts a 2^n bound on how long it can wait. If it can build more, it will. The point is that it needs to pick some long-running computation that it can be fairly sure halts eventually. This gets into details about exactly how the AI is handling logical uncertainty.

I don't understand the significance of using a TM -- is this any different from just applying some probability distribution over the set of actions?

3Donald Hobson
Any number that the AI puts out must be computable, and I was reminded of an entry in a largest-computable-number contest, which was "the runtime of the longest-running Turing machine that ZFC + a large cardinal axiom can prove halts (with the proof being at most 3^^^3 symbols long)". This is an almost optimal answer, in that it is well-defined if Con(ZFC + large cardinal axiom), and it beats any answer you can give that relies only on ZFC + the large cardinal axiom. An AI asked to output the largest number it can is playing a game of "name the largest computable number".

Suppose the function U(t) is increasing fast enough, e.g. if the probability of reaching t is exp(-t), then let U(t) be exp(2t), or whatever.

I don't think the question can be dismissed that easily.

It does not require infinities. E.g. you can just reparameterize the problem to the interval (0, 1), see the edited question. You just require an infinite set.

1Dagon
The answer remains the same - as far as we know, the universe is finite and quantized. At any t, there is a probability of reaching t+epsilon, making the standard expected utility calculation (probability X reward) useful.

Infinite t does not necessarily deliver infinite utility.

Perhaps it would be simpler if I instead let t be in (0, 1], and U(t) = {t if t < 1; 0 if t = 1}.

It's the same problem, with 1 replacing infinity. I have edited the question with this example instead.

(It's not a particularly weird utility function -- consider, e.g., if the agent needs to expend a resource such that the utility from expending the resource at time t is some fast-growing function f(t), but never expending the resource gives zero utility. In any case, an adversarial agent can always create this situation.)
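(Spelled out: $\sup_{t \in (0,1]} U(t) = \sup_{t < 1} t = 1$, but no admissible $t$ attains it, since $U(t) < 1$ for every $t < 1$ and $U(1) = 0$; so "pick the expected-utility-maximizing action" selects nothing.)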

3Andrew Kao
I see what you mean now, thanks for clarifying. I'm not personally aware of any "best" or "correct" solutions, and I would be quite surprised if there were one (mathematically, at least, we know there's no single maximizer). But I think concretely speaking, you can restrict the choice set of t to a compact set of size (0, 1 − ε] and develop the appropriate bounds for the analysis you're interested in. Maybe not the most satisfying answer, but I guess that's Analysis in a nutshell.

I see. So the answer is that it is indeed true that Godel's statement is true in all models of second-order PA, but unprovable nonetheless since Godel's completeness theorem isn't true for second-order logic?

4Kutta
Yes. To expand a bit, in fact the straightforward way to show that second-order arithmetic isn't complete in the first sense is by using the Gödel sentence G. G says via an encoding that G is not provable in second-order arithmetic. Since the only model (up to isomorphism) is the model with the standard natural numbers, an internal statement which talks about encoded proofs is interpreted in the semantics as a statement which talks about actual proof objects of second-order arithmetic. This is in contrast to first-order arithmetic where we can interpret an internal statement about encoded proofs as ranging over nonstandard numbers as well, and such numbers do not encode actual proof objects. Therefore, when we interpret second-order G, we always get the semantic statement "G is not provable in second-order arithmetic". From this and the soundness of the proof system, it follows that G is not provable. Hence, G holds in all models (recall that there is just one model up to isomorphism), but is not provable.

This seems to be relevant to calculations of climate change externalities, where the research is almost always based on the direct costs of climate change if no one modified their behaviour, rather than the cost of building a sea wall, or planting trees.

Disagree. Daria considers the colour of the sky an important issue because it is socially important, not because it is of actual cognitive importance. Ferris recognizes that it doesn't truly change much about his beliefs, since their society doesn't have any actual scientific theories predicting the colour of the sky (if they did, the alliances would not be on uncorrelated issues like taxes and marriage), and bothers with things he finds to be genuinely more important.

I'm not sure your interpretation of logical positivism is what the positivists actually say. They don't argue against having a mental model that is metaphysical, they point out that this mental model is simply a "gauge", and that anything physical is invariant under changes of this gauge.

Interesting. Did they promise to do so beforehand?

In any case, I'm not surprised the Soviets did something like this, but I guess the point is really "Why isn't this more widespread?" And also: "why does this not happen with goals other than staying in power?" E.g. why has no one tried to pass a bill that says "Roko condition AND we implement this-and-this policy". Because otherwise it seems that the stuff the Soviets did was motivated by something other than Roko's basilisk.

0avturchin
It was not promised, but anyone who read the story of previous revolutions, like the French one, could have guessed.

But that's not Roko's basilisk. Whether or not you individually vote for the candidate does not affect you as long as the candidate wins.

3avturchin
In early Soviet history they actually checked whether a person had actually supported the winning party by looking at what they did 10-20 years ago. If a person had been a member of the wrong party in 1917, he could be prosecuted in the 1930s.

The "Dutch books" example is not restricted to improper priors. I don't have time to transform this into the language of your problem, but the basically similar two-envelopes problem can arise from the prior distribution:

f(x) = (1/4)·(3/4)^n where x = 2^n (n ≥ 0); f(x) = 0 if x cannot be written in this form

Considering this as a prior on the amount of money in an envelope, the expectation of the envelope you didn't choose is always 8/7 of the envelope you did choose.
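(A quick check of the 8/7 figure, under the usual reading that f is the prior on the smaller envelope's amount and the other envelope holds double. If you observe $x = 2^n$ with $n \ge 1$, it is either the smaller member of the pair $(2^n, 2^{n+1})$, with prior weight proportional to $(3/4)^n$, or the larger member of $(2^{n-1}, 2^n)$, with weight proportional to $(3/4)^{n-1}$; the conditional probabilities are therefore $3/7$ and $4/7$, giving

$$E[\text{other} \mid \text{yours} = x] = \tfrac{3}{7}\cdot 2x + \tfrac{4}{7}\cdot\tfrac{x}{2} = \tfrac{8}{7}x.$$

For $n = 0$ the other envelope is certainly $2x$.)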

There is no actual mathematical contradiction with this sort of thing -- wit... (read more)