All of David Udell's Comments + Replies

I sampled hundreds of short context snippets from openwebtext, and measured ablation effects averaged over those sampled forward-passes. Averaged over those hundreds of passes, I didn't see any real signal in the logit effects, just a layer of noise due to the ablations.

More could definitely be done on this front. I just tried something relatively quickly that fit inside of GPU memory and wanted to report it here.
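
For concreteness, a minimal sketch of the kind of measurement I mean, assuming a HuggingFace-style model/tokenizer and a hypothetical `ablate_dimension` context manager standing in for whatever hook zeroes out one autoencoder dimension (an illustration, not the exact code I ran):

```python
import torch

def mean_ablation_effect(model, tokenizer, texts, ablate_dimension, dim):
    """Average (ablated - clean) next-token logit differences over sampled contexts.

    `model` maps token ids to logits, and `ablate_dimension(model, dim)` is assumed
    to be a context manager that zeroes autoencoder dimension `dim` during the
    forward pass; both are stand-ins for your actual setup.
    """
    diffs = []
    for text in texts:
        tokens = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            clean = model(tokens).logits[0, -1]
            with ablate_dimension(model, dim):
                ablated = model(tokens).logits[0, -1]
        diffs.append(ablated - clean)
    # If the dimension carries real signal, it should survive this averaging;
    # what I saw instead looked like noise centered near zero.
    return torch.stack(diffs).mean(dim=0)
```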

2RogerDearnaley
So this suggests that if you ablate a random feature, then in contexts where that feature doesn't apply, doing so will still have some (apparently random) effect on the model's emitted logits, suggesting that there is generally some crosstalk/interdependence between features and that, to some extent, "(almost) everything depends on (almost) everything else" — would that be your interpretation? If so, that's not entirely surprising for a system that relies on only approximate orthogonality, but it could be inconvenient. For example, it suggests that any security/alignment procedure that depended on effectively ablating a large number of specific circuits (once we had identified circuits in need of ablation) might introduce a level of noise that scales with the number of circuits ablated, and might require, say, some subsequent finetuning on a broad corpus to restore previous levels of broad model performance.

Could you hotlink the boxes on the diagrams to that, or add the resulting content as hover text to areas in them, or something? This might be hard to do on LW: I suspect some JavaScript code would be required, but perhaps a library exists for this?

My workaround was to have the dimension links laid out below each figure.

3: [14555]

4: [5030]

5: [10603]

6: [3290]

7: [4330]

My current "print to flat .png" approach wouldn't support hyperlinks, and I don't think LW supports .svg images.

That line was indeed quite poorly phrased. It now reads:

At the bottom of the box, blue or red token boxes show the tokens most promoted (blue) and most suppressed (red) by that dimension.

That is, you're right. Interpretability data on an autoencoder dimension comes from seeing which token probabilities are most promoted and suppressed when that dimension is ablated, relative to leaving its activation value alone. That's an ablation effect sign, so the implied, plotted promotion effect signs are flipped.
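
As a toy illustration of that sign convention (the numbers here are made up):

```python
import torch

# Change in each token's logit when the dimension is ablated (ablated - clean).
ablation_effect = torch.tensor([0.8, -1.2, 0.1])

# A token whose logit *drops* under ablation is a token the dimension *promotes*,
# so the plotted promotion effects are the ablation effects with flipped sign.
promotion_effect = -ablation_effect
most_promoted = torch.argsort(promotion_effect, descending=True)   # blue tokens
most_suppressed = torch.argsort(promotion_effect)                  # red tokens
```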

The main thing I got out of reading Bostrom's Deep Utopia is a better appreciation of this "meaning of life" thing. I had never really understood what people meant by this, and always just rounded it off to people using lofty words for their given projects in life.

The book's premise is that, after the aligned singularity, the robots will not just be better at doing all your work but also be better at doing all your leisure for you. E.g., you'd never study for fun in posthuman utopia, because you could instead just ask the local benevolent god to painlessly... (read more)

1Nate Showell
The concept of "the meaning of life" still seems like a category error to me. It's an attempt to apply a system of categorization used for tools, one in which they are categorized by the purpose for which they are used, to something that isn't a tool: a human life. It's a holdover from theistic worldviews in which God created humans for some unknown purpose.   The lesson I draw instead from the knowledge-uploading thought experiment -- where having knowledge instantly zapped into your head seems less worthwhile acquiring it more slowly yourself -- is that to some extent, human values simply are masochistic. Hedonic maximization is not what most people want, even with all else being equal. This goes beyond simply valuing the pride of accomplishing difficult tasks, as such as the sense of accomplishment one would get from studying on one's own, above other forms of pleasure. In the setting of this thought experiment, if you wanted the sense of accomplishment, you could get that zapped into your brain too, but much like getting knowledge zapped into your brain instead of studying yourself, automatically getting a sense of accomplishment would be of lesser value. The suffering of studying for yourself is part of what makes us evaluate it as worthwhile.
2JBlack
I'm pretty sure that I would study for fun in the posthuman utopia, because I both value and enjoy studying, and a utopia that can't carry those values through seems like a pretty shallow imitation of a utopia. There won't be a local benevolent god to put that wisdom into my head, because I will be a local benevolent god with more knowledge than most others around. I'll be studying things that have only recently been explored, or that nobody has yet discovered. Otherwise, again, what sort of shallow imitation of a posthuman utopia is this?
6Garrett Baker
Many who believe in God derive meaning, despite God theoretically being able to do anything they can do but better, from the fact that He chose not to do the tasks they are good at, and left them tasks to try to accomplish. It's common for such people to believe that this meaning would disappear if God disappeared, but whenever such a person does come to no longer believe in God, they often continue to see meaning in their life[1]. Now atheists worry about building God because it may destroy all meaning to our actions. I expect we'll adapt. (edit: That is to say, I don't think you've adequately described what "meaning of life" is if you're worried about it going away in the situation you describe.)

----------------------------------------

1. If anything, they're more right than wrong: there has been much written about the "meaning crisis" we're in, possibly attributable to greater levels of atheism. ↩︎

I believe I and others here probably have a lot to learn from Chris, and arguments of the form "Chris confidently believes false thing X" are not really a crux for me about this.

Would you kindly explain this? Because you think some of his world-models independently throw out great predictions, even if other models of his are dead wrong?

4Scott Garrabrant
More like illuminating ontologies than great predictions, but yeah.

Use your actual morals, not your model of your morals.

0Vladimir_Nesov
Detailed or non-intuitive actual morals don't exist to be found and used, they can only be built with great care. None have been built so far, as no single human has lived for even 3000 years. The human condition curses all moral insight with Goodhart. What remains is scaling Pareto projects of locally ordinary humanism.

Crucially: notice if your environment is suppressing you feeling your actual morals, leaving you only able to use your model of your morals.

habryka130

That's a good line, captures a lot of what I often feel is happening when talking to people about utilitarianism and a bunch of adjacent stuff (people replacing their morals with their models of their morals)

I agree that stronger, more nuanced interpretability techniques should tell you more. But, when you see something like, e.g.,

25132 ▁vs, ▁differently, ▁compared, ▁greater, all, ▁per
25134 ▁I, ▁My, I, ▁personally

isn't it pretty obvious what those two autoencoder neurons were each doing?

4Charlie Steiner
It does seem obvious[1], but I think this can easily be misleading. Are these activation directions always looking for these tokens regardless of context, or are they detecting the human-obvious theme they seem to be gesturing towards, or are they playing a more complicated functional role that merely happens to be activated by those tokens in the first position? E.g., is the "▁vs, ▁differently, ▁compared" direction just a brute detector for those tokens? Or is it a more general detector for comparison and counting that would have rich but still human-obvious behavior on longer snippets? Or is it part of a circuit that needs to detect comparison words but is actually doing something totally different, like completing discussions about shopping lists?

1. ^ certainly more so than 

No, towards an $L_0$ value. $L_1$ is the training proxy for that, though.

4LawrenceC
Oh, okay, makes sense.

Epistemic status: Half-baked thought.

Say you wanted to formalize the concepts of "inside and outside views" to some degree. You might say that your inside view is a Bayes net or joint conditional probability distribution—this mathematical object formalizes your prior.

Unlike your inside view, your outside view consists of forms of deferring to outside experts. The Bayes nets that inform their thinking are sealed away, and you can't inspect these. You can ask outside experts to explain their arguments, but there's an interaction cost associated with inspecti... (read more)
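
(A toy version of the contrast, with made-up numbers: an inside view is a joint distribution you can condition however you like, whereas an expert's is a black box you can only query.)

```python
# Inside view: an explicit joint distribution over two binary claims, P(A, B).
# Because you hold the whole table, you can inspect and condition it freely.
joint = {
    (True, True): 0.30,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.40,
}

def p_a_given_b(joint, b=True):
    numerator = sum(p for (a, b_), p in joint.items() if a and b_ == b)
    denominator = sum(p for (_, b_), p in joint.items() if b_ == b)
    return numerator / denominator

print(p_a_given_b(joint))  # 0.6 -- an outside expert's table is sealed away from you
```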

(Great project!) I strongly second the RSS feed idea, if that'd be possible.

I think that many (not all) of your above examples boil down to optimizing for legibility rather than optimizing for goodness. People who hobnob instead of working quietly will get along with their bosses better than their quieter counterparts, yes. But a company of brown-nosers will be less productive than a competitor company of quiet, hardworking employees! So there's a cooperate/defect dilemma here.

What that suggests, I think, is that you generally shouldn't immediately defect as hard as possible, with regard to optimizing for appearances. Play the prev... (read more)

5lemonhope
Good point. I am concerned that adding even a dash of legibility screws the work over completely and immediately and invisibly rather than incrementally. I may have over-analyzed my data so I should probably return to the field to collect more samples.

The ML models that now speak English, and are rapidly growing in world-transformative capability, happen to be called transformers.

This is not a coincidence because nothing is a coincidence.

Two moments of growing in mathematical maturity I remember vividly:

  1. Realizing that equations are claims, and are therefore either true or false. Everything asserted with symbols... could just as well be asserted in English. I could start chunking up arbitrarily long and complicated equations between the equals signs, because those equals signs were just the English word "is"!
  2. Learning about the objects that mathematical claims are about. Going from having to look up "Wait, what's a real number again?" to knowing how , and  interrelat
... (read more)

I found it distracting that all your examples were topical, anti-red-tribe coded events. That reminded me of

In Artificial Intelligence, and particularly in the domain of nonmonotonic reasoning, there’s a standard problem: “All Quakers are pacifists. All Republicans are not pacifists. Nixon is a Quaker and a Republican. Is Nixon a pacifist?”

What on Earth was the point of choosing this as an example? To rouse the political emotions of the readers and distract them from the main question? To make Republicans feel unwelcome in courses on Artificial Intelligenc

... (read more)
2Duncan Sabien (Deactivated)
I don't subscribe to a stay-non-politicized discourse norm (often one tribe is Actually Being Worse, and I'm not going to handicap my ability to say true things, though it's worth doing so carefully), but I am also quite happy to edit to include anti-blue-tribe examples as you or others propose them. EDIT: Also, the linked FB post "get really mad about something stupid" is a blue tribe example, which I mention as evidence that I had previously written about the blue tribe being bad in this way all by itself. =)

2. The anchor of a major news network donates lots of money to organizations fighting against gay marriage, and in his spare time he writes editorials arguing that homosexuals are weakening the moral fabric of the country. The news network decides they disagree with this kind of behavior and fire the anchor.

a) This is acceptable; the news network is acting within their rights and according to their principles
b) This is outrageous; people should be judged on the quality of their work and not their political beliefs

12. The principal of a private school is a

... (read more)

Academic philosophers are better than average at evaluating object-level arguments for some claim. They don't seem to be very good at thinking about what rationalization in search implies about the arguments that come up. Compared to academic philosophers, rationalists strike me as especially good at appreciating filtered evidence and its significance for your world model.

If you find an argument for a claim easily, then even if that argument is strong, this (depending on some other things) implies that similarly strong arguments on the other side may turn up with n... (read more)

Modest spoilers for planecrash (Book 9 -- null action act II).

Nex and Geb had each INT 30 by the end of their mutual war.  They didn't solve the puzzle of Azlant's IOUN stones... partially because they did not find and prioritize enough diamonds to also gain Wisdom 27.  And partially because there is more to thinkoomph than Intelligence and Wisdom and Splendour, such as Golarion's spells readily do enhance; there is a spark to inventing notions like probability theory or computation or logical decision theory from scratch, that is not directly me

... (read more)

What sequence of characters could I possibly, actually type out into a computer that would appreciably reduce the probability that everything dies?

Framed like this, writing to save the world sounds impossibly hard! Almost everything written has no appreciable effect on our world's AI trajectory. I'm sure the "savior sequence" exists mathematically, but finding it is a whole different ballgame.

In the beginning God created four dimensions. They were all alike and indistinguishable from one another. And then God embedded atoms of energy (photons, leptons, etc.) in the four dimensions. By virtue of their energy, these atoms moved through the four dimensions at the speed of light, the only spacetime speed. Thus, as perceived by any one of these atoms, space contracted in, and only in, the direction of that particular atom's motion. As the atoms moved at the speed of light, space contracted so much in the direction of the atom's motion that the dimen

... (read more)

Past historical experience and brainstorming about human social orders probably barely scratches the possibility space. If the CEV were to weigh in on possible posthuman social orders,[1] optimizing in part for how cool that social order is, I'd bet what it describes blows what we've seen out of the water in terms of cool factor.

  1. ^

    (Presumably posthumans will end up reflectively endorsing interactions with one another of some description.)

Don't translate your values into just a loss function. Rather, translate them into a loss function and all the rest of a training story. Use all the tools at your disposal in your impossible task; don't tie one hand behind your back by assuming the loss function is your only lever over the AGI's learned values.

This post crystallized some thoughts that have been floating in my head, inchoate, since I read Zvi's stuff on slack and Valentine's "Here's the Exit."

Part of the reason that it's so hard to update on these 'creative slack' ideas is that we make deals among our momentary mindsets to work hard when it's work-time. (And when it's literally the end of the world at stake, it's always work-time.) "Being lazy" is our label for someone who hasn't established that internal deal between their varying mindsets, and so is flighty and hasn't precommitted to getting st... (read more)

4TurnTrout
+1. I've explained a less clear/expansive version of this post to a few people this last summer. I think there is often some internal value-violence going on when many people fixate on Impact.

A model I picked up from Eric Schwitzgebel.

The humanities used to be highest-status in the intellectual world!

But then, scientists quite visibly exploded fission weapons and put someone on the moon. It's easy to coordinate to ignore some unwelcome evidence, but not evidence that blatant. So, begrudgingly, science has been steadily accorded more and more status, from the postwar period on.

"Calling babble and prune the True Name of text generation is like calling bogosort the True Name of search."

In the 1920s when $\lambda$ and CL began, logicians did not automatically think of functions as sets of ordered pairs, with domain and range given, as mathematicians are trained to do today. Throughout mathematical history, right through to computer science, there has run another concept of function, less precise at first but strongly influential always; that of a function as an operation-process (in some sense) which may be applied to certain objects to produce other objects. Such a process can be defined by giving a set of rules describing how it acts

... (read more)
3Alexander Gietelink Oldenziel
There is a third important aspect of functions-in-the-original-sense that distinguishes them from extensional functions (i.e. collection of input-output pairs): effects. Describing these 'intensional' features is an active area of research in theoretical CS. One important thread here is game semantics; you might like to take a look: https://link.springer.com/chapter/10.1007/978-3-642-58622-4_1
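
A schematic illustration of the two notions (the particular function is arbitrary): extensionally, a function is its table of input-output pairs; intensionally, it is a rule or process, which can carry effects that the table never shows.

```python
# Extensional view: the function just *is* its (input, output) pairs.
square_table = {0: 0, 1: 1, 2: 4, 3: 9}

# Intensional view: a process applied to arguments, here with a side effect
# (a call counter) that is invisible in the extensional view.
call_count = 0

def square(n: int) -> int:
    global call_count
    call_count += 1
    return n * n

# The two agree on every input in the table, i.e. they are extensionally equal there,
# yet only one of them is an operation-process in the sense of the quote above.
assert all(square(k) == v for k, v in square_table.items())
```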

Given a transformer model, it's probably possible to find a reasonably concise energy function (probably of a similar OOM of complexity as the model weights themselves) whose minimization corresponds to executing forwards passes of the transformer. However, this [highly compressive] energy function wouldn't tell you much about what the personas simulated by the model "want" or how agentic they were, since the energy function is expressed in the ontology of model weights and activations, not an agent's beliefs / goals. [This has] the type signature of a uti

... (read more)
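
One way to see why some such energy function must exist (the trivial construction, not necessarily the concise one meant above): for a transformer $f_\theta$,

$$E_\theta(x, y) \;=\; \lVert y - f_\theta(x) \rVert^2, \qquad \arg\min_y E_\theta(x, y) \;=\; f_\theta(x),$$

so minimizing $E_\theta$ over $y$ reproduces the forward pass exactly, while remaining silent about what any simulated persona believes or wants.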

This is a great theorem that's stuck around in my head this last year! It's presented clearly and engagingly, but more importantly, the ideas in this piece are suggestive of a broader agent foundations research direction. If you wanted to intimate that research direction with a single short post that additionally demonstrates something theoretically interesting in its own right, this might be the post you'd share.

This post has successfully stuck around in my mind for two years now! In particular, it's made me explicitly aware of the possibility of flinching away from observations because they're normie-tribe-coded.

I think I deny the evidence on most of the cases of dogs generating complex English claims. But it was epistemically healthy for that model anomaly to be rubbed in my face, rather than filter-bubbled away plus flinched away from and ignored.

This is a fantastic piece of economic reasoning applied to a not-flagged-as-economics puzzle! As the post says, a lot of its content is floating out there on the internet somewhere: the draw here is putting all those scattered insights together under their common theory of the firm and transaction costs framework. In doing so, it explicitly hooked up two parts of my world model that had previously remained separate, because they weren't obviously connected.

2Davidmanheim
As I have now written in a below comment, see my 2016 Ribbonfarm post about this.

Complex analysis is the study of functions of a complex variable, i.e., functions $f(z)$ where $z$ and $f(z)$ lie in $\mathbb{C}$. Complex analysis is the good twin and real analysis the evil one: beautiful formulas and elegant theorems seem to blossom spontaneously in the complex domain, while toil and pathology rule in the reals. Nevertheless, complex analysis relies more on real analysis than the other way around.

--Pugh, Real Mathematical Analysis (p. 28)

One important idea I've picked up from reading Zvi is that, in communication, it's important to buy out the status cost imposed by your claims.

If you're fielding a theory of the world that, as a side effect, dunks on your interlocutor and diminishes their social status, you can work to get that person to think in terms of Bayesian epistemology and not decision theory if you make sure you aren't hurting their social image. You have to put in the unreasonable-feeling work of framing all your claims such that their social status is preserved or fairly increas... (read more)

Thanks -- right on both counts! Post amended.

An Inconsistent Simulated World

I regret to inform you, you are an em inside an inconsistent simulated world. By this, I mean: your world is a slapdash thing put together out of off-the-shelf assets in the near future (presumably right before a singularity eats that simulator Earth).

Your world doesn't bother emulating far-away events in great detail, and indeed, may be messing up even things you can closely observe. Your simulators are probably not tampering with your thoughts, though even that is something worth considering carefully.

What are the flaws you... (read more)

When another article of equal argumentative caliber could have just as easily been written for the negation of a claim, that writeup is no evidence for its claim.
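
In likelihood-ratio terms (writing $E$ for "such an article exists" and $H$ for the claim):

$$\frac{P(H \mid E)}{P(\lnot H \mid E)} \;=\; \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)},$$

and if an equally strong article could just as easily have been written either way, then $P(E \mid H) = P(E \mid \lnot H)$ and the posterior odds equal the prior odds: no update.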

Switching costs between different kinds of work can be significant. Give yourself permission to focus entirely on one kind of work per Schelling unit of time (per day), if that would help. Don't spend cognitive cycles feeling guilty about letting some projects sit on the backburner; the point is to get where you're going as quickly as possible, not to look like you're juggling a lot of projects at once.

This can be hard, because there's a conventional social expectation that you'll juggle a lot of projects simultaneously, maybe because that's more legible t... (read more)

A multiagent Extrapolated Volitionist institution is something that computes and optimizes for a Convergent Extrapolated Volition, if a CEV exists.

Really, though, the above Extrapolated Volitionist institutions do take other people into consideration. They either give everyone the Schelling weight of one vote in a moral parliament, or they take into consideration the epistemic credibility of other bettors as evinced by their staked wealth, or other things like that.

Sometimes the relevant interpersonal parameters can be varied, and the institutional designs... (read more)

Because your utility function is your utility function, the one true political ideology is clearly Extrapolated Volitionism.

Extrapolated Volitionist institutions are all characteristically "meta": they take as input what you currently want and then optimize for the outcomes a more epistemically idealized you would want, after more reflection and/or study.

Institutions that merely optimize for what you currently want the way you would with an idealized world-model are old hat by comparison!

1TAG
Since when was politics about just one person?

Stress and time-to-burnout are resources to be juggled, like any other.

“What is the world trying to tell you?”

I've found that this prompt helps me think clearly about the evidence shed by the generator of my observations.

As Gauss stressed long ago, any kind of singular mathematics acquires a meaning only as a limiting form of some kind of well-behaved mathematics, and it is ambiguous until we specify exactly what limiting process we propose to use. In this sense, singular mathematics has necessarily a kind of anthropomorphic character; the question is not what is it, but rather how shall we define it so that it is in some way useful to us?

--E. T. Jaynes, Probability Theory (p. 108)

Bogus nondifferentiable functions

The case most often cited as an example of a nondifferentiable function is derived from a sequence $\{f_n(x)\}$, each of which is a string of isosceles right triangles whose hypotenuses lie on the real axis and have length $1/n$. As $n \to \infty$, the triangles shrink to zero size. For any finite $n$, the slope of $f_n(x)$ is $\pm 1$ almost everywhere. Then what happens as $n \to \infty$? The limit $f_\infty(x)$ is often cited carelessly as a nondifferentiable function. Now it is clear that the limit of the derivativ

... (read more)
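
An explicit parameterization of the sequence Jaynes describes (my notation, not his): writing $s = x \bmod (1/n)$,

$$f_n(x) \;=\; \min\!\left(s,\ \tfrac{1}{n} - s\right), \qquad 0 \le f_n(x) \le \tfrac{1}{2n},$$

so $|f_n'(x)| = 1$ almost everywhere, yet $f_n \to 0$ uniformly: the limit function is the zero function, which is perfectly differentiable, and the limit of the derivatives is simply not its derivative.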
2David Udell

Epistemic status: politics, known mindkiller; not very serious or considered.

People seem to have a God-shaped hole in their psyche: just as people banded around religious tribal affiliations, they now, in the contemporary West, band together around political tribal affiliations. Intertribal conflict can be, at its worst, violent, on top of mindkilling. Religious persecution in the UK was one of the instigating causes of British settlers migrating to the American colonies; religious conflict in Europe generally was severe.

In the US, the 1st Amendment legall... (read more)

The explicit definition of an ordered pair $(a, b) = \{\{a\}, \{a, b\}\}$ is frequently relegated to pathological set theory...

It is easy to locate the source of the mistrust and suspicion that many mathematicians feel toward the explicit definition of ordered pair given above. The trouble is not that there is anything wrong or anything missing; the relevant properties of the concept we have defined are all correct (that is, in accord with the demands of intuition) and all the correct properties are present. The trouble is that the concept has some irreleva

... (read more)
3Alexander Gietelink Oldenziel
Modern type theory mostly solves this blemish of set theory and is conceptually highly economical to boot. Most of the adherence to set theory is historical inertia, though some aspects of coding & presentation are important. Future foundations will improve our understanding of this latter topic.

Social niceties and professionalism act as a kind of 'communications handshake' in ordinary society -- maybe because they're still a credible correlate of having your act together enough to be worth considering your outputs in the first place?

Very cool! I have noticed that in arguments in ordinary academia people sometimes object that "that's so complicated" when I take a lot of deductive steps. I hadn't quite connected this with the idea that:

If you're confident in your assumptions ( is small), or if you're unconfident in your inferences ( is big), then you should penalise slow theories moreso than long theories, i.e. you should be a T-type.

I.e., that holding a T-type prior is adaptive when even your deductive inferences are noisy.

Also, I take it that this row of your table:

DebateK
... (read more)
3Cleo Nardo
yep. amended.

FWIW, this post strikes me as a very characteristically 'Hansonian' insight.

2Dagon
'Hansonian' meaning it explores a change that is a funny departure from a current equilibrium, but doesn't explain why it's only that change which is possible, rather than changing an underlying inefficiency or a different dimension of the equilibrium? Why have paid time off (or allowed time off) at all?  Why not bid for time off, with different multipliers depending on how many are out at once?  At the very least, why not measure the correlates of expense of time-off and adjust accordingly, rather than just the intuition that for some teams, one member missing has a disproportionate cost and that should be fixed by enforced scheduling.

Now, whatever $A$ may assert, the fact that $A$ can be deduced from the axioms cannot prove that there is no contradiction in them, since, if there were a contradiction, $A$ could certainly be deduced from them!

This is the essence of the Gödel theorem, as it pertains to our problems. As noted by Fisher (1956), it shows us the intuitive reason why Gödel’s result is true. We do not suppose that any logician would accept Fisher’s simple argument as a proof of the full Gödel theorem; yet for most of us it is more convincing than Gödel’s

... (read more)
4JBlack
The text is slightly in error. It is straightforward to construct a program that is guaranteed to locate an inconsistency if one exists: just have it generate all theorems and stop when it finds an inconsistency. The problem is that it doesn't ever stop if there isn't an inconsistency. This is the difference between decidability and semi-decidability. All the systems covered by Gödel's completeness and incompleteness theorems are semi-decidable, but not all are decidable.
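
A sketch of the semi-decision procedure JBlack describes; `enumerate_theorems` and `negate` are hypothetical stand-ins for a real proof enumerator and negation operation:

```python
def find_inconsistency(enumerate_theorems, negate):
    """Halts if and only if the axiom system proves a contradiction.

    `enumerate_theorems()` is assumed to eventually yield every theorem (say, by
    breadth-first search over all proofs); `negate(s)` returns the negation of
    sentence `s`. If a sentence and its negation both turn up, we stop; otherwise
    this loop runs forever, which is exactly semi-decidability.
    """
    seen = set()
    for sentence in enumerate_theorems():
        if negate(sentence) in seen or any(negate(s) == sentence for s in seen):
            return sentence, negate(sentence)  # witnessed inconsistency
        seen.add(sentence)
```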

I know that the humans forced to smile are not happy (and I know all the mistakes they've made while programming me, I know what they should've done instead), but I don't believe that they are not happy.

These are different senses of "happy." It should really read:

I know forcing humans to smile doesn't make them happy₁, and I know what they should've written instead to get me to optimize for happy₁ as they intended, but they are happy₂.

They're different concepts, so there's no strangeness here. The AGI knows what you meant... (read more)

1Q Home
I know that there's no strangeness from the formal point of view. But that doesn't mean there's no strangeness in general, or that the situation isn't similar to the Moore paradox. Your examples are not 100% Moore statements either. Isn't the point of the discussion to find interesting connections between the Moore paradox and other things? I know that the classical way to formulate it is "AI knows, but doesn't care". I thought it may be interesting to formulate it as "AI knows, but doesn't believe". It may be interesting to think about what type of AI this formulation may be true for. For such an AI, alignment would mean resolving the Moore paradox. For example, imagine an AI with a very strong OCD to make people smile.

The human brain does not start out as an efficient reasoning machine, plausible or deductive. This is something which we require years to learn, and a person who is an expert in one field of knowledge may do only rather poor plausible reasoning in another. What is happening in the brain during this learning process?

Education could be defined as the process of becoming aware of more and more propositions, and of more and more logical relationships between them. Then it seems natural to conjecture that a small child reasons on a lattice of very open structur

... (read more)

Yeah, fair -- I dunno. I do know that an incremental improvement on simulating a bunch of people philosophizing in an environment is doing that while also running an algorithm that prevents coercion, e.g.

I imagine that the complete theory of these incremental improvements (for example, also not running a bunch of moral patients for many subjective years while computing the CEV) is the final theory we're after, but I don't have it.

4TekhneMakre
Like, encoding what "coercion" is would be an expression of values. It's more meta, and more universalizable, and stuff, but it's still something that someone might strongly object to, and so it's coercion in some sense. We could try to talk about what possible reflectively stable people / societies would consider as good rules for the initial reflection process, but it seems like there would be multiple fixed points, and probably some people today would have revealed preferences that distinguish those possible fixed points of reflection, still leaving open conflict.  Cf. https://www.lesswrong.com/posts/CzufrvBoawNx9BbBA/how-to-prevent-authoritarian-revolts?commentId=3LcHA6rtfjPEQne4N 