This is a special post for quick takes by Charlie Steiner.

Some thoughts on reading Superintelligence (2014). Overall it's been quite good, and nice to read through such a thorough overview even if it's not new to me. Weirdly, I've gotten comments that people often stop reading it partway through. What this puts me in mind of is a physics professor remarking to me that they used to find textbooks impenetrable, but now they find it quite fun to leaf through a new introductory textbook. And now my brain is relating this to the popularity of fanfiction that re-uses familiar characters and settings :P

By god, Nick Bostrom thinks in metaphors all the time. Not to imply that this is bad at all; in fact, it's very interesting.

The way the intelligence explosion kinetics is presented really could stand to be less one-dimensional about intelligence. Or rather, perhaps it should ask us to accept that there is some one-dimensional measure of capability that is growing superlinearly, which can then be parlayed into all the other things we care about via the "superpower"-style arguments that appear two chapters later.

Has progress on AI seemed to outpace progress on augmenting human intelligence since 2014? I think so, and perhaps this explains why Bostrom_2014 puts more emphasis on whole brain emulations. But perhaps not - perhaps instead I've/we've been unduly neglecting thinking about alternate paths to superintelligence in the last few years.

Human imitations (machine learning systems trained to reproduce the behavior of a human within some domain) seem conspicuously absent from Bostrom_2014's toolbox of parts to build an aligned AI out of. Is this a reflection of the times? My memory is blurry but... plausibly? If so, I think that's a pretty big piece of conceptual progress we've made.

When discussing what moral theory to give a superintelligence, Bostrom inadvertently makes a good case for another piece of conceptual progress since 2014 - our job is not to find The One Right Moral Theory; it is both more complicated and easier than that (see Scott, Me). Hopefully this notion has percolated around enough that this chapter would get written differently today. Or is this still something we don't have consensus on among Serious Types? Then again, it was already Bostrom who coined the term MaxiPOK - maybe the functional difference isn't too large.

I would have thought "maybe the CEV of humanity would just shut itself down" would be uttered with more alarm.

In conclusion, footnotes are superior to endnotes.

About a year ago, I made a bet - my $50,000 against their $1000 - that we wouldn't see slam-dunk evidence of UFOs/UAPs being the result of aliens, the supernatural, simulations, or anything similarly non-mundane.

What's changed 1 year on? Well, I think a year ago UAPs and aliens were more in the news, between governmental hearings in several countries, a whistleblower ex-USAF intelligence official, and continuing coverage of navy UAP tapes. None of that has led anywhere, and it's mostly fading from public memory.

You can find people currently claiming that the big reveal that breaks it all wide open is just around the corner. But you can basically always find people claiming that. While doing some quick searching before making this comment, though, I did find out that a congressman from Tennessee is a big believer that the extraordinarily fake-looking aliens exhibited in Mexico last year are super important and need to be investigated at U Tennessee.

If they were investigated, and turned out to have non-terrestrial biological structure, that's definitely a way I could end up paying out. I estimate the probability of this at about 0.00000000000000001.

It's feasible to establish AGI-run governance that does nothing on its own other than permanently and irrevocably but unobtrusively restrict the level of technological development of every civilization it reaches, including its own builders (perhaps as a way of opposing extinction risk). This leads to strange ancient cultures of biological, low-tech aliens that slowly travel the galaxy, arriving much later than the initial wave of von Neumann probes from the technological-development-restricting AGI.

This is still unlikely, as the outcome both wastes the cosmic endowment and requires sufficient technical sophistication to make it stable and irrevocable. So the builders of this AGI governance need to both decently understand alignment and target an outcome that radically impairs their future. But this seems only Fermi-paradox unlikely, not literal-magic unlikely. The fraudulent nature of the "evidence" we see further reduces the probability that this is the case, as low-tech aliens could instead be making themselves known in straightforward ways, while the high-tech AGI that restricts tech doesn't need to be observable at all. But this doesn't go all the way to impossibility, as an ancient low-tech culture could have traditions and bureaucracy cashing out in a bizarre first contact process.

The prediction of this hypothesis is that we don't get to develop unrestricted ASI of our own. Given the inscrutable nature of current models (or, equivalently, our lack of the technical sophistication needed to know what we're doing), any interventions don't yet need to be humanly observable.

What do you use to shoot an acausal blackmailer?

 

 

Depleted Platonium rounds.

There are possible worlds where you can blackmail the blackmailer using the fact that you know they blackmailed you.

A fun thought about nanotechnology that might only make sense to physicists: in terms of their correlation functions, CPUs are crystalline solids, but eukaryotic cells are liquids. I think a lot of people imagine future nanotechnology as made of solids, but given the apparent necessity of using diffusive transport, nanotechnology seems more likely to be statistically liquid.

(For non-physicists: the building blocks of a CPU are arranged in regular patterns even over long length scales. But the building blocks of a cell are just sort of diffusing around in a water solvent. This can be formalized by a "correlation function," basically how correlated the positions of the building blocks are - how much knowing the position of one lets you pin down the position of another. The positions of transistors are correlated with each other over length scales as long as the entire silicon wafer, but the position of one protein in a cell tells you only a little bit about nearby proteins, and almost nothing about the positions of proteins far away.)
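(And for anyone who wants to see the contrast concretely, here's a toy 2D sketch of a pair correlation function in Python. It ignores boundary effects and isn't the full liquid-state formalism; the lattice and random point sets are just stand-ins for transistors and proteins.)

```python
import numpy as np

def pair_correlation(points, box_size, r_max, dr):
    """Toy 2D radial pair-correlation function g(r), ignoring boundary effects.

    For a crystal, g(r) keeps sharp peaks out to large r (long-range order);
    for a liquid, the peaks wash out after a few neighbor shells.
    """
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dists = dists[np.triu_indices(n, k=1)]           # unique pairs only
    edges = np.arange(dr, r_max + dr, dr)
    counts, edges = np.histogram(dists, bins=edges)
    shell_area = 2 * np.pi * edges[:-1] * dr         # area of each thin annulus
    density = n / box_size**2
    expected = (n / 2) * density * shell_area        # ideal-gas expectation
    return edges[:-1], counts / expected

# "Crystal": points on a square lattice.
# "Liquid": uniform random points (really an ideal gas, but it shows the lack of long-range order).
L = 20
crystal = np.array([(i, j) for i in range(L) for j in range(L)], dtype=float)
liquid = np.random.rand(L * L, 2) * L
r, g_crystal = pair_correlation(crystal, L, r_max=8.0, dr=0.1)
r, g_liquid = pair_correlation(liquid, L, r_max=8.0, dr=0.1)
```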

Dictionary/SAE learning on model activations is a poor fit for anomaly detection, because you need to train the dictionary on a dataset, which means the anomaly needed to already be in the training set.

How to do dictionary learning without a dataset? One possibility is to use uncertainty-estimation-like techniques to detect when the model "thinks it's on-distribution" for randomly sampled activations.

You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you'd see if there's a crucial missing feature)
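A minimal sketch of that scoring rule (the `encode`/`decode` interface and the quantile threshold are assumptions, not anything from a particular SAE library):

```python
import torch

def reconstruction_anomaly_scores(sae, acts, reference_acts, quantile=0.999):
    """Flag activations the SAE reconstructs unusually badly.

    `sae.encode` / `sae.decode` are assumed interfaces; the threshold comes from
    the reconstruction-error distribution on "normal" reference activations.
    """
    with torch.no_grad():
        def err(x):
            return ((sae.decode(sae.encode(x)) - x) ** 2).mean(dim=-1)
        threshold = torch.quantile(err(reference_acts), quantile)
        scores = err(acts)
    return scores, scores > threshold
```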

Would you expect this to outperform doing the same thing with a non-sparse autoencoder (that has a lower latent dimension than the NN's hidden dimension)? I'm not sure why it would, given that we aren't using the sparse representations except to map them back (so any type of capacity constraint on the latent space seems fine). If dense autoencoders work just as well for this, they'd probably be more straightforward to train? (unless we already have an SAE lying around from interp anyway, I suppose)

A regular AE's job is to throw away the information outside some low-dimensional manifold; a sparse ~linear AE's job is to throw away the information not represented by sparse dictionary codes. (Also a low-dimensional manifold, I guess, just one made from a different prior.)

If an AE is reconstructing poorly, that means it was throwing away a lot of information. How important that information is seems like a question about which manifold the underlying network "really" generalizes according to. And also what counts as an anomaly / what kinds of outliers you're even trying to detect.

Ah, yeah, that makes sense.

I think this is an important point, but IMO there are at least two types of candidates for using SAEs for anomaly detection (in addition to techniques that make sense for normal, non-sparse autoencoders):

  1. Sometimes, you may have a bunch of "untrusted" data, some of which contains anomalies. You just don't know which data points have anomalies on this untrusted data. (In addition, you have some "trusted" data that is guaranteed not to have anomalies.) Then you could train an SAE on all data (including untrusted) and figure out what "normal" SAE features look like based on the trusted data.
  2. Even for an SAE that's been trained only on normal data, it seems plausible that some correlations between features would be different for anomalous data, and that this might work better than looking for correlations in the dense basis. As an extreme version of this, you could look for circuits in the SAE basis and use those for anomaly detection.

Overall, I think that if SAEs end up being very useful for mech interp, there's a decent chance they'll also be useful for (mechanistic) anomaly detection (a lot of my uncertainty about SAEs applies to both possible applications). Definitely uncertain though, e.g. I could imagine SAEs that are useful for discovering interesting stuff about a network manually, but whose features aren't the right computational units for actually detecting anomalies. I think that would make SAEs less than maximally useful for mech interp too, but probably non-zero useful.

Even for an SAE that's been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.

Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I'm uncertain if it's going to be weak against adversarial anomalies relative to regular ol' random anomalies.
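For the weaker "feature correlations" version (short of full circuit-finding), here's one crude sketch of what it could mean. The `sae.encode` interface is assumed, and squared Mahalanobis distance is just a stand-in for "this feature pattern is unusual"; inverting the covariance is only tractable for a modest number of features.

```python
import torch

def fit_feature_stats(sae, trusted_acts, eps=1e-3):
    """Mean and (regularized) inverse covariance of SAE features on trusted data."""
    with torch.no_grad():
        feats = sae.encode(trusted_acts)                   # (n_examples, n_features)
        mu = feats.mean(dim=0)
        cov = torch.cov(feats.T) + eps * torch.eye(feats.shape[1])
    return mu, torch.linalg.inv(cov)

def correlation_anomaly_scores(sae, acts, mu, cov_inv):
    """Score each activation by how far its feature pattern sits from the trusted statistics."""
    with torch.no_grad():
        d = sae.encode(acts) - mu
        return torch.einsum("ni,ij,nj->n", d, cov_inv, d)  # squared Mahalanobis distance
```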

Congrats to Paul on getting appointed to NIST AI safety.

Can't find a reference that says it has actually happened already.

Oh, maybe I've jumped the gun then. Whoops.

I think you can steelman Ben Goertzel-style worries about near-term amoral applications of AI being bad "formative influences" on AGI, but mostly under a continuous takeoff model of the world. If AGI is a continuous development of earlier systems, then maybe it shares some datasets and learned models with earlier AI projects, and definitely it shares the broader ecosystems of tools, dataset-gathering methodologies, model-evaluating paradigms, and institutional knowledge on the part of the developers. If the ecosystem in which this thing "grows up" is one that has previously been optimized for marketing, or military applications, or what have you, this is going to have ramifications in how the first AGI projects are designed and what they get exposed to. The more continuous you think the development is going to be, the more this can be intervened on by trying to make sure that AI is pro-social even in the short term.

Interpretability as an RLHF problem seems like something to do.

Idea: The AI of Terminator.

One of the formative books of my childhood was The Physics of Star Trek, by Lawrence Krauss. Think of it sort of like xkcd's What If?, except all about physics and getting a little more into the weeds.

So:

Robotics / inverse kinematics. Voice recognition, language models, and speech synthesis. Planning / search. And of course, self-improvement, instrumental convergence, existential risk.

To make this work you'd need to already be pretty well-suited. The Physics of Star Trek was Krauss' third published book, and he got Stephen Hawking to write the foreword.

There's a point by Stuart Armstrong that anthropic updates are non-Bayesian, because you can think of Bayesian updates as deprecating improbable hypotheses and renormalizing, while anthropic updates (e.g. updating on "I think I just got copied") require increasing the probability of previously unlikely hypotheses.

In the last few years I've started thinking "what would a Solomonoff inductor do?" more often about anthropic questions. So I just thought about this case, and I realized there's something interesting (to me at least).

Suppose we're in the cloning version of Sleeping Beauty. So if the coin landed Heads, the sign outside the room will say Room A; if the coin landed Tails, the sign could say either Room A or Room B. Normally I translate this into Solomonoff-ese by treating different hypotheses about the world as Turing machines that could reproduce my memories. Each hypothesis is like a simulation of the world (with some random seed), plus some rule for specifying what physical features (i.e. my memories) to read to the output tape. So "Heads in Room A," "Tails in Room A," and "Tails in Room B" are three different Turing machines that reproduce the sequence of my memories by simulating a physical universe.

Where's the non-Bayesian-ness? Well, it's because before getting copied, but after the coin was flipped, there were supposed to only be two clumps of hypotheses - let's call them "Heads and walking down the hallway" and "Tails and walking down the hallway." And so before getting copied you assign only 50% to Tails. And so P(Tails|memories after getting copied) is made of two hypotheses while P(Tails|memories before getting copied) is only made of one hypothesis, so the extra hypothesis got bumped up non-Bayesianly.
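To put toy numbers on that bump (assuming, just for concreteness, that the three post-copying Turing machines get roughly equal weight in the mixture):

$$P(\text{Tails} \mid \text{memories before copying}) = \tfrac{1}{2}, \qquad P(\text{Tails} \mid \text{memories after copying}) \approx \tfrac{2}{3},$$

so "Tails in Room B" goes from carrying no weight at all to about 1/3 - a jump you can't produce by only discarding hypotheses and renormalizing.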

But... hang on. Why can't "Tails in Room B" get invoked as a second hypothesis for my memories even before getting copied? After all, it's just as good at reproducing my past memories - the sequence just happens to continue on a bit.

What I think is happening here is that we have revealed a difference between actual Solomonoff induction, and my intuitive translation of hypotheses into Turing machines. In actual (or at least typical) Solomonoff induction, it's totally normal if the sequence happens to continue! But in the hypotheses that correspond to worlds at specific points in time, the Turing machine reads out my current memories and only my current memories. At first blush, this seems to me to be an arbitrary choice - is there some reason why it's non-arbitrary?

We could appeal to the imperfection of human memory (so that only current-me has my exact memories), but I don't like the sound of that, because anthropics doesn't seem like it should undergo a discontinuous transition if we get better memory.

Do you have a link to that argument? I think Bayesian updates include either reducing a prior or increasing it, and then renormalizing all related probabilities. Many updatable observations take the form of replacing an estimate of future experience (I will observe sunshine tomorrow) by a 1 or 0 (I did or did not observe that, possibly not quite 0 or 1 if you want to account for hallucinations and imperfect memory).

Anthropic updates are either Bayesian or impossible. The underlying question remains "how does this experience differ from my probability estimate?" For Bayes or for Solomonoff, one has to answer "what has changed for my prediction? In what way am I surprised and have to change my calculation?"

https://www.alignmentforum.org/posts/iNi8bSYexYGn9kiRh/paradoxes-in-all-anthropic-probabilities I think?

I have a totally non-Solomonoff explanation of what's going on, which actually goes full G.K. Chesterton - I assign anthropic probabilities because I don't assume that my not waking is impossible. But I'm not sure how a Solomonoff inductor would see it.

Will the problem of logical counterfactuals just solve itself with good model-building capabilities? Suppose an agent has knowledge of its own source code, and wants to ask the question "What happens if I take action X?" where its source code provably does not actually do X.

A naive agent might notice the contradiction and decide that "What happens if I take action X?" is a bad question, or a question where any answer is true, or a question where we have to condition on cosmic rays hitting transistors at just the right time. But we want a sophisticated agent to be able to be aware of the contradiction and yet go on to say "Ah, but what I meant wasn't a question about the real world, but a question about some simplified model of the world that lumps all of the things that would normally be contradictory about this into one big node - I take action X, and also my source code outputs X, and also maybe even the programmers don't immediately see X as a bug."

Of course, the sophisticated agent doesn't have to bother saying that if it already makes plans using simplified models of the world that lump things together etc. etc. Its planning will thus implicitly deal with logical counterfactuals, and if it does verbal reasoning that taps into these same models, it can hold a conversation about logical counterfactuals. This seems pretty close to how humans do it.

Atheoretically building an agent that is good at making approximate models would therefore "accidentally" be a route to solving logical counterfactuals. But maybe we can do theory to this too: a theorem about logical counterfactuals is going to be a theorem about processes for building approximate models of the world - which it actually seems plausible to relate back to logical inductors and the notion of out-planning "simple" agents.

It seems like there's room for a theory of logical-inductor-like agents with limited computational resources, and I'm not sure if this has already been figured out. The entire trick seems to be that when you try to build a logical inductor agent, it's got some estimation process for math problems like "what does my model predict will happen?" and it's got some search process to find good actions, and you don't want the search process to be more powerful than the estimator, because then it will find edge cases. In fact, you want them to be linked somehow, so that the search process is never in the position of taking advantage of the estimator's mistakes - if you, a human, are making some plan and notice a blind spot in your predictions, you don't "take advantage" of yourself; you do further estimating as part of the search process.

The hard part is formalizing this handwavy argument, and figuring out what other strong conditions need to be met to get nice guarantees like bounded regret.

Charlie's easy and cheap home air filter design.

Ingredients:

MERV-13 fabric, cut into two disks (~35 cm diameter) and one long rectangle (16 cm by 110 cm).

Computer fan - I got a be quiet BL047. 

Cheap plug-in 12V power supply

Hot glue

Instructions:

Splice the computer fan to the power supply. When you look at the 3-pin fan connector straight on and put the bumps on the connector on the bottom, the wire on the right is ground and the wire in the middle is 12V. Do this first so you are absolutely sure which way the fan blows before you hot glue it.

Hot glue the long edge of the rectangle to the edge of one disk. The goal is to start a short cylinder (to be completed once you put on the second disk). Use plenty of glue on the spots where you have to pleat the fabric to get it to conform.
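(A quick sanity check on the dimensions in the ingredients list: the rectangle is presumably cut 110 cm long because that's one full wrap around a 35 cm diameter disk, since $C = \pi d = \pi \times 35\,\text{cm} \approx 110\,\text{cm}$.)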

Place the fan centered on the seam of the long rectangle (the side of the cylinder), oriented to blow inwards, and tack it down with some hot glue. Then cut away a circular hole for the air to get in and hot glue the circular housing around the fan blade to the fabric.

Hot glue the other disk on top to complete the cylinder.

Power on and enjoy.

AI that's useful for nuclear weapon design - or better yet, a clear trendline showing that AI will soon be useful for nuclear weapon design - might be a good way to get governments to put the brakes on AI.

Humans using SAEs to improve linear probes / activation steering vectors might quickly get replaced by a version of probing / steering that leverages unlabeled data.

Like, probing is finding a vector along which labeled data varies, and SAEs are finding vectors that are a sparse basis for unlabeled data. You can totally do both at once - find a vector along which labeled data varies and is part of a sparse basis for unlabeled data.
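Here's a minimal sketch of what that combined objective could look like (my own toy construction: the shapes, loss coefficients, and the "atom 0 doubles as the probe direction" choice are all placeholders, not an established method):

```python
import torch
import torch.nn.functional as F

# Toy data standing in for model activations; all shapes are made up.
d_model, n_atoms = 64, 256
unlabeled = torch.randn(4096, d_model)             # unlabeled activations
labeled = torch.randn(256, d_model)                # activations with binary labels
labels = torch.randint(0, 2, (256,)).float()

encoder = torch.nn.Linear(d_model, n_atoms)        # produces sparse codes
D = torch.nn.Parameter(0.02 * torch.randn(n_atoms, d_model))  # dictionary of directions
bias = torch.nn.Parameter(torch.zeros(1))          # probe bias
opt = torch.optim.Adam([D, bias, *encoder.parameters()], lr=1e-3)

for step in range(1000):
    # SAE-style objective on unlabeled data: sparse codes that reconstruct activations.
    codes = F.relu(encoder(unlabeled))
    sae_loss = F.mse_loss(codes @ D, unlabeled) + 1e-3 * codes.abs().mean()

    # Probe objective on labeled data: labels should be linearly readable off atom 0,
    # so that direction is simultaneously a dictionary element and the probe/steering vector.
    logits = labeled @ D[0] + bias
    probe_loss = F.binary_cross_entropy_with_logits(logits, labels)

    opt.zero_grad()
    (sae_loss + probe_loss).backward()
    opt.step()
```

The relative weights of the sparsity penalty and the probe loss would control how much the probe direction gets to drift away from being a "natural" dictionary atom.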

This is a little bit related to an idea with the handle "concepts live in ontologies." If I say I'm going to the gym, this concept of "going to the gym" lives in an ontology where people and activities are basic components - it's probably also easy to use ideas like "You're eating dinner" in that ontology, but not "1,3-diisocyanatomethylbenzene." When you try to express one idea, you're also picking a "basis" for expressing similar ideas.

Trying to get to a good future by building a helpful assistant seems less good than it did a month ago, because the risk is more salient that clever people in positions of power may coopt helpful assistants to amass even more power.

One security measure against this is reducing responsiveness to the user, and increasing the amount of goal information that's put into large finetuning datasets that have lots of human eyeballs on them.

Should government regulation on AI ban using reinforcement learning with a target of getting people to do things that they wouldn't endorse in the abstract (or some similar restriction)?

E.g. should using RL to make ads that maximize click-through be illegal?

Just looked up Aligned AI (the Stuart Armstrong / Rebecca Gorman show) for a reference, and it looks like they're publishing blog posts:

E.g. https://www.aligned-ai.com/post/concept-extrapolation-for-hypothesis-generation

https://venturebeat.com/2021/09/27/the-limitations-of-ai-safety-tools/

This article makes a persuasive case that the existence of different sorts of safety research can be confusing to keep track of if you're a journalist (and journalists are not so different from policymakers or members of the public).

https://www.sciencedirect.com/science/article/abs/pii/S0896627321005018

(biorxiv https://www.biorxiv.org/content/10.1101/613141v2 )

Cool paper on trying to estimate how many parameters neurons have (h/t Samuel at EA Hotel). I don't feel like they did a good job distinguishing how hard it was for them to fit nonlinearities that would nonetheless be the same across different neurons, versus the number of parameters that were different from neuron to neuron. But just based on differences in physical arrangement of axons and dendrites, there's a lot of opportunity for diversity, and I do think the paper was convincing that neurons are sufficiently nonlinear that this structure is plausibly important. The question is how much neurons undergo selection based on this diversity, or even update their patterns as a form of learning!

Back in the "LW Doldrums" c. 2016, I thought that what we needed was more locations - a welcoming (as opposed to heavily curated a la old AgentFoundations), LW-style forum solely devoted to AI alignment, and then the old LW for the people who wanted to talk about human rationality.

This philosophy can also be seen in the choice to make the AI Alignment Forum a sister site to LW2.0.

However, what actually happened is that we now have non-LW forums for SSC readers who want to talk about politics, SSC readers who want to talk about human rationality, and people who want to talk about effective altruism. And meanwhile, LW2.0 and the alignment forum have sort of merged into one forum that is mostly talking about AI alignment but sometimes also has posts on COVID, EA, people's personal lives, and economics, and more rarely human rationality. Honestly, it's turned out pretty good.