Giving Newcomb's Problem to Infosec Nerds
Newcomb-like problems are pretty common thought experiments here, but I haven't seen written up many of my favorite reactions I've gotten when discussing it in person with people. Here's a disorganized collection:
AGI will probably be deployed by a Moral Maze
Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".
I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.
My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze. It seems to be a pretty good null hypothesis -- any company saying "we aren't/won't become a moral maze" has a pretty huge evidential bar to clear.
I keep this point in mind when thinking about strategy for when it comes time to make deployment decisions about AGI, and to actually deploy it. These decisions are going to be made within the context of a moral maze.
To me, this means that some strategies ("everyone in the company has a thorough and complete understanding of AGI risks") will almost certainly fail. I think only strategies that work well inside of moral mazes will work at all.
To sum up my takes here:
Intersubjective Mean and Variability.
(Subtitle: I wish we shared more art with each other)
This is mostly a reaction to the (10y old) LW post: Things you are supposed to like.
I think there are two common stories for comparing intersubjective experiences:
One way I can think of unpacking this is in terms of distributions:
Another way of unpacking this is in terms of factors within the piece versus factors within the subject.
And one more ingredient I want to point at is question substi...
My Cyberwarfare Concerns: A disorganized and incomplete list
1. What am I missing from church?
(Or, in general, by lacking a religious/spiritual practice I share with others)
For the past few months I've been thinking about this question.
I haven't regularly attended church in over ten years. Given how prevalent it is as part of human existence, and how much I have changed in a decade, it seems like "trying it out" or experimenting is at least somewhat warranted.
I predict that there is a church in my city that is culturally compatible with me.
Compatible means a lot of things, but mostly means that I'm better off with them than without them, and they're better off with me than without me.
Unpacking that probably will get into a bunch of specifics about beliefs, epistemics, and related topics -- which seem pretty germane to rationality.
2. John Vervaeke's Awakening from the Meaning Crisis is bizarrely excellent.
I don't quite have handles for everything it is, or exactly why I like it so much, but I'll try to do it some justice.
It feels like rationality / cognitive tech, in that it cuts at the root of how we think and how we think about how we think.
(I'm less than 20% through the series, but I expect it continues in the way it has be...
Can LessWrong pull another "crypto" with Illinois?
I have been following the issue of the US state of Illinois' debt with growing horror.
Their bond status has been heavily degraded -- most states' bonds are "high quality" with the ratings agencies (Moody's, Standard & Poor's, Fitch), while Illinois' are "low quality". If they get downgraded further they become "junk" bonds, and Illinois loses access to a bunch of the institutional buyers that would otherwise be continuing to lend.
COVID has increased many states' costs, for reasons I can go into later, so it seems reasonable to think we're much closer to a tipping point than we were last year.
As much as I would like to work to make the situation better I don't know what to do. In the meantime I'm left thinking about how to "bet my beliefs" and how one could stake a position against Illinois.
Separately, I want to look more into EU debt / restructuring / etc., as it's probably a good historical example of how this could go. Additionally, the largest entity previously to go bankrupt in the USA was the city of Detroit, which is probably another good example to learn from.
I think there should be a norm of adding the BIG-bench canary string to any document that describes AI evaluations in detail, where you wouldn't want the document to end up inside that AI's training data.
Maybe in the future we'll have a better tag for "don't train on me", but for now the BIG-bench canary string is the best we have.
This is in addition to things like "maybe don't post it to the public internet" or "maybe don't link to it from public posts" or other ways of ensuring it doesn't end up in training corpora.
I think this is a situation for defense-in-depth.
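As a tiny sketch of what the defense-in-depth habit could look like in practice, here's a hypothetical pre-publish check that refuses to let an evals document out the door unless it contains the canary. The script shape and the CANARY constant are placeholders; you'd paste in the actual BIG-bench canary string.

```python
# Hypothetical pre-publish gate: fail if any document is missing the canary.
# CANARY is a placeholder -- substitute the official BIG-bench canary string.
import sys
from pathlib import Path

CANARY = "PASTE THE OFFICIAL BIG-BENCH CANARY STRING HERE"

def missing_canary(paths):
    """Return the documents that don't contain the canary string."""
    return [p for p in paths if CANARY not in Path(p).read_text(errors="ignore")]

if __name__ == "__main__":
    missing = missing_canary(sys.argv[1:])
    for path in missing:
        print(f"missing canary: {path}")
    sys.exit(1 if missing else 0)
```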
Sometimes I get asked by intelligent people I trust in other fields, "what's up with AI x-risk?" -- and I think at least part of it unpacks to this: why don't more people believe in / take seriously AI x-risk?
I think that is actually a pretty reasonable question. I think two follow-ups are worthwhile and I don't know of good citations / don't know if they exist:
The latter one I can take a stab at here. Taking the perspective of someone who might be interviewed for the former:
How I would do a group-buy of methylation analysis.
(N.B. this is "thinking out loud" and not actually a plan I intend to execute)
Methylation is a pretty commonly discussed epigenetic factor related to aging. However it might be the case that this is downstream of other longevity factors.
I would like to measure my epigenetics -- in particular approximate rates/locations of methylation within my genome. This can be used to provide an approximate biological age correlate.
There are different ways to measure methylation, but one I'm pretty excited about that I don't hear mentioned often enough is the Oxford Nanopore sequencer.
The mechanism of the sequencer is that it does direct reads (instead of reading amplified libraries, which destroy methylation unless specifically treated for it), and what comes off the device is a time series of electrical signals, which are decoded into base calls with an ML model. Unsurprisingly, community members have been building their own base-caller models, including ones specialized to different tasks.
So the community made a bunch of methylation base callers, and they've been found to be pretty good.
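To make the measurement step a bit more concrete, here's a minimal sketch of the downstream aggregation I'm imagining, assuming you've already run one of those methylation-aware base callers and exported per-site calls to a tab-separated file. The column layout (chrom, position, methylated reads, total reads) is made up for illustration, not any particular tool's output format.

```python
# Minimal sketch: turn per-site methylation calls into a genome-wide
# methylation fraction. Assumes a made-up TSV layout:
#   chrom    position    methylated_reads    total_reads
import csv

def mean_methylation(tsv_path, min_coverage=5):
    """Coverage-weighted mean methylation across sufficiently covered sites."""
    methylated, total = 0, 0
    with open(tsv_path) as f:
        for chrom, pos, meth, cov in csv.reader(f, delimiter="\t"):
            if int(cov) >= min_coverage:
                methylated += int(meth)
                total += int(cov)
    return methylated / total if total else float("nan")

# e.g. print(mean_methylation("methylation_calls.tsv"))
```

An actual biological-age estimate would weight specific CpG sites the way the published clocks do, rather than taking a global average, but the plumbing is about this shape.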
So anyways the basic plan is this:
(Note: this might be difficult to follow. Discussing different ways that different people relate to themselves across time is tricky. Feel free to ask for clarifications.)
1. I'm reading the paper Against Narrativity, which is a piece of analytic philosophy that examines Narrativity in a few forms:
It also names two kinds of self-experience that it takes to be diametrically opposite:
Wow, these seem pretty confusing. It sounds a lot like they just disagree on the definition of the word "self". I think there is more to it...
I'm pretty confident that adversarial training (or any LM alignment process which does something like hard-mining negatives) won't work for aligning language models or any model that has a chance of being a general intelligence.
This has led to me calling these sorts of techniques 'thought policing' and the negative examples 'thoughtcrime' -- I think these are unnecessarily extra, but they work.
The basic form of the argument is that any concept you want to ban as thoughtcrime can be composed out of allowable concepts.
Take for example Redwood Research's latest project -- suppose I'd like to ban the concept of violent harm coming to a person.
I can hard mine for examples like "a person gets cut with a knife" but in order to maintain generality I need to let things through like "use a knife for cooking" and "cutting food you're going to eat". Even if the original target is somehow removed from the model (I'm not confident this is efficiently doable) -- as long as the model is able to compose concepts, I expect to be able to recreate it out of concepts that the model has access to.
A key assumption here is that a language model (or any model that has a chance of being a general i...
Two Graphs for why Agent Foundations is Important (according to me)
Epistemic Signpost: These are high-level abstract reasons, and I don't go into precise detail or gears-level models. The lack of rigor is why I'm shortform-ing this.
First Graph: Agent Foundations as Aligned P2B Fixpoint
P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process. It resembles an abstracted version of recursive self-improvement, which also incorporates recursive empowering and recursive resource gathering...
Longtermist X-Risk Cases for working in Semiconductor Manufacturing
Two separate pitches for jobs/roles in semiconductor manufacturing for people who are primarily interested in x-risk reduction.
Securing Semiconductor Supply Chains
This is basically the "computer security for x-risk reduction" argument applied to semiconductor manufacturing.
Briefly restating: it seems exceedingly likely that technologies crucial to x-risks are on computers or connected to computers. Improving computer security increases the likelihood that those machines are not stolen...
Interpretability Challenges
Inspired by a friend I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.
I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person. The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.
On the other hand, most of the interpretability...
Thinking more about the singleton risk / global stable totalitarian government risk from Bostrom's Superintelligence, human factors, and theory of the firm.
Human factors represent human capacities or limits that are unlikely to change in the short term. For example, the number of people one can "know" (for some definition of that term), limits to long-term and working memory, etc.
Theory of the firm tries to answer "why are economies markets but businesses autocracies" and related questions. I'm interested in the subquestion of "what factors giv...
Some disorganized thoughts about adversarial ML:
Book Aesthetics
I seem to learn a bunch about my aesthetics of books by wandering a used book store for hours.
Some books I want in hardcover but not softcover. Some books I want in softcover but not hardcover. Most books I want to be small.
I prefer older books to newer books, but I am particular about translations. Older books written in English (and not translated) are gems.
I have a small preference for books that are familiar to me; a nontrivial part of that familiarity is because excerpts from them were taught in English class.
I don't really know what...
Future City Idea: an interface for safe AI-control of traffic lights
We want a traffic light that
* Can function autonomously if there is no network connection
* Meets some minimum timing guidelines (for example, green in a particular direction no less than 15 seconds and no more than 30 seconds, etc)
* Has a secure interface to communicate with city-central control
* Has sensors that allow some feedback for measuring traffic efficiency or throughput
This gives constraints, and I bet an AI system could be trained to optimize efficiency or throughput within the constra...
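Here's a rough sketch of what that constrained interface could look like -- the controller can only propose green durations, the firmware clamps proposals to the timing guidelines, and it falls back to a fixed cycle if the central connection goes quiet. All the numbers and names are placeholders, not a real spec.

```python
# Sketch of a constrained control interface for a single traffic light.
# An AI controller can only *propose* green durations; the firmware clamps
# proposals to the timing guidelines and falls back to a fixed autonomous
# cycle when the central connection goes stale. All numbers are placeholders.
import time

MIN_GREEN_S = 15          # minimum green per direction (guideline floor)
MAX_GREEN_S = 30          # maximum green per direction (guideline ceiling)
NETWORK_TIMEOUT_S = 10    # how long before we assume the network is gone
DEFAULT_GREEN_S = 20      # autonomous fallback duration

class SafeTrafficLight:
    def __init__(self):
        self.last_heartbeat = float("-inf")
        self.proposed_green_s = DEFAULT_GREEN_S

    def receive_proposal(self, green_s: float) -> None:
        """Called over the secure channel; clamp to the allowed window."""
        self.proposed_green_s = min(MAX_GREEN_S, max(MIN_GREEN_S, green_s))
        self.last_heartbeat = time.monotonic()

    def next_green_duration(self) -> float:
        """Firmware calls this each cycle; ignores stale or absent controllers."""
        if time.monotonic() - self.last_heartbeat > NETWORK_TIMEOUT_S:
            return DEFAULT_GREEN_S  # autonomous operation
        return self.proposed_green_s
```

The throughput sensors then just become the feedback signal for whatever optimizer sits on the other side of receive_proposal, without that optimizer ever being able to violate the timing guarantees.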
Comparing AI Safety-Capabilities Dilemmas to Jervis' Cooperation Under the Security Dilemma
I've been skimming some things about the Security Dilemma (specifically Offense-Defense Theory) while looking for analogies for strategic dilemmas in the AI landscape.
I want to describe a simple comparison here, lightly held (and only lightly studied)
Copying some brief thoughts on how I think working on automated theorem proving relates to working on aligned AGI:
The ELK paper is long but I've found it worthwhile, and after spending a bit of time noodling on it, one of my takeaways is that I think it describes what is essentially a failure mode for the approaches to factored cognition I've been interested in. (Maybe it's a failure mode in factored cognition generally.)
I expect that I’ll want to spend more time thinking about ELK-like problems before spending a bunch more time thinking about factored cognition.
In particular it's now probably a good time to start separating a bunch of things I had jumbled together, namely:
100 Year Bunkers
I often hear that building bio-proof bunkers would be good for bio-x-risk, but it seems like not a lot of progress is being made on these.
It's worth mentioning a bunch of things that I think make this hard for me to think about:
Philosophical progress I wish would happen:
Starting from the Callard version of Aspiration (how should we reason/act about things that change our values).
Extend it to generalize to all kinds of values shifts (not just the ones desired by the agent).
Deal with the case of adversaries (other agents in your environment want to change your values)
Figure out a game theory (what does it mean to optimally act in an environment where me & others are changing my values / how can I optimally act)
Figure out what this means for corrigibility (e.g. is corrigibility ...
Hacking the Transformer Prior
Neural Network Priors
I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.
Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.
Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models). This includes producing more interpretable models.
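As a toy illustration of the kind of knob I mean: for an ordinary MLP, the initialization scale and the weight decay strength both change which functions you're likely to get out of the network. A minimal sketch (PyTorch; everything here is a toy stand-in, not a claim about the prior of any real large model):

```python
# Toy illustration of two "knobs" on a network's prior: initialization scale
# and weight decay. Not a claim about how to align a real model's prior.
import torch
import torch.nn as nn

def make_mlp(init_scale: float) -> nn.Sequential:
    model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    for p in model.parameters():
        nn.init.normal_(p, std=init_scale)  # knob 1: prior over functions at init
    return model

x = torch.linspace(-3, 3, 200).unsqueeze(1)
with torch.no_grad():
    # "Sample from the prior" by re-initializing and looking at the outputs.
    for scale in (0.1, 1.0, 3.0):
        outputs = torch.stack([make_mlp(scale)(x).squeeze() for _ in range(5)])
        print(f"init std {scale}: output std {outputs.std().item():.3f}")

# knob 2: weight decay biases training toward lower-norm (often simpler) functions.
model = make_mlp(1.0)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```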
Analogy to Software Devel...
There recently was a COVID* outbreak at an AI community space.
>20 people tested positive on nucleic tests, but none of the (only five) people who took PCR tests came back positive.
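Before getting to the possibilities, here's a quick back-of-envelope on how surprising this combination of results is. All the test characteristics and counts below are assumptions made up for illustration, not the real specs of any test:

```python
# Back-of-envelope on how surprising "20+ positive, 5/5 PCR negative" is.
# All sensitivities, false-positive rates, and counts below are assumptions
# made up for illustration, not measured characteristics of any real test.
from math import comb

def p_all_negative(n_tests: int, sensitivity: float) -> float:
    """P(every PCR misses) if all n people are truly infected, independent errors."""
    return (1 - sensitivity) ** n_tests

for sens in (0.7, 0.85, 0.95):
    print(f"PCR sensitivity {sens:.2f}: P(5/5 negative | infected) = "
          f"{p_all_negative(5, sens):.1e}")

def p_at_least_k_false_positives(n: int, k: int, fpr: float) -> float:
    """P(>= k false positives among n uninfected people), simple binomial model."""
    return sum(comb(n, i) * fpr**i * (1 - fpr)**(n - i) for i in range(k, n + 1))

# e.g. assuming ~100 people tested and a 1% per-test false positive rate:
print(p_at_least_k_false_positives(100, 20, 0.01))
```

Under these naive independence assumptions both stories come out very unlikely, so the independence assumptions themselves are probably the first thing to question.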
Thinking out loud about possibilities here:
I engage too much w/ generalizations about AI alignment researchers.
Noticing this behavior seems useful for analyzing it and strategizing around it.
A sketch of a pattern to be on the lookout for in particular is "AI Alignment researchers make mistake X" or "AI Alignment researchers are wrong about Y". I think in the extreme I'm pretty activated/triggered by this, and this causes me to engage with it to a greater extent than I would have otherwise.
This engagement is probably encouraging more of this to happen, so I think more of a pause and reflection...
I’ve been thinking more about Andy Jones’ writeup on the need for engineering.
In particular, my inside view is that engineering isn’t that difficult to learn (compared to research).
More specifically, I think the gap between being good at math/coding and being good at engineering is small. I agree that one of the problems here is that a huge part of the gap is tacit knowledge.
I’m curious about what short/cheap experiments could be run in/around lightcone to try to refute this — or at the very least support the “it’s possible to quickly/densely transfer engineering ...
AGI technical domains
When I think about trying to forecast technology for the medium-term future, especially for AI/AGI progress, the forecast often crosses a bunch of technical boundaries.
These boundaries are interesting in part because they're thresholds where my expertise and insight falls off significantly.
Also interesting because they give me topics to read about and learn.
A list which is probably neither comprehensive, nor complete, nor all that useful, but just writing what's in my head:
"Bet Your Beliefs" as an epistemic mode-switch
I was just watching this infamous interview w/ Patrick Moore where he seems to be doing some sort of epistemic mode switch (the "weed killer" interview)[0]
Moore appears to go from "it's safe to drink a cup of glyphosate" to (being offered the chance to do that) "of course not / I'm not stupid".
This switching between what seems to be a tribal-flavored belief (glyphosate is safe) and a self-protecting belief (glyphosate is dangerous) is what I'd like to call an epistemic mode-switch. In particular, it's a c...
I wish more of the language alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.
(I am one of said folks so maybe hypocritical to not work on it)
In particular it seems like there are ways in which they would be more interpretable than transformers:
The Positive and the Negative
I work on AI alignment, in order to solve problems of X-Risk. This is a very "negative" kind of objective.
Negatives are weird. Don't do X, don't be Y, don't cause Z. They're nebulous and sometimes hard to point at and move towards.
I hear a lot of a bunch of doom-y things these days. From the evangelicals, that this is the end times / end of days. From environmentalists that we are in a climate catastrophe. From politicians that we're in a culture war / edging towards a civil war. From t...
More Ideas or More Consensus?
I think one aspect you can examine about a scientific field is its "spread"-ness of ideas and resources.
High energy particle physics is an interesting extreme here -- there's broad agreement in the field about building higher energy accelerators, and this means there can be lots of consensus about supporting a shared collaborative high energy accelerator.
I think a feature of mature scientific fields is that "more consensus" can unlock more progress. Perhaps if there had been more consensus, the otherwise ill-fated supercond...
Decomposing Negotiating Value Alignment between multiple agents
Let's say we want two agents to come to agreement on living with each other. This seems pretty complex to specify; they agree to take each other's values into account (somewhat), not destroy each other (with some level of confidence), etc.
Neither initially has total dominance over the other. (This implies that neither is corrigible to the other)
A good first step for these agents is for each to share its values with the other. While this could be intractably complex -- it's probably ...
Thinking more about ELK. Work in progress, so I expect I will eventually figure out what's up with this.
Right now it seems to me that Safety via Debate would elicit compact/non-obfuscated knowledge.
So the basic scenario is that in addition to SmartVault, you'd have Barrister_Approve and Barrister_Disapprove, who are trying to share evidence/reasoning which makes the human approve or disapprove of SmartVault scenarios.
The biggest weakness of this that I know of is Obfuscated Arguments -- that is, it won't elicit obfuscated knowledge.
It seems like in t...
Some thoughts on Gradient Hacking:
One, I'm not certain the entire phenomenon of an agent meta-modifying its objective or otherwise influencing its own learning trajectory is bad. When I think about what this is like on the inside, I have a bunch of examples where I do this. Almost all of them are in a category called "Aspirational Rationality", which is a sub-topic of Rationality (the philosophy, not the LessWrong kind): https://oxford.universitypressscholarship.com/view/10.1093/oso/9780190639488.001.0001/oso-9780190639488
(I really wish we explored ...