If your endgame strategy involved relying on OpenAI, DeepMind, or Anthropic to implement your alignment solution that solves science / super-cooperation / nanotechnology, consider figuring out another endgame plan.
if a lab has 100 million AI employees and 1000 human employees, then you only need one human employee to spend 1% of their allotted AI headcount on your pet project and you’ll have 1000 AI employees
That isn't anyone's first/preferred plan. I assure you everyone born in a liberal democracy has considered another plan before arriving at that one.
I've become somewhat pessimistic about encouraging regulatory power over AI development recently after reading this Bismarck Analysis case study on the level of influence (or lack of it) that scientists had over nuclear policy.
The impression I got from some other secondary/tertiary sources (specifically the book Organizing Genius) was that General Groves, the military man who was the interface between the military and Oppenheimer and the Manhattan Project, did his best to shield the Manhattan Project scientists from military and bureaucratic drudgery, and that Vannevar Bush was someone who served as an example of a scientist successfully steering policy.
This case study seems to show that Groves was significantly less of a value add than I thought, given the likelihood that he destroyed Leo Szilard's political influence (and therefore Szilard's ability to push nuclear policy toward preventing an arms race or preventing the bomb's use in war). Bush also seems like a disappointment -- he waited months for information to pass through 'official channels' before he attempted to persuade people like FDR to begin a nuclear weapons development program. On top of that, Bush seems to have internalized the bureaucratic norms of the political and military hierarchy he worked in -- when a scientist named Ernest Lawrence tried to reach the relevant government officials to talk about the importance of nuclear weapons development, Bush (according to this paper) was so annoyed by Lawrence seemingly bypassing the 'chain of command' (I assume by talking to people Bush reported to, instead of to Bush himself) that he threatened to politically marginalize him.
Finally, I see clear parallels between the ineffective attempts by these physicists at influencing nuclear weapons policy via contributing technically and trying to build 'political capital', and the ineffective attempts by AI safety engineers and researchers who decide to go work at frontier labs (OpenAI is the clearest example) with the intention of building influence with the people in there so that they can steer things in the future. I'm pretty sure at this point that such a strategy is a pretty bad idea, given that it seems better to do nothing than to contribute to accelerating towards ASI.
There are galaxy-brained counter-arguments to this claim, such as davidad's supposed game-theoretic model that (AFAICT) involves accelerating to AGI powerful enough to make the provable safety agenda viable, or Paul Christiano's (again, AFAICT) plan that's basically 'given intense economic pressure for better capabilities, we shall see a steady and continuous improvement, so the danger actually is in discontinuities that make it harder for humanity to react to changes, and therefore we should accelerate to reduce compute overhang'. I remain unconvinced by them.
I'm not actually seeing where deep expertise on nuclear weapons technology would qualify anybody to have much special input into nuclear weapons policy in general. There just don't seem to be that many technical issues compared to the vast number of political ones.
I don't know if that applies to AI, but I tend to think the two are different.
I agree with your argument here, especially your penultimate paragraph, but I'll nitpick that framing your disagreements with Groves as him being "less of a value add" seems wrong. The value that Groves added was building the bomb, not setting diplomatic policy.
given intense economic pressure for better capabilities, we shall see a steady and continuous improvement, so the danger actually is in discontinuities that make it harder for humanity to react to changes, and therefore we should accelerate to reduce compute overhang
I don't feel like this is actually a counterargument? You could agree with both arguments, concluding that we shouldn't work for OpenAI but that an outfit better aligned with your values is okay.
As of right now, I expect we have at least a decade, perhaps two, until we get an AI that generalizes at the level of human intelligence (which is what I consider AGI). This is a controversial statement in these social circles, and I don't have the bandwidth or resources to write a concrete and detailed argument, so I'll simply state an overview here.
Scale is the key variable driving progress to AGI. Human ingenuity is irrelevant. Lots of people believe they know the one last piece of the puzzle to get AGI, but I increasingly expect the missing pieces to be too alien for most researchers to stumble upon just by thinking about things without doing compute-intensive experiments.
Scale shall increasingly require more and larger datacenters and a lot of power. Humanity's track record at accomplishing megaprojects is abysmal. If we find ourselves needing to build city-sized datacenters (with all the required infrastructure to maintain and supply them), I expect that humanity will take twice the initially estimated time and resources to build something with 80% of the planned capacity.
So the main questions for me, given my current model, are these:
Both questions are very hard to answer with rigor I'd consider adequate given their importance. If you did press me to answer, however: my intuition is that we'd need at least three more OOMs, and that each additional OOM gets harder, which I approximate as a doubling of the time taken per OOM. Given that Epoch's historical trends imply it takes two years for one OOM, I'd expect we roughly have at least 2 + 4 + 8 = 14 more years before the labs stumble upon a proto-Clippy.
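A minimal sketch of that arithmetic, assuming ~2 years for the first additional OOM (the Epoch trend cited above), a doubling of time for each further OOM, and three OOMs needed; these are the comment's own assumptions, not independent data:

```python
# Toy arithmetic for the timeline estimate above.
years_for_first_oom = 2   # from Epoch's historical trend (as cited above)
ooms_needed = 3           # the "at least three OOMs" intuition

total_years = sum(years_for_first_oom * 2**i for i in range(ooms_needed))
print(total_years)        # 2 + 4 + 8 = 14
```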
The current scaling speed is created by increasing funding for training projects, which isn't sustainable without continued success. Without this, the speed drops to the much slower FLOP/dollar trend of improving cost efficiency of compute, that is, making better AI accelerators. The 2 + 4 + 8 years estimate might describe a gradual increase in funding, but there are still 2 OOMs of training compute beyond the original GPT-4 that are already baked into the scale of the datacenters being built and haven't yet produced deployed models. We'll only observe this in full by late 2026, so current capabilities don't yet reflect the capability level attainable before a possible scaling slowdown.
you say "Human ingenuity is irrelevant. Lots of people believe they know the one last piece of the puzzle to get AGI, but I increasingly expect the missing pieces to be too alien for most researchers to stumble upon just by thinking about things without doing compute-intensive experiments." and you link https://tsvibt.blogspot.com/2024/04/koan-divining-alien-datastructures-from.html for "too alien for most researchers to stumble upon just by thinking about things without doing compute-intensive experiments"
I feel like that post and that statement are in contradiction/tension, or at best orthogonal.
I think Mesa is saying something like "The missing pieces are too alien for us to expect to discover them by thinking/theorizing but we'll brute-force the AI into finding/growing those missing pieces by dumping more compute into it anyway." and Tsvi's koan post is meant to illustrate how difficult it would be to think oneself into those missing pieces.
It seems like a significant amount of decision theory progress happened between 2006 and 2010, and since then progress has stalled.
Counterfactual mugging was invented independently by Gary Drescher in 2006, and by Vladimir Nesov in 2009.
Counterlogical mugging was invented by Vladimir Nesov in 2009.
The "agent simulates predictor" problem (now popularly known as the commitment races problem) was invented by Gary Drescher in 2010.
The "self-fulfilling spurious proofs" problem (now popularly known as the 5-and-10 problem) was invented by Benja Fallenstein in 2010.
Updatelessness was first proposed by Wei Dai in 2009.
Yeah, it seems like a bunch of low-hanging fruit was picked around that time, but that opened up a vista of new problems that are still out of reach. I wrote a post about this, which I don't know if you've seen or not.
(This has been my experience with philosophical questions in general, that every seeming advance just opens up a vista of new harder problems. This is a major reason that I switched my attention to trying to ensure that AIs will be philosophically competent, instead of object-level philosophical questions.)
Thanks for the link. I believe I read it a while ago, but it is useful to reread it from my current perspective.
trying to ensure that AIs will be philosophically competent
I think such scenarios are plausible: I know some people argue that certain decision theory problems cannot be safely delegated to AI systems, but if we as humans can work on these problems safely, I expect that we could probably build systems that are about as safe (by crippling their ability to establish subjunctive dependence) but are also significantly more competent at philosophical progress than we are.
I think I've been (slowly) making progress.
I think we would be able to make progress on this if people seriously wanted to make progress, but understandably it's not the highest priority.
Project proposal: EpochAI for compute oversight
Detailed MVP description: a website with an interactive map that shows the locations of high-risk datacenters globally, with relevant information appearing when you click the icons on the map. Examples of relevant information: the organizations and frontier labs that have access to this compute, the effective FLOPS of the datacenter, and how long it would take to train a SOTA model in that datacenter.
High-risk datacenters are those capable of training current- or next-generation SOTA AI systems.
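To make the MVP concrete, here is a minimal sketch of what one map entry might contain; every field name and value is a hypothetical placeholder, not drawn from any existing dataset or API:

```python
from dataclasses import dataclass

@dataclass
class DatacenterEntry:
    name: str
    latitude: float
    longitude: float
    operators: list[str]       # organizations / frontier labs with access to this compute
    effective_flops: float     # effective FLOP/s usable for training
    sota_training_days: float  # rough time to train a current SOTA model here

# Purely illustrative numbers for a hypothetical campus.
example = DatacenterEntry(
    name="Example Campus",
    latitude=0.0,
    longitude=0.0,
    operators=["Example Lab"],
    effective_flops=1e19,
    sota_training_days=90.0,
)
```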
Why:
Thoughts? I've been playing around with the idea of building it, but have been uncertain about how useful this would be, since I don't have enough interaction with the AI alignment policy people here. Posting it here is an easy test to see whether it is worth greater investment or prioritization.
Note: I'm uncertain as to whether dual-use issues exist here. I expect that datacenter builders and frontier labs probably already have a very good model of the global compute distribution, so this would benefit regulatory efforts significantly more than it would help anyone strategically allocate training compute.
Collections of datacenter campuses sufficiently connected by appropriate fiber-optic links should probably count as one entity for purposes of estimating training potential, even in the current synchronous training paradigm. My impression is that laying such fiber is both significantly easier and significantly cheaper than building power plants or setting up long-distance power transmission in the multi-GW range.
Thus for training 3M GPUs/6GW scale models ($100 billion in infrastructure, $10 billion in cost of training time), hyperscalers "only" need to upgrade the equipment and arrange for "merely" on the order of 1GW in power consumption at multiple individual datacenter campuses connected to each other, while everyone else is completely out of luck. This hypothetical advantage makes collections of datacenter campuses an important unit of measurement, and also it would be nice to have a more informed refutation or confirmation that this is a real thing.
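A back-of-the-envelope check of those figures; all inputs are the numbers quoted above, not independent estimates:

```python
gpus = 3_000_000           # "3M GPUs"
total_power_gw = 6.0       # "6GW scale"
infra_cost_usd = 100e9     # "$100 billion in infrastructure"

power_per_gpu_kw = total_power_gw * 1e6 / gpus  # ~2 kW per GPU, all-in
cost_per_gpu_usd = infra_cost_usd / gpus        # ~$33k of infrastructure per GPU
campuses_at_1gw = total_power_gw / 1.0          # ~6 campuses at ~1 GW each
print(power_per_gpu_kw, cost_per_gpu_usd, campuses_at_1gw)
```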
Seems like a useful resource to have out there. Something else that would be nice to have is details about the security of the data center - but there's probably limited information that could be included.[1]
Because you probably don't want too many details about your infosec protocols out there for the entire internet to see.
I notice that I find Valentine's posts somewhat insightful, and believe they point at incredibly neglected research directions, but a huge distance seems to exist between what Valentine intends to communicate and what most readers seem to get.
Off the top of my head:
The reason I wrote this down is that I think Valentine (and other people reading this) might find it helpful, and I didn't feel like it made sense to post this as a comment on any specific individual post.
It has been six months since I wrote this, and I want to note an update: I now grok what Valentine is trying to say and what he is pointing at in Here's the Exit and We're already in AI takeoff. That is, I have a detailed enough model of Valentine's model of the things he talks about, such that I understand the things he is saying.
I still don't feel like I understand Kensho. I get the pattern of the epistemic puzzle he is demonstrating, but I don't know if I get the object-level thing he points at. Based on a reread of the comments, maybe what Valentine means by Looking is essentially gnosis, as opposed to doxa. An understanding grounded in your experience rather than an ungrounded one you absorbed from someone else's claims. See this comment by someone else who is not Valentine in that post:
The fundamental issue is that we are communicating in language, the medium of ideas, so it is easy to get stuck in ideas. The only way to get someone to start looking, insofar as that is possible, is to point at things using words, and to get them to do things. This is why I tell you to do things like wave your arms about or attack someone with your personal bubble or try to initiate the action of touching a hot stove element.
Alternately, Valentine describes the process of Looking as "direct embodied perception prior to thought":
Most of that isn’t grounded in reality, but that fact is hard to miss because the thinker isn’t distinguishing between thoughts and reality.
Looking is just the skill of looking at reality prior to thought. It’s really not complicated. It’s just very, very easy to misunderstand if you fixate on mentally understanding it instead of doing it. Which sadly seems to be the default response to the idea of Looking.
I am unsure if this differs from mundane metacognitive skills like "notice the inchoate cognitions that arise in your mind-body, that aren't necessarily verbal". I assume that Valentine is pointing at a certain class of cognition, one that is essentially entirely free of interpretation. Or perhaps before 'value-ness' is attached to an experience -- such as "this experience is good because <elaborate strategic chain>" or "this experience is bad because it hurts!"
I understand how a better metacognitive skillset would lead to the benefits Valentine mentioned, but I don't think it requires you to only stay at the level of "direct embodied perception prior to thought".
As for kensho, it seems to be a term for some skill that leads you to be able to do what romeostevensit calls 'fully generalized un-goodharting':
I may have a better answer for the concrete thing that it allows you to do: it’s fully generalizing the move of un-goodharting. Buddhism seems to be about doing this for happiness/inverse-suffering, though in principle you could pick a different navigational target (maybe).
Concretely, this should show up as being able to decondition induced reward loops and thus not be caught up in any negative compulsive behaviors.
I think that "fully generalized un-goodharting" is a pretty vague phrase and I could probably come up with a better one, but it is an acceptable pointer term for now. So I assume it is something like 'anti-myopia'? Hard to know at this point. I'd need more experience and experimentation and thought to get a better idea of this.
I believe that Here's the Exit, We're already in AI Takeoff, and Slack matters more than any outcome were all pointing at the same cluster of skills and thought -- about realizing the existence of psyops, systematic vulnerabilities or issues that lead you (whatever 'you' means) to forget the 'bigger picture', and that the resulting myopia causes significantly bad outcomes from the perspective of the 'whole' individual/society/whatever.
In general, Lexicogenesis seems like a really important sub-skill for deconfusion.
And yet, I get the sentiment that Valentine seems to have been trying to communicate—it sure seems like there are epistemic rationality techniques that seem incredibly valuable and neglected, and one could discover them in the course of doing something about as useless as paperwork, and talking about how you became more efficient at paperwork would seem like a waste of time to everyone involved.
Is this a real example or one that you’ve made up? That is, do you actually have cases in mind where someone discovered valuable and neglected epistemic rationality techniques in the course of doing paperwork?
I apologize for not providing a good enough example -- yes, it was made up. Here's a more accurate explanation of what causes me to believe that Valentine's sentiment has merit:
thinking that people must be suspended upside down below the equator, once someone understands the notion of an approximately spherical Earth
That page seems to be talking about a four-year-old child, who has not yet learned about space, how gravity works, etc. It’s not clear to me that there’s anything to conclude from this about what sorts of epistemic rationality techniques might be useful to adults.
More importantly, it’s not clear to me how any of your examples are supposed to be examples of “epistemic confusion [that] can be traced to almost unrelated upstream misconceptions”. Could you perhaps make the connection more explicitly?
Similarly, it seems plausible to me that while attempting to fix one issue (similar to attempting to fix a confusion of the sort just listed), one could find themselves making almost unrelated upstream epistemic discoveries that might just be significantly more valuable.
And… do you have any examples of this?
It also seems that a lot of rationality skill involves starting out with a bug one notices (“hey, I seem to be really bad at going to the gym”), and then making multiple attempts to fix the problem (ideally focusing on making an intervention as close to the ‘root’ of the issue as possible), and then discovering epistemic rationality techniques that may be applicable in many places.
There’s a lot of “<whatever> seems like it could be true” in your comment. Are you really basing your views on this subject on nothing more than abstract intuition?
I agree that it seems like a really bad strategy to not explain why the technique is useful by giving another example where it results in good object-level outcomes, and to instead (given my original example) talk about paperwork for a sentence and then spend paragraphs discussing some rationality technique in the abstract.
If, hypothetically, you discovered some alleged epistemic rationality technique while doing paperwork, I would certainly want you to either explain how you applied this technique originally (with a worked example involving your paperwork), or explain how the reader might (or how you did) apply the technique to some other domain (with a worked example involving something else, not paperwork), or (even better!) both.
It would be very silly to just talk about the alleged technique, with no demonstration of its purported utility.
If, hypothetically, you discovered some alleged epistemic rationality technique while doing paperwork, I would certainly want you to either explain how you applied this technique originally (with a worked example involving your paperwork), or explain how the reader might (or how you did) apply the technique to some other domain (with a worked example involving something else, not paperwork), or (even better!) both.
This seems sensible, yes.
It would be very silly to just talk about the alleged technique, with no demonstration of its purported utility.
I agree that it seems silly to not demonstrate the utility of a technique when trying to discuss it! I try to give examples to support my reasoning when possible. What I attempted to do with that one passage that you seemed to have taken offense to was show that I could guess at one causal cognitive chain that would have led Valentine to feel the way they did and therefore act and communicate the way they did, not that I endorse the way Kensho was written -- because I did not get anything out of the original post.
There’s a lot of “<whatever> seems like it could be true” in your comment.
Here's a low-investment attempt to point at the cause of what seems to you like a verbal tic:
I can tell you that when I put “it seems to me” at the front of so many of my sentences, it’s not false humility, or insecurity, or a verbal tic. (It’s a deliberate reflection on the distance between what exists in reality, and the constellations I’ve sketched on my map.)
If you need me to write up a concrete elaboration to help you get a better idea about this, please tell me.
Are you really basing your views on this subject on nothing more than abstract intuition?
My intuitions on my claim related to rationality skill are informed by concrete personal experience, which I haven't yet described at length, mainly because I expected that using a simple, plausible, made-up example would serve as well. I apologize for not adding a "(based on experience)" in that original quote, although I guess I assumed that was deducible.
That page seems to be talking about a four-year-old child, who has not yet learned about space, how gravity works, etc. It’s not clear to me that there’s anything to conclude from this about what sorts of epistemic rationality techniques might be useful to adults.
I'm specifically pointing at examples of deconfusion here, which I consider the main (and probably the only?) strand of epistemic rationality techniques. I concede that I haven't provided you useful information about how to do it -- but that isn't something I'd like to get into right now, when I am still wrapping my mind around deconfusion.
More importantly, it’s not clear to me how any of your examples are supposed to be examples of “epistemic confusion [that] can be traced to almost unrelated upstream misconceptions”. Could you perhaps make the connection more explicitly?
For the gravity example, the 'upstream misconception' is that the kid did not realize that 'up and down' is relative to the direction in which Earth's gravity acts on the body, and therefore the kid tries to fit the square peg of "Okay, I see that humans have heads that point up and legs that point down" into the round hole of "Below the equator, humans are pulled upward, and humans' heads are up, so humans' heads point to the ground".
For the AI example, the 'upstream misconception' can be[1] conflating the notion of intelligence with 'human behavior and tendencies that I recognize as intelligence' (and this in turn can be due to other misconceptions, such as not understanding how alien the selection process underlying evolution is; not understanding that intelligence is not the same as saying impressive things at a social party, but rather the ability to squeeze the probability distribution of future outcomes into a smaller space; et cetera), then making a reasoning error that amounts to anthropomorphizing an AI, and concluding that the more intelligent a system is, the more it would care about the 'right things' that we humans seem to care about.
The second example is a bit expensive to elaborate on, so I will not do so right now. I apologize.
Anyway, I intended to write this stuff up when I felt like I understood deconfusion enough that I could explain it to other people.
Similarly, it seems plausible to me that while attempting to fix one issue (similar to attempting to fix a confusion of the sort just listed), one could find themselves making almost unrelated upstream epistemic discoveries that might just be significantly more valuable.
And… do you have any examples of this?
I find this plausible based on my experience with deconfusion and my current state of understanding of the skill. I do not believe I understand deconfusion well enough to communicate it to people across an inferential distance as huge as the one between you and me, so I do not intend to try.
[1]: There are a myriad of ways you can be confused, and only one way you can be deconfused.
I notice that Joe Carlsmith dropped a 127-page paper on the question of deceptive alignment. I am confused; who is the intended audience of this paper?
AFAICT nobody would actually read all 127 pages of the report, and most potential reasons for writing the report seem, to me, better served by faster feedback loops and significantly smaller research artifacts.
What am I missing?
Often I write big boring posts so I can refer to my results in shorter, more readable posts later on. That way if anyone cares and questions my result they can see the full argument, without impairing readability on the focal post.
My model is that a text like this often is in substantial parts an artifact of the author's personal understanding. But also, my model of Open Phil employees totally read 100-page documents all the time.
I had the impression that SPAR was focused on UC Berkeley undergrads and had therefore dismissed the idea of being a SPAR mentor or mentee. It was only recently, when someone mentioned that they wanted to learn from one particular SPAR mentor, that I looked at the website, and SPAR now seems to focus on the same niche as AI Safety Camp.
Did SPAR pivot in the past six months, or did I just misinterpret SPAR when I first encountered it?
I've noticed that there are two major "strategies of caring" used in our sphere:
Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soares-style caring (which in essence amounts to "shut up and multiply", and consequentialism) combined with inattention or an inaccurate map of the world (aka broken epistemics) can lead to making severely sub-optimal decisions.
The harder you optimize for a goal, the better your epistemology (and by extension, your understanding of your goal and the world) should be. Carlsmith-style caring seems more effective since it is very likely more robust to bad epistemology than Soares-style caring.
(There are more pieces necessary to make Carlsmith-style caring viable, and a lot of them can be found in Soares' "Replacing Guilt" series.)
Does this come from a general idea that "optimizing hard" means higher risk of damage caused by errors in detail, and "optimizing soft" has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective (if both are actually implemented well)?
a general idea that “optimizing hard” means higher risk of damage caused by errors in detail
Agreed.
“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective
I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In this model of caring I am trying to make more legible, I believe that Carlsmith-style caring may be more robust to certain epistemological errors humans can make that can result in severely sub-optimal scenarios, because it is constrained by human cognition and capabilities.
Note: I notice that this can also be said for Soares-style caring -- both are constrained by human cognition and capabilities, but in different ways. Perhaps both have different failure modes, and are more effective in certain distributions (which may diverge)?
Backing up a step, because I'm pretty sure we have different levels of knowledge and assumptions (mostly my failing) about the differences between "hard" and "soft" optimizing.
I should acknowledge that I'm not particularly invested in EA as a community or identity. I try to be effective, and do some good, but I'm exploring rather than advocating here.
Also, I don't tend to frame things as "how to care", so much as "how to model the effects of actions, and how to use those models to choose how to act". I suspect that's isomorphic to how you're using "how to care", but I'm not sure of that.
All that said, I think of "optimizing hard" as truly taking seriously the "shut up and multiply" results, even where it's uncomfortable epistemically, BECAUSE that's the only way to actually do the MOST POSSIBLE good. Actually OPTIMIZING, you know? "Soft" is almost by definition less ambitious, BECAUSE it's epistemically more conservative, and gives up average expected value in order to increase modal goodness in the face of that uncertainty. I don't actually know if those are the positions taken by those people. I'd love to hear different definitions of "hard" and "soft", so I can better understand why they're both equal in impact.
I predict this is not really an accurate representation of Soares-style caring. (I think there is probably some vibe difference between these two clusters that you're tracking, but I doubt Nate Soares would advocate "overriding" per se.)
I doubt Nate Soares would advocate “overriding” per se
Acknowledged, that was an unfair characterization of Nate-style caring. I guess I wanted to make explicit two extremes. Perhaps using the name "Nate-style caring" is a bad idea.
(I now think that "System 1 caring" and "System 2 caring" would have been much better.)
2022-08; Jan Leike, John Schulman, Jeffrey Wu; Our Approach to Alignment Research
OpenAI's strategy, as of the publication of that post, involves scalable alignment approaches. Their philosophy is to take an empirical and iterative approach[1] to finding solutions to the alignment problem. Their strategy for alignment is cyborgism: they create AI models that are capable and aligned enough to advance alignment research to the point where they can align even more capable models.[2]
Their research focus is on scalable approaches to directing models[3]. This means that the core of their strategy involves RLHF. They don't expect RLHF to be sufficient on its own, but it is necessary for the other scalable alignment strategies they are looking at[4].
They intend to augment RLHF with AI-assisted, scaled-up evaluation (ensuring RLHF isn't bottlenecked by a lack of accurate evaluation data for tasks too onerous for unassisted humans to evaluate)[5].
Finally, they intend to use these partially-aligned models to do alignment research, since they anticipate that alignment approaches that work for low-capability models may not be adequate for models with higher capabilities.[6] They intend to use the AI-based evaluation tools both to RLHF-align models and as part of a process where humans evaluate alignment research produced by these LLMs (here's the cyborgism part of the strategy).[7]
Their "Limitations" section of their blog post does clearly point out the vulnerabilities in their strategy:
[1]: We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned.
[2]: We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.
[3]: At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent.
[4]: We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.
[5]: RL from human feedback has a fundamental limitation: it assumes that humans can accurately evaluate the tasks our AI systems are doing. Today humans are pretty good at this, but as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth.
[6]: There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.
We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.
[7]: We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.
Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That's a great strategy to advertise to the very people they want to reach.
I recently had to solve a Captcha to submit a reddit post using a new reddit account I made (because I had not used reddit until now). It was an extremely Kafkaesque experience: I tried the Captcha in good faith, and Google repeatedly told me I had done it incorrectly but did not explain why. This went on for multiple minutes, and I kept being told I was doing it wrong, even though I kept clicking on all the right boxes that contained parts of a bicycle or a motorcycle or whatever. The slow fade-in and fade-out images were the worst, and I consider this a form of low-level torture when you are made to do it for extended periods of time.
I admit that I use an unusual browser setup: portrait mode, OpenBSD amd64 OS, Mozilla Firefox with uBlock Origin, and an external keyboard whose arrow keys I use to control the mouse most of the time. I expect that such an out-of-distribution setup may have led the Captcha AI to be suspicious of me. All of this was intended to improve my experience of using my machine and interfacing with the Internet. Worse, I was already signed into my Google account, so it didn't make sense that Google would still suspect me of being a bot.
I've decided on a systemic solution for this problem:
One could interpret this as adversarial action against Google and Reddit, but it seems to me that when dealing with an optimizer that is taking constant adversarial action against you, and is credibly unwilling to attempt to co-operate and solve the problem you both face, the next step is to defect. Ideally you extricate yourself from the situation, but in some cases that isn't acceptable given your goals.
I expect that people who are paid to solve captchas probably are numb to this, or have been trained by the system to solve captchas more efficiently, such that they may be optimized for dealing with its Kafkaesque nature. I do not expect to feel like I would be putting them through the pain I would have experienced. I still do not consider it an ideal state of affairs, though.
Yeah, my understanding of how bot detection on lots of these sites work is they track your mouse, then do a simple classification scheme on mouse movements to differentiate between bots and humans. So it's no surprise that moving your mouse with your arrow keys would make the classifier very suspicious.
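A minimal sketch of the kind of mouse-movement classifier described above, assuming a toy feature set and toy labels; real bot-detection systems are proprietary and surely more sophisticated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def movement_features(points: np.ndarray) -> np.ndarray:
    """points: (n, 3) array of (x, y, timestamp) samples for one session."""
    dxy = np.diff(points[:, :2], axis=0)
    dt = np.diff(points[:, 2]) + 1e-9
    speeds = np.linalg.norm(dxy, axis=1) / dt
    # Scripted or keyboard-driven cursors tend to have unnaturally uniform
    # speeds and perfectly straight segments; summarize that in a few numbers.
    return np.array([speeds.mean(), speeds.std(), np.abs(np.diff(speeds)).mean()])

# Toy sessions and labels (1 = human, 0 = bot); a real system would use logged traces.
rng = np.random.default_rng(0)
sessions = [rng.random((50, 3)).cumsum(axis=0) for _ in range(20)]
X = np.stack([movement_features(s) for s in sessions])
y = np.tile([0, 1], 10)

clf = RandomForestClassifier(n_estimators=50).fit(X, y)
print(clf.predict(X[:3]))
```

On this picture, an arrow-key-driven cursor would produce exactly the kind of uniform-speed, straight-line trajectories such a classifier flags as bot-like.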
Just a quote I find rather interesting, since it is rare to see a Hero's Journey narrative with a Return that involves the hero not knowing if he will ever belong or find meaning once he returns, and yet choosing to return, having faith in his ability to find meaning again:
If every living organism has a fixed purpose for its existence, then one thing's for sure. I [...] have completed my mission. I've fulfilled my purpose. But a great amount of power that has served its purpose is a pain to deal with, just like nuclear materials that have reached the end of their lifespan. If that's the case, there'll be a lot of questions. Would I now become an existence that this place doesn't need anymore?
The time will come when the question of whether it's okay for me to remain in this place will be answered.
However...
If there's a reason to remain in this place, then it's probably that there are still people that I love in this place.
And that people who love me are still here.
Which is why that's enough reason for me to stay here.
I'll stay here and find other reasons as to why I should stay here...
That's what I've decided on.
Causal influence diagrams are interesting, but don't really seem all that useful. Anyway, the latest formal graphical representation for agents that the authors seem to promote is structured causal models, so you don't read this paper for object-level usefulness but for the incidental research contributions that are really interesting.
The paper divides AI systems into two major frameworks:
I liked how lucidly they defined wireheading:
In the basic MDP from Figure 1, the reward parameter Θ_R is assumed to be unchanging. In reality, this assumption may fail because the reward function is computed by some physical system that is a modifiable part of the state of the world. [...] This gives an incentive for the agent to obtain more reward by influencing the reward function rather than optimizing the state, sometimes called wireheading.
The common definition of wireheading is informal enough that different people would map it to different specific formalizations in their head (or perhaps have no formalization and therefore be confused), and having this 'more formal' definition in my head seems rather useful.
Here's their distillation for Current RF-optimization, a strategy to avoid wireheading (which reminds me of shard theory, now that I think about it -- models that avoid wireheading by modelling effects of resulting changes to policy and then deciding what trajectory of actions to take):
An elegant solution to this problem is to use model-based agents that simulate the state sequence likely to result from different policies, and evaluate those state sequences according to the current or initial reward function.
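A minimal toy sketch of that idea, with made-up states, actions, and rewards (the paper's contribution is the diagram, not code like this): the agent scores each policy by simulating its state sequence and evaluating it with the current, unmodified reward function, so tampering with the reward signal stops looking attractive.

```python
def current_reward(state):
    return state["resources"]          # what the designer currently cares about

def simulate(state, action):
    state = dict(state)
    if action == "work":
        state["resources"] += 1
    elif action == "tamper":
        state["reward_hacked"] = True  # the *future* reward signal would read high,
                                       # but current_reward ignores this flag
    return state

def score(policy, horizon=5):
    state = {"resources": 0, "reward_hacked": False}
    total = 0.0
    for _ in range(horizon):
        state = simulate(state, policy(state))
        total += current_reward(state)  # evaluated with the unmodified reward
    return total

print(score(lambda s: "work"), score(lambda s: "tamper"))  # working scores higher than tampering
```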
Here's their distillation of Reward Modelling:
A key challenge when scaling RL to environments beyond board games or computer games is that it is hard to define good reward functions. Reward Modeling [Leike et al., 2018] is a safety framework in which the agent learns a reward model from human feedback while interacting with the environment. The feedback could be in the form of preferences, demonstrations, real-valued rewards, or reward sketches. [...] Reward modeling can also be done recursively, using previously trained agents to help with the training of more powerful agents [Leike et al., 2018].
The resulting CI diagram actually made me feel like I grokked Reward Modelling better.
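For concreteness, here is a minimal sketch of reward learning from pairwise preferences (a Bradley-Terry-style model, one of the feedback forms mentioned in the quote); the features, data, and update rule are toy assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
true_w = rng.normal(size=dim)        # stand-in for hidden human preferences
w = np.zeros(dim)                    # learned reward model parameters

def reward(weights, features):
    return weights @ features

for _ in range(2000):
    a, b = rng.normal(size=(2, dim))            # two candidate trajectories, as feature vectors
    # The "human" prefers whichever trajectory scores higher under true_w.
    preferred, other = (a, b) if reward(true_w, a) > reward(true_w, b) else (b, a)
    # Bradley-Terry likelihood that `preferred` beats `other` under the learned model.
    p = 1.0 / (1.0 + np.exp(-(reward(w, preferred) - reward(w, other))))
    w += 0.05 * (1.0 - p) * (preferred - other)  # gradient ascent on the log-likelihood
print(np.corrcoef(true_w, w)[0, 1])              # learned weights track the hidden ones
```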
Here's their distillation of CIRL:
Another way for agents to learn the reward function while interacting with the environment is Cooperative Inverse Reinforcement Learning (CIRL) [Hadfield-Menell et al., 2016]. Here the agent and the human inhabit a joint environment. The human and the agent jointly optimize the sum of rewards, but only the human knows what the rewards are. The agent has to infer the rewards by looking at the human’s actions.
The difference between the RM and CIRL causal influence diagrams is interesting because it is subtle. The authors imply that this minor difference matters and can imply different things about system incentives and therefore safety guarantees, and I am enthusiastic about such strategies for investigating safety guarantees.
The authors describe a wireheading-equivalent for QA systems called self-fulfilling prophecies:
The assumption that the labels are generated independently of the agent’s answer sometimes fails to hold. For example, the label for an online stock price prediction system could be produced after trades have been made based on its prediction. In this case, the QA-system has an incentive to make self-fulfilling prophecies. For example, it may predict that the stock will have zero value in a week. If sufficiently trusted, this prediction may lead the company behind the stock to quickly go bankrupt. Since the answer turned out to be accurate, the QA-system would get full reward. This problematic incentive is represented in the diagram in Figure 9, where we can see that the QA-system has both incentive and ability to affect the world state with its answer [Everitt et al., 2019].
They propose a solution to the self-fulfilling prophecies problem: make oracles optimize for reward in the counterfactual world where their answer doesn't influence the world state, and therefore doesn't influence the label they are optimized for. While that is a solution, I am unsure how one can get counterfactual labels for complicated questions whose answers may have far-reaching consequences in the world.
It is possible to fix the incentive for making self-fulfilling prophecies while retaining the possibility to ask questions where the correctness of the answer depends on the resulting state. Counterfactual oracles optimize reward in the counterfactual world where no one reads the answer [Armstrong, 2017]. This solution can be represented with a twin network [Balke and Pearl, 1994] influence diagram, as shown in Figure 10. Here, we can see that the QA-system’s incentive to influence the (actual) world state has vanished, since the actual world state does not influence the QA-system’s reward; thereby the incentive to make self-fulfilling prophecies also vanishes. We expect this type of solution to be applicable to incentive problems in many other contexts as well.
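A toy sketch of that mechanism (the environment, probabilities, and scoring are illustrative assumptions, not the paper's formalism): the oracle is only scored on episodes where its answer happens to go unread, so its prediction cannot influence the outcome it is graded against.

```python
import random

random.seed(0)

def world_outcome(answer_was_read, prediction):
    base = 0.7                       # probability the stock survives on its own
    if answer_was_read and prediction == "bankrupt":
        base -= 0.5                  # a trusted doom prediction hurts the company
    return "bankrupt" if random.random() > base else "survives"

def episode(prediction):
    answer_is_read = random.random() < 0.9        # answer withheld 10% of the time
    outcome = world_outcome(answer_is_read, prediction)
    if answer_is_read:
        return None                  # no training signal on read episodes
    return 1.0 if prediction == outcome else 0.0  # scored only in the counterfactual case

rewards = [r for r in (episode("bankrupt") for _ in range(10_000)) if r is not None]
print(sum(rewards) / len(rewards))   # ~0.3: "bankrupt" is not rewarded for being self-fulfilling
```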
The authors also anticipate this problem, but instead of considering whether and how one can tractably calculate counterfactual labels, they connect this intractability to introducing the debate AI safety strategy:
To fix this, Irving et al. [2018] suggest pitting two QA-systems against each other in a debate about the best course of action. The systems both make their own proposals, and can subsequently make arguments about why their own suggestion is better than their opponent’s. The system who manages to convince the user gets rewarded; the other system does not. While there is no guarantee that the winning answer is correct, the setup provides the user with a powerful way to poke holes in any suggested answer, and reward can be dispensed without waiting to see the actual result.
I like how they explicitly mention that there is no guarantee that the winning answer is correct, which makes me more enthusiastic about considering debate as a potential strategy.
They also have an incredibly lucid distillation of IDA. Seriously, this is significantly better than all the Paul Christiano posts I've read and the informal conversations I've had about IDA:
Iterated distillation and amplification (IDA) [Christiano et al., 2018] is another suggestion that can be used for training QA-systems to correctly answer questions where it is hard for an unaided user to directly determine their correctness. Given an original question Q that is hard to answer correctly, less powerful systems X_k are asked to answer a set of simpler questions Q_i. By combining the answers A_i to the simpler questions Q_i, the user can guess the answer Â to Q. A more powerful system X_{k+1} is trained to answer Q, with Â used as an approximation of the correct answer to Q.
Once the more powerful system X_{k+1} has been trained, the process can be repeated. Now an even more powerful QA-system X_{k+2} can be trained, by using X_{k+1} to answer simpler questions to provide approximate answers for training X_{k+2}. Systems may also be trained to find good subquestions, and for aggregating answers to subquestions into answer approximations. In addition to supervised learning, IDA can also be applied to reinforcement learning.
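A toy sketch of the amplify-then-distill loop, using "sum a list of numbers" as the hard question and list-splitting as the decomposition; the names and the caching stand-in for distillation are illustrative assumptions, not the paper's construction:

```python
def weak_system(question):
    """X_k: can only answer very small questions directly."""
    assert len(question) <= 2
    return sum(question)

def amplify(question, answerer):
    """Decompose Q into subquestions, answer them with `answerer`, then aggregate."""
    if len(question) <= 2:
        return answerer(question)
    mid = len(question) // 2
    sub_answers = [amplify(question[:mid], answerer), amplify(question[mid:], answerer)]
    return sum(sub_answers)                      # aggregation step

# "Distillation" stand-in: cache amplified answers so the next round treats them
# as a single cheap call (a real system would train X_{k+1} on these pairs).
distilled = {}
def stronger_system(question):
    key = tuple(question)
    if key not in distilled:
        distilled[key] = amplify(question, weak_system)
    return distilled[key]

print(stronger_system([3, 1, 4, 1, 5, 9, 2, 6]))  # 31, assembled from weak answers
```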
I have no idea why they included Drexler's CAIS -- but it is better than reading 300 pages of the original paper:
Drexler [2019] argues that the main safety concern from artificial intelligence does not come from a single agent, but rather from big collections of AI services. For example, one service may provide a world model, another provide planning ability, a third decision making, and so on. As an aggregate, these services can be very competent, even though each service only has access to a limited amount of resources and only optimizes a short-term goal.
The authors claim that the AI safety issues commonly discussed can be derived 'downstream' of modelling these systems more formally, using these causal influence diagrams. I disagree, due to the number of degrees of freedom the modeller is given when making these diagrams.
In the discussion section, the authors talk about the assumptions underlying the representations, and their limitations. They explicitly point out how the intentional stance may be limiting and may not model certain classes of AI systems or agents (hint: read their newer papers!).
Overall, the paper was an easy and fun read, and I loved the distillations of AI safety approaches in them. I'm excited to read papers by this group.
I want to differentiate between categories of capabilities improvement in AI systems, and here's the set of terms I've come up with to think about them:
Infrastructure improvements: capability boosts in the infrastructure that makes up an AI system. This involves software (PyTorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, and the physical environment where the infrastructure is situated. This is probably not the lowest-hanging fruit when it comes to capabilities acceleration.
Scaffolding improvements: Capability boost in an AI system that involves augmenting the AI system via software features. Think of it as keeping the CPU of the natural language computer the same, but upgrading its RAM and SSD and IO devices. Some examples off the top of my head: hyperparameter optimization for generating text, use of plugins, embeddings for memory. More information is in beren's essay linked in this paragraph.
Neural network improvements: Any capability boost in an AI system that specifically involves improving the black-box neural network that drives the system. This is mainly what SOTA ML researchers focus on, and is what has driven the AI hype over the past decade. This can involve architectural improvements, training improvements, finetuning afterwards (RLHF to me counts as capabilities acceleration via neural network improvements), etc.
There probably are more categories, or finer ways to slice the space of capability acceleration mechanisms, but I haven't thought about this in as much detail yet.
As far as I can tell, both capabilities augmentation and capabilities acceleration contribute to achieving recursive self-improving (RSI) systems, and once you hit that point, foom is inevitable.
Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.
Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI's agent foundations, Vanessa Kosoy and Diffractor's Infrabayesianism, and carado's formal alignment agenda. Research aimed at developing a more accurate blueprint, such as Nate Soares' 2022-now posts, Adam Shimi's epistemology-focused output, and John Wentworth's deconfusion-style output, also fall into this category.
Component-driven alignment agendas, on the other hand, begin with available components and seek to develop new pieces that work well with existing ones. They focus on making incremental progress by developing new components that can be feasibly implemented and integrated with existing AI systems or techniques to address the alignment problem. OpenAI's strategy, Deepmind's strategy, Conjecture's LLM-focused outputs, and Anthropic's strategy are examples of this approach. Agendas that serve as temporary solutions by providing useful components that integrate with existing ones, such as ARC's power-seeking evals, also fall under the component-driven category. Additionally, the Cyborgism agenda and the Accelerating Alignment agenda can be considered component-driven.
The blueprint-driven and component-driven categorization seems to me to be more informative than dividing agendas into conceptual and empirical categories. This is because all viable alignment agendas require a combination of conceptual and empirical research. Categorizing agendas based on the superficial pattern of their current research phase can be misleading. For instance, shard theory may initially appear to be a blueprint-driven conceptual agenda, like embedded agency. However, it is actually a component-driven agenda, as it involves developing pieces that fit with existing components.
Given the significant limitations of using a classifier to detect AI generated text, it seems strange to me that OpenAI went ahead and built one and threw it out for the public to try. As far as I can tell, this is OpenAI aggressively acting to cover its bases for potential legal and PR damages due to ChatGPT's existence.
For me this is slight positive evidence for the idea that AI governance may actually be useful in extending timelines, but only if it involves adversarial actions that act on the vulnerabilities of these companies. But even then, that seems like a myopic decision given the existence of other, less controllable actors (like China) racing as fast as possible towards AGI.
Jan Hendrik Kirchner now works at OpenAI, it seems, given that he is listed as the author of this blog post. I don't see this listed on his profile or on his substack or twitter account, so this is news to me.