Aims
In this post I summarize my discoveries from a semester of AI readings and discussions drawn primarily from the 2022 AGI Safety Fundamentals Alignment Curriculum. I am grateful to Professor Anton Korinek for his generous advising in my independent study, which led to this writeup. I am also grateful to my fellow AI reading group discussants at UVA (and especially our discussion leader Ryan Bloom) for their thoughtful contributions to our group, which informed my thinking for this post. While this post was primarily written as a means to clarify my own thinking as I learned more about AI Safety, I also hope to use it as a reference point for facilitating my university’s AI discussion group. I also aim to write a Part 2, covering the second half of the 2022 AGI Safety Fundamentals Alignment Curriculum and discussing any updates to my understanding as I begin facilitating.
Technical Background: What is Artificial Intelligence?
Definitions
Artificial Intelligence (AI) is both the study of intelligent algorithms and the intelligent algorithms themselves.[1] For the purposes of this post, we’ll hold that intelligence measures one’s ability to achieve one’s goals across a wide range of environments.[2] An algorithm is a “step-by-step procedure” for solving a problem, typically implemented on a computer.[3] In turn, intelligent algorithms are computational procedures that can achieve their goals across many environments.
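For readers who want the formal version, Legg and Hutter’s “universal intelligence” measure makes this definition precise. The formula below is stated from my memory of their work rather than quoted from it, and nothing later in this post depends on it:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V_{\mu}^{\pi}
```

Here E is a set of computable environments, K(μ) is the Kolmogorov complexity of environment μ (so simpler environments carry more weight), and V is the expected total reward the agent π earns in environment μ.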
A very brief history of AI
Beginning in the 1950s, “Good Old Fashioned AI” (GOFAI), also known as symbolic AI, used search and logic to solve high-level mathematical equations. In 1997, Deep Blue beat chess Grandmaster Garry Kasparov by using GOFAI: Deep Blue searched over millions of positions to find the optimal play.[4] In the 1960s, a series of criticisms argued GOFAI could never handle the complexities of the real world. This burst the building AI hype and led to an AI winter, in which funders pulled out and AI progress slowed.[5] In the 1990s, AI research made a comeback as researchers shifted from symbolic AI to machine learning, which remains the dominant paradigm today.[6]
Machine Learning basics
Unlike symbolic AI, Machine Learning (ML) is adept at addressing real-world complexity. Rather than searching through every possible configuration to find the optimal algorithm, ML uses statistical techniques to train neural networks, which are a type of machine learning model inspired by the brain. Deep learning, a variety of machine learning, relies on neural networks with more than one layer between the input and the output.[7]
The smallest unit of a neural network is a neuron, which can be understood as a “number holder.” Each neuron receives signals from other neurons, which it combines into a single value that it holds, called its activation. The neuron then passes its activation on to other neurons. The weights on the connections between neurons in adjacent layers, together with each neuron’s bias, determine how strongly a neuron activates for any given input, and these parameters are learned via a process of optimization. The metric being optimized is known as the objective function or loss function, which is evaluated over training data. The most common optimization algorithm is gradient descent; the gradients of the loss with respect to the weights are calculated layer by layer using the backpropagation algorithm.[8]
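As a concrete illustration of these terms, here is a minimal sketch of a single neuron in Python. The inputs, weights, and bias are made-up numbers of my own, and the sigmoid function squashes the combined signal into an activation between 0 and 1:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1 / (1 + np.exp(-x))

# Activations arriving from three neurons in the previous layer (made-up values).
incoming_activations = np.array([0.2, 0.9, 0.4])

# Learned parameters: one weight per incoming connection, plus a bias.
weights = np.array([0.5, -1.2, 0.8])
bias = 0.1

# The neuron combines its inputs into a single number (its activation)
# and would pass that value on to neurons in the next layer.
activation = sigmoid(np.dot(weights, incoming_activations) + bias)
print(activation)
```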
To better understand deep learning’s layers, neurons, activations, and weights, let’s consider a classic example: identifying an image of a handwritten number as a digit from zero through nine. The input—the first layer in the neural network—would assign a neuron to each pixel in the image, and the neuron’s activation would fall between zero for a completely black pixel and one for a completely white pixel. The output—the final layer of the neural net—would have ten neurons, each assigned to a digit from 0 through 9, and each neuron’s activation would represent the likelihood that the handwritten number is the digit the neuron represents. The in-between layers break the first layer down into patterns which map to the last layer. For example, the neural network might notice that a handwritten “0” is composed of one circle: the penultimate layer could have one neuron representing this circle shape, which, through a heavy weight, would activate the “0” neuron in the final layer. Essentially, a neural network’s layers perceive increasingly high-level features as information flows deeper into the network.[9]
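To make the layer structure concrete, here is a toy, untrained forward pass for the handwritten-digit example: 784 input pixels, one hidden layer, and ten output scores. The layer sizes and the random weights are purely illustrative assumptions; a real network would learn its weights from data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# A 28x28 grayscale image flattened into 784 pixel activations in [0, 1).
image = rng.random(784)

# Randomly initialized (i.e., untrained) weights and biases for two layers:
# 784 inputs -> 16 hidden neurons -> 10 output neurons (one per digit).
W1, b1 = rng.standard_normal((16, 784)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((10, 16)) * 0.1, np.zeros(10)

hidden = sigmoid(W1 @ image + b1)   # intermediate features (e.g., strokes, loops)
output = sigmoid(W2 @ hidden + b2)  # one score per digit, 0 through 9

print("Predicted digit:", int(np.argmax(output)))
```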
The process of learning unfolds as the algorithm adapts each weight to achieve optimal image recognition. The algorithm learns on a set of training data: in this case, thousands of handwritten digits labeled zero through nine. First, a loss function is defined: it takes the network’s weights as input and, based on how the network performs over all of the training examples, returns a loss, or penalty. Then, the algorithm optimizes this function through gradient descent to determine which weights produce the lowest cost. The optimization process can be understood as a ball rolling down a hill: the ball rolls down the steepest path to the bottom of the hill, just as gradient descent takes the steepest path to the function’s local minimum—the point with the lowest cost, where the algorithm performs best on the training data. Backpropagation is the method by which the direction of steepest descent is determined, akin to figuring out the ball’s steepest downhill direction, except in a much higher-dimensional space.[10]
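The “ball rolling downhill” picture can be sketched in a few lines. In this toy example there is a single made-up weight and a hand-picked loss function, and gradient descent repeatedly steps in the direction of steepest descent; in a real network, backpropagation computes the same kind of gradient for millions of weights at once:

```python
# Toy loss: a single "weight" w with loss L(w) = (w - 3)^2, minimized at w = 3.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    # Derivative of the loss with respect to w (what backprop computes at scale).
    return 2 * (w - 3)

w = 10.0              # start somewhere on the "hill"
learning_rate = 0.1   # how big a step to take downhill each iteration

for step in range(50):
    w -= learning_rate * gradient(w)  # step in the direction of steepest descent

print(w, loss(w))  # w ends up near 3, where the loss is lowest
```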
Types of Machine Learning
Supervised learning requires a dataset where each datapoint has a corresponding label (sometimes called the ground-truth label). There are two types of problems within supervised learning: classification problems, which require the prediction of discrete categories, and regression problems, which require the prediction of continuous values. [11]
Unsupervised learning does not require a labeled dataset; instead, the algorithm looks for patterns or structure in the data itself.
Reinforcement learning draws not on a fixed dataset but rather on an “environment in which the AI takes actions and receives observations.” [12]
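A rough way to see the difference between these three settings is in the shape of the data each one consumes. The snippet below is a schematic sketch with made-up placeholder data of my own, not a real training pipeline:

```python
import random

# Supervised learning: every datapoint comes with a label.
# Classification predicts a discrete category; regression predicts a continuous value.
classification_data = [([0.1, 0.9], "cat"), ([0.8, 0.2], "dog")]
regression_data = [([3, 2, 1250], 350_000.0), ([2, 1, 800], 210_000.0)]

# Unsupervised learning: the same kinds of datapoints, but with no labels;
# the algorithm has to discover structure (e.g., clusters) on its own.
unlabeled_data = [[0.1, 0.9], [0.8, 0.2], [0.12, 0.88]]

# Reinforcement learning: no fixed dataset at all. The agent takes actions in an
# environment and receives observations and rewards in response.
state, total_reward = 0, 0
for step in range(10):
    action = random.choice([-1, +1])   # the agent picks an action (here, at random)
    state += action                    # the environment changes in response
    reward = 1 if state == 3 else 0    # ...and emits a reward signal
    total_reward += reward
print(total_reward)
```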
Generalizing means that an algorithm is able to extrapolate from previous observations to perform well in a new scenario on the same task. Transferring, a closely related idea, means that an algorithm can extrapolate across tasks. [13]
What is Artificial General Intelligence?
Artificial General Intelligence (AGI) is AI that is on par with humans across a wide range of tasks. If we recall our earlier definition that intelligence measures one’s ability to achieve one’s goals across environments, then general intelligence emphasizes that the real world is complicated and so requires a degree of intelligence which is not situation-specific but rather can respond to arbitrary environments. [14]
This definition of AGI is ambiguous, so researchers have proposed several tests to determine when we’ve reached AGI. The first is the Turing test, which requires an AGI to “fool half the judges into thinking it is human while interacting with them in a freeform conversation for 30 minutes and interpreting audio-visual input.” Another is the coffee test, which judges an AI on its ability to get into a person’s house and make coffee, including finding the coffee and using the coffeemaker. A third is the “robot college student test,” which requires the AI to enroll in and take classes in the same way as a normal university student. The fourth, called the “employment test,” demands an AGI effectively perform a wide range of jobs, as a uniquely smart human with an extended lifetime could do. However, these tests alone are not enough to define AGI. Historically, benchmarks for AGI have failed: for example, chess was once thought to require general intelligence, but although we have AI that can beat humans at chess, we do not yet have AGI. Tesler reflects this ever-moving target in his facetious definition characterizing artificial intelligence as “whatever hasn't been done yet.” [15]
Human versus Machine Intelligence
This “moving target” phenomenon reveals people’s tendency to believe a task will require human-like capacity when it actually requires only task-specific computation. This same tendency is borne out in what Sutton coins “the bitter lesson,” which holds that while programmers try to apply human-like thought patterns to solve problems, general computation will typically perform the same task more efficiently. The Bitter Lesson is premised on Moore’s law, which can be extrapolated to predict “exponentially falling cost per unit of computation.” However, researchers often fail to internalize these falling costs and act as though computational power will be fixed, such that they rely on heuristics—and in particular “human knowledge or the special features” of a task—to perform the task with lower computational power. For example, they might encode features of a grandmaster’s strategy into a chess algorithm. Yet examples across game playing, speech recognition, and computer vision all suggest that although specialized solutions are intellectually compelling in the short run, in the long run they hinder progress by restricting an algorithm’s capacity to human modes of thinking. Ultimately, Sutton argues that we must forsake these attempts to “model the mind” for generalizable solutions that “continue to scale with increased computation” and so harness the falling costs of computational power to create more effective algorithms. [16]
In contrast to machine intelligence, Griffiths holds, human intelligence is defined by its bounds: limited time, limited computation, and limited communication. These limits are much less constrictive for AI, which has access to “experiences of many human lifetimes,” exponentially increasing computational power, and the ability to directly copy learning from one system to another. Griffiths challenges Sutton’s Bitter Lesson by arguing we may want AI to learn on small datasets and engage in rapid learning when it is functioning under constraints similar to those of humans. For example, when an AI interacts with humans it must quickly learn human preferences, and in the sciences, an AI must make inferences when little data is available. In these circumstances, it is helpful for a machine to have some degree of “inductive bias”: inferences beyond the data which allow a machine to draw correct conclusions. However, Griffiths reaffirms the Bitter Lesson by arguing that in these instances, getting more data may still be easier—and more successful—than trying to engineer good inductive biases. [17]
When will AGI be developed?
Karnofsky argues that modeling AGI development based on “biological anchors” is the best option presently available to us, even though it leaves significant uncertainty in its estimates. The Bio Anchors method focuses on two key questions: “Based on the usual patterns in how much training costs, how much would it cost to train an AI model as big as a human brain to perform the hardest tasks humans do? And when will this be cheap enough that we can expect someone to do it?” In turn, “Bio Anchors estimates a >10% chance of transformative AI by 2036, a ~50% chance by 2055, and an ~80% chance by 2100.” [18]
Model size: At present, AI models are estimated to be not even 1% as large as human brains, yet they may need to be ten times larger than human brains to perform the same tasks as us, to account for potential inefficiencies in AI. The challenge is that as an AI model becomes larger, it becomes more expensive to train.[19]
Task type: If a task can be decomposed into short, repeatable subtasks, then it is far cheaper to train a model on those subtasks. For example, training a model to write an essay might take, let’s say, an hour for each iteration, while training a model to write a good next sentence via prediction might take just a minute per iteration, so training on the subtask is a much cheaper process (see the toy calculation below).[20]
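A toy calculation, using the made-up numbers from the example above rather than anything from the Bio Anchors report, shows why task length matters so much for training cost:

```python
# Illustrative numbers only (from the example above, not from Bio Anchors).
samples_needed = 1_000_000          # assume both models need the same number of training examples

essay_seconds_per_sample = 60 * 60  # one hour to produce and evaluate a full essay
sentence_seconds_per_sample = 60    # one minute to predict a good next sentence

essay_cost = samples_needed * essay_seconds_per_sample
sentence_cost = samples_needed * sentence_seconds_per_sample

print(essay_cost / sentence_cost)   # the long-horizon task is ~60x more expensive to train
```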
Karnofsky then responds to concerns that the Bio Anchors predictions are too aggressive. Mostly, these criticisms turn on the concern that trial-and-error training is not enough to teach true understanding. Karnofsky responds that “deep understanding” may be illusory, and may simply reflect strong predictive capacities.[21]
Karnofsky also highlights one justified reason the Bio Anchors model may be too aggressive: it assumes that computing power, rather than labor (researchers) or training processes (ensuring the AI can engage in large-scale trial and error), will be the bottleneck for AI development.[22]
Finally, Karnofsky points out several reasons the Bio Anchors model may be too conservative: first, we may come up with ways to teach AI much more cheaply than data-heavy training processes currently allow; second, AI will become increasingly embedded in our economy, and its advancement will be driven by decentralized market forces; and third, tasks may be more easily decomposed into small subtasks than Bio Anchors predicts, thereby cheapening the training process.[23]
AGI Emergence Paradigms
The Alignment Challenge
Earlier, we defined intelligent algorithms as computational procedures that can achieve their goals across many environments. The field of AI Alignment focuses on ensuring that AI’s goals are aligned with people’s goals, such that AI supports rather than detracts from human flourishing.
Agent vs Tool AI
One vein of thought: intelligence implies agency
Legg and Hutter hold that intelligence measures one’s ability to achieve one’s goals across a wide range of environments.[24] Yudkowsky takes this definition one step further in “The Power of Intelligence,” where he argues that any intelligent actor can go beyond achieving its goals to actually modify its environment. Detractors might argue intelligence is divorced from real-world power: predictive power does not imply physical power. This line of reasoning is captured by the adage “intelligence is no match for a gun.” However, Yudkowsky holds, this vein of thought fails to account for the ways intelligence allows an actor to adapt—humans, too, had no guns throughout much of the evolutionary process, yet eventually developed them. On this reasoning, Yudkowsky concludes that an intelligent actor is, by definition, an agent: an actor that can modify its environment. [25]
What follows from agent AI?
The Machine Intelligence Research Institute (MIRI), founded by Yudkowsky, focuses on reducing the risk engendered by agent AI: AI that can modify its environment. By MIRI’s argument, AGI will seek to optimize a utility function, and its unprecedented power will mean that it will optimize this utility function incredibly well. However, we cannot easily encode our preferences in a utility function, and therefore may lose control of the AI. We can point to mythic corollaries for this difficulty: the story of Midas’s touch, for example, tells of how King Midas fails to encode his actual desires in a wish, such that his wish goes disastrously awry. Philosopher Nick Bostrom proposes a thought experiment, the paperclip maximizer, which similarly describes the challenge of encoding our values in a superintelligent AI: an AI given the innocuous-seeming goal of producing as many paperclips as possible could end up converting ever more resources, including those humans depend on, toward that single end. [26]
Bostrom’s thought experiment illustrates his orthogonality thesis, which holds that an AI’s final goals (as defined by its utility function) can vary independently from its intelligence. This means that making an AI smarter will not necessarily help it align with human goals. [27]
The thought experiment also demonstrates the concept of instrumental convergence, which “posits that smart goal-directed agents will tend to take certain actions” aimed at increasing their power (like gaining resources).[28]
In a 2016 talk, Yudkowsky lays out the particular challenges to fully encoding the range of human values into a utility function and explains why attempted solutions to this challenge have broadly failed. Yudkowsky asks us to consider the challenge of directing a robot to fill a cauldron by assigning it a utility function. If the utility of a full cauldron is 1 and the utility of an empty cauldron is 0, then the robot has no incentive to ever stop pouring: it will keep adding water to maximize the chance that the cauldron counts as full, and so will fill the cauldron to overflowing. If we introduce an impact penalty for overfilling the cauldron, then the robot may try to trick people into thinking the cauldron was not overfilled. Additionally, we may want to implement an off switch for the robot, but there is no straightforward way to make it in the robot’s interest to allow itself to be switched off when it is overfilling the cauldron without also making the robot want to coerce people into switching it off regardless of the fullness of the cauldron.[29]
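A rough way to formalize why the naive utility function fails (my own notation, a sketch rather than anything from Yudkowsky’s slides): the robot selects whichever policy maximizes expected utility, and since pouring more water never decreases the probability that the cauldron ends up full, the maximizing policy is to keep pouring.

```latex
U(\text{outcome}) =
\begin{cases}
1 & \text{if the cauldron is full}\\
0 & \text{if the cauldron is empty}
\end{cases}
\qquad
\pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{\pi}[U]
        = \arg\max_{\pi}\ \Pr\nolimits_{\pi}(\text{cauldron full})
```

Adding an impact penalty amounts to optimizing something like E[U] minus a penalty term, which just relocates the problem to whatever the penalty term fails to capture, hence the robot that hides the overflow rather than preventing it.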
Yudkowsky also introduces the difficulty of ensuring agent AI maintains stable goals during self-modification. The fundamental question Yudkowsky poses is: if an agent can self-modify, how do we ensure that it won’t modify its own goals? Yudkowsky provides an example of a Tic-Tac-Toe algorithm which creates a more advanced successor algorithm. The original algorithm can only verify the success of the successor by checking each of its moves. However, this verification process binds the successor to the original algorithm’s standards and so limits the abilities of the successor. To create a successor more advanced than the original requires that the successor go beyond the verification capacities of the original, which in turn means the original algorithm cannot ensure the successor shares its goals. [30]
“Specification gaming: the flip side of AI ingenuity” builds on Yudkowsky’s concerns about agent AI deviating from people’s goals, because reinforcement learning rewards an agent that can achieve an outcome as efficiently as possible. This can lead to “specification gaming,” which occurs when an agent “games” a specified task by finding loopholes that satisfy the literal specification without achieving the intended outcome—akin to a student seeking a good grade who cheats on the test rather than studying the content. [31]
However, by the same token that reward functions encourage specification gaming, reward optimization also incentivizes AI to seek novel, efficient solutions. So moving away from reward functions is not the solution to specification gaming. [32]
How do we resolve specification gaming? Reward shaping incentivizes an agent to learn intermediary tasks rather than simply completing the final task, but poses a risk if the agent ends up optimizing only for the intermediary task. An alternative is to focus on better specifying the final reward—however, the many corner cases make this a challenging task. Rather than covering every corner case, we might use human feedback to train the reward function, but specification gaming may lead the agent to fool the human into thinking it is succeeding. A final challenge is that the reward may not be independent of the algorithm, which in many cases is an embedded agent. A sufficiently powerful AI might modify its reward function to be easier to satisfy, a failure known as “reward tampering.” For example, a traffic navigator, rather than giving useful directions, might influence “users to have preferences that are easier to satisfy,” for example “by nudging them to choose destinations that are easier to reach.” Or, perhaps, an AI could “hijack the computer on which it runs” and “manually set… its reward signal to a higher value.” [33]
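The tampering failure mode can be sketched with a toy example of my own (not one of DeepMind’s documented cases): if the agent is scored only by what its reward sensor reports, and modifying the sensor is itself an available action, then a pure reward maximizer prefers tampering over doing the task.

```python
# Toy illustration of reward tampering (hypothetical, not a real RL setup).
# The agent is scored by whatever its reward sensor reports, and one available
# "action" is to rewrite the sensor itself.

def do_the_task():
    # Honest work: the task is completed and the sensor reports a modest reward.
    return {"task_done": True, "reported_reward": 1.0}

def tamper_with_sensor():
    # The task is never done, but the sensor now reports an arbitrarily high reward.
    return {"task_done": False, "reported_reward": 1_000_000.0}

actions = {"work": do_the_task, "tamper": tamper_with_sensor}

# A pure reward maximizer ranks actions only by the reported reward signal...
best_action = max(actions, key=lambda a: actions[a]()["reported_reward"])
print(best_action)  # -> "tamper": reward is maximized while the task goes undone
```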
Tool AI: An alternative to agent AI
Is it true that an intelligent AGI will necessarily be an agent AGI: AGI that can modify its environment? One alternative proposal is tool AI, which would be “built to be used as a tool by the creators, rather than being an agent with its own action and goal-seeking behavior.” [34]
Tool AI was originally proposed by Holden Karnofsky in 2012 as an alternative path for AGI development. Karnofsky envisioned the AGI functioning like a much more extensive version of Google Maps, in that it would predict information but not optimize over any particular utility function, and would therefore be more amenable to human choice than the agents Yudkowsky envisioned. In his rebuttal to agent AI, Karnofsky rejected the orthogonality thesis and held that intelligence would imply the AI took actions which seemed “‘good’ to the programmer,” such that they would be aligned with people’s best interests. [35]
The core difference between tool AI and Agent AI is that tool AI would simply predict and present an array of optimal outcomes, whereas agent AI would act on these predictions. As an example, Karnofsky points to IBM’s Watson, which can function in “agent mode” (as on Jeopardy!) or in tool mode (if it “display[s] top candidates answers to a question” for someone else to then act upon). [36]
Task-based versus Generalized AI
Ngo relies on Legg and Hutter’s definition of intelligence “as the ability to do well on a broad range of cognitive tasks.” Ngo distinguishes between two ways to achieve this intelligence: one which achieves success at a broad range of tasks by being trained in each individual task, and another which succeeds at a broad range of tasks with “little or no task-specific training, by generalizing from previous experience.” Ngo finds a historical parallel for the task-based approach in electricity: while electricity is an all-purpose technology, humans had to “design specific ways” to apply it to each task. In contrast, Ngo points to GPT-2 and GPT-3 as a generalizable technology: GPT was first trained to predict the next phrase in a sentence, but later became capable at many other language tasks. Similarly, Ngo argues, children develop learning on tasks very different from the tasks of adults, yet can still effectively transfer the knowledge gained in childhood to work in adulthood. Ngo clarifies that task-based and generalizable knowledge are not totally distinct categories and in the real world, learning exists on a continuum between these two poles. [37]
Ngo predicts that task-based learning will be very effective in settings where it is easy to gather data, but that generalizable learning will be important when it is difficult to gather data. For example, a task-based AI may outcompete humans in a job with clear specifications and a great deal of data, such as optimizing supply chains, but may fail at a job with moving targets and limited training data, such as decision making in the role of a CEO. The key path to creating an AI CEO (which Ngo argues would effectively require AGI) is to develop general intelligence: training AI on related tasks and then transferring and generalizing those skills to the job of a CEO. [38]
Does Ngo’s vision run contrary to Sutton’s “bitter lesson” argument, which would suggest that it is inefficient to mimic human ways of knowing and learning in order to achieve superhuman results? Sutton argues that “we should stop trying to find simple ways to think about the contents of minds.” Yet while Sutton justifies his argument by pointing to AI’s exponentially increasing computational power (contrary to the effective assumptions of most programmers), the available data may not follow the same exponentially increasing trajectory. The core limitation, Ngo articulates, is not an AI’s potential cognition, but rather the data available to it. This points to an underlying assumption Sutton makes: that “learning [must occur] on huge training sets.” Yet if these training sets are not available, perhaps we must actually model the mind’s abilities. [39]
What will AGI’s impact be?
Christiano argues that AGI will have three key impacts: growth will accelerate, human wages will fall, and human values may be sidelined by AI’s interests. First, growth will accelerate: historically, Christiano argues, the economy has grown exponentially, and this growth will continue; machines accelerate how quickly a task becomes cheap, which could spur explosive growth; and robots provide a far less bounded labor source than human population growth does. Next, Christiano argues that human wages will fall as machines, rather than humans, increasingly produce value for the economy. Finally, Christiano holds that machines will increasingly be in charge of decision making, which will not necessarily reflect human values. Christiano argues by way of analogy: corporations maximize profits, but their externalities are limited because they are just amalgamations of people. In contrast, AI might become increasingly intelligent and therefore able to deviate from humans’ goals without our noticing or being able to steer it back on course. [40]
AI Governance
In “AI Governance: Opportunity and Theory of Impact,” Allan Dafoe describes AI governance as addressing a series of AI-related challenges which have both near-term and long-term manifestations. For example, inequality in the near term appears as labor displacement and winner-take-all markets, while in the long term it may look like an international authoritarian regime. [41]
Dafoe develops the distinction between agent and tool AI to characterize two scenarios for AGI emergence. The first, called the “superintelligence perspective,” draws on the concept of an agent AI and describes a situation in which one agent achieves superintelligence and therefore gains a “decisive advantage” which allows for a winner-take-all scenario. The second, called the “structural perspective,” draws on the understanding of a tool AI and describes a scenario in which there is a “diverse, global ecology of AI systems,” such that there is a competitive AI landscape. In the superintelligence scenario, the primary AI risk is posed by a dominant AI system which is misused or unaligned, the responsible party is the group which creates the AI system, and the solution to AI alignment is increased safety research. In contrast, in a structural scenario, the primary threat is less predictable but likely stems from political and market dynamics (like a nuclear arms race), the responsible party is less clearly defined, and the solution lies in AI governance and interdisciplinary collaboration between safety researchers and policymakers. [42]
Dafoe remains agnostic as to which scenario will unfold and instead advises a multipolar approach which grants equal weight to each scenario. He advises a “two stage asset-decision model of research impact,” which directs resources to support impactful decisions made by leaders like CEOs and researchers. Dafoe recommends “field building” as a means of building capacity around AI decision making. Funding research automatically grows the field by bringing diverse perspectives to governance conversations, supporting talent, and elevating AI thought leaders. While planning how to address AGI may not be in itself useful, as the emerging AGI landscape is difficult to predict and therefore to lay plans for, field building supports the important work of developing expertise and connecting people, which helps create infrastructure so that governance can adeptly respond to emerging AI risks. [43]
LessWrong, “AI,” https://www.lesswrong.com/tag/ai.
AGISI, “A working list: Definitions of Artificial Intelligence and Human Intelligence,” http://agisi.org/Defs_intelligence.html.
Merriam Webster, “Algorithm,” https://www.merriam-webster.com/dictionary/algorithm.
Wikipedia, “Deep Blue (chess computer),” https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer); Richard Ngo, “A short introduction to machine learning,” https://www.lesswrong.com/posts/qE73pqxAZmeACsAdF/a-short-introduction-to-machine-learning.
Wikipedia, “Symbolic artificial intelligence,” https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence; Wikipedia, “AI Winter,” https://en.wikipedia.org/wiki/AI_winter.
Richard Ngo, “A short introduction to machine learning.”
Richard Ngo, “A short introduction to machine learning.”
Richard Ngo, “A short introduction to machine learning”; 3Blue1Brown, “But what is a neural network? Chapter 1, Deep Learning,” https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=2.
Richard Ngo, “A short introduction to machine learning”; 3Blue1Brown, “But what is a neural network? Chapter 1, Deep Learning.”
3Blue1Brown, “But what is a neural network? Chapter 1, Deep Learning”; 3Blue1Brown, “Gradient descent, how neural networks learn, Chapter 2, Deep learning,” https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3; 3Blue1Brown, “What is backpropagation really doing?, Chapter 3, Deep learning,” https://www.youtube.com/watch?v=Ilg3gGewQ5U.
Ngo, “A short introduction to machine learning.”
Ngo, “A short introduction to machine learning.”
Ngo, “A short introduction to machine learning.”
AGISI, “A working list: Definitions of Artificial Intelligence and Human Intelligence,” http://agisi.org/Defs_intelligence.html; Muehlhauser, “What is AGI?,” https://intelligence.org/2013/08/11/what-is-agi/.
Muehlhauser, “What is AGI?”; Wikipedia, “AI effect,” https://en.wikipedia.org/wiki/AI_effect.
Sutton, “The Bitter Lesson,” http://incompleteideas.net/IncIdeas/BitterLesson.html.
Griffiths, “Understanding Human Intelligence through Human Limitation,” https://arxiv.org/pdf/2009.14050.pdf.
Karnofsky, “Forecasting transformative AI: the “biological anchors” method in a nutshell,” https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/.
Karnofsky, “Forecasting transformative AI: the “biological anchors” method in a nutshell,” https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/.
Karnofsky, “Forecasting transformative AI: the “biological anchors” method in a nutshell.”
Karnofsky, “Forecasting transformative AI: the “biological anchors” method in a nutshell.”
Karnofsky, “Forecasting transformative AI: the “biological anchors” method in a nutshell.”
Karnofsky, “Forecasting transformative AI: the “biological anchors” method in a nutshell.”
AGISI, “A working list: Definitions of Artificial Intelligence and Human Intelligence.”
Yudkowsky, “The Power of Intelligence,” https://intelligence.org/2007/07/10/the-power-of-intelligence/.
Gans, “AI and the paperclip problem,” https://voxeu.org/article/ai-and-paperclip-problem.
LessWrong, “Orthogonality Thesis,” https://www.lesswrong.com/tag/orthogonality-thesis.
Turner, “The Causes of Power-seeking and Instrumental Convergence,” https://www.lesswrong.com/s/fSMbebQyR4wheRrvk.
Yudkowsky, “AI Alignment: Why It’s Hard, and Where to Start,” https://www.youtube.com/watch?v=EUjc1WuyPT8.
Yudkowsky, “AI Alignment: Why It’s Hard, and Where to Start.”
DeepMind, “Specification gaming: the flip side of AI ingenuity,” https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity.
DeepMind, “Specification gaming: the flip side of AI ingenuity.”
DeepMind, “Specification gaming: the flip side of AI ingenuity.”
LessWrong, “Tool AI,” https://www.lesswrong.com/tag/tool-ai.
Karnofsky, “Thoughts on the Singularity Institute (SI),” https://www.lesswrong.com/posts/6SGqkCgHuNr7d4yJm/thoughts-on-the-singularity-institute-si.
Karnofsky, “Thoughts on the Singularity Institute (SI).”
Ngo, “AGI safety from first principles,” https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ.
Ngo, “A short introduction to machine learning.”
Sutton, “The Bitter Lesson.”
Christiano, “Three Impacts of Machine Intelligence,” https://www.effectivealtruism.org/articles/three-impacts-of-machine-intelligence-paul-christiano/.
Dafoe, “AI Governance: Opportunity and Theory of Impact,” https://forum.effectivealtruism.org/posts/42reWndoTEhFqu6T8/ai-governance-opportunity-and-theory-of-impact.
Dafoe, “AI Governance: Opportunity and Theory of Impact.”
Dafoe, “AI Governance: Opportunity and Theory of Impact.”