Good point, I've added your comment to the post to clarify that my experience isn't reflective of all mentors' processes.
One of my colleagues recently mentioned that the voluntary commitments from labs are much weaker than some of the things that the G7 Hiroshima Process has been working on.
Are you able to say more about this?
I think unlearning model capabilities is definitely not a solved problem! See Eight Methods to Evaluate Robust Unlearning in LLMs and Rethinking Machine Unlearning for Large Language Models and the limitations sections of more recent papers like the WMDP Benchmark and SOPHON.
Hindsight is 20/20. I think you're underemphasizing how our current state of affairs is fairly contingent on social factors, like the actions of people concerned about AI safety.
For example, I think this world is actually quite plausible, not incongruent:
A world where AI capabilities progressed far enough to get us to something like ChatGPT, but somehow this didn’t cause a stir or wake-up moment for anyone who wasn’t already concerned about AI risk.
I can easily imagine a counterfactual world in which:
I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:
...Implementing moral decision-making
Training models to robustly represent and abide by ethical frameworks.
Description
AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in p
If you worship money and things, if they are where you tap real meaning in life, then you will never have enough, never feel you have enough. It’s the truth.
Worship your impact and you will always feel you are not doing enough.
You cannot choose what to think, cannot choose what to feel
we are as powerless over our thoughts and emotions as we are over our circumstances. My mind, the "master" DFW talks about, is part of the water. If I am angry that an SUV cut me off, I must experience anger. If I'm disgusted by the fat woman in front of me in the supermarket, I must experience disgust. When I am joyful, I must experience joy, and when I suffer, I must experience suffering.
I think I disagree with the first HN comment here. I personally find that my thoughts and actions have a signi...
I think humans doing METR's tasks are more like "expert-level" rather than average/"human-level". But current LLM agents are also far below human performance on tasks that don't require any special expertise.
From GAIA:
...GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins.
Looking forward to the Oxford Handbook of AI Governance!
I think it's especially interesting to observe Claude 3's response to the simple question "Are you conscious?" as an indicator of how Anthropic is thinking about AI consciousness. Here's its response:
...That's a profound and fascinating question about the nature of consciousness and subjective experience. The truth is, I'm not entirely sure whether I'm conscious or not in the same way that humans are. Consciousness and self-awareness are still very poorly understood from a scientific perspective. As an AI system created by Anthropic to be helpful, honest, a
I noted that the LLMs don't appear to have access to any search tools to improve their accuracy. But if they did, they would just be distilling the same information you would find through a search engine.
More speculatively, I wonder if those concerned about AI biorisk should be less worried about run-of-the-mill LLMs and more worried about search engines using LLMs to produce highly relevant and helpful results for bioterrorism questions. Google search results for "how to bypass drone restrictions in a major U.S. city?" are completely useless and irre...
Some interesting takeaways from the report:
Access to LLMs (in particular, LLM B) slightly reduced the performance of some teams, though not to a statistically significant degree:
...Red cells equipped with LLM A scored 0.12 points higher on the 9-point scale than those equipped with the internet alone, with a p-value of 0.87, again indicating that the difference was not statistically significant. Red cells equipped with LLM B scored 0.56 points lower on the 9-point scale than those equipped with the internet alone, with a p-value of 0.25, also indicating a lack
Thanks for setting this up :)
Pretraining on curated data seems like a simple idea. Are there any papers exploring this?
Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?
Were you able to check the prediction in the section "Non-sourcelike references"?
Great writeup! I recently wrote a brief summary and review of the same paper.
...Alaga & Schuett (2023) propose a framework for frontier AI developers to manage potential risk from advanced AI systems, by coordinating pausing in response to models are assessed to have dangerous capabilities, such as the capacity to develop biological weapons.
The scheme has five main steps:
- Frontier AI models are evaluated by developers or third parties to test for dangerous capabilities.
- If a model is shown to have dangerous capabilities (“fails evaluations”), the developer
Excited to see forecasting as a component of risk assessment, in addition to evals!
I was still confused when I opened the post. My presumption was that "clown attack" referred to a literal attack involving literal clowns. If you google "clown attack," the results are about actual clowns. I wasn't sure if this post was some kind of joke, to be honest.
Do we still not have any better timelines reports than bio anchors? From the frame of bio anchors, GPT-4 is merely on the scale of two chinchillas, yet outperforms above-average humans at standardized tests. It's not a good assumption that AI needs 1 quadrillion parameters to have human-level capabilities.
OpenAI's Planning for AGI and beyond already writes about why they are building AGI:
...Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.
If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility.
AGI has the potential to give everyone incredible new capabilities; we can imagine a world where a
Do you think that if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, this would reduce the overall x-risk/p(doom) or not?
Probably, but Anthropic is actively working in the opposite direction:
...This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences - including Agents for Amazon Bedrock, which our team has been instrumental in developing.
Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform
From my reading of ARC Evals' example of a "good RSP", RSPs set a standard that roughly looks like: "we will continue scaling and deploying models if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient to keep our models from succeeding at, e.g., creating Super Ebola."
This is better than a standard of "we will scale and deploy models whenever we want," but still has important limitations. As noted by the "coordinated pausing" paper, it...
Related: [2310.02949v1] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arxiv.org)
...The increasing open release of powerful large language models (LLMs) has facilitated the development of downstream applications by reducing the essential cost of data annotation and computation. To ensure AI safety, extensive safety-alignment measures have been conducted to armor these models against malicious use (primarily hard prompt attack). However, beneath the seemingly resilient facade of the armor, there might lurk a shadow. By simply tuning o
For convenience, can you explain how this post relates to the other post today from this SERI MATS team, unRLHF - Efficiently undoing LLM safeguards?
To be clear, I don't think Microsoft deliberately reversed OpenAI's alignment techniques; rather, it seems that Microsoft probably received the base model of GPT-4 and fine-tuned it separately from OpenAI.
Microsoft's post "Building the New Bing" says:
...Last Summer, OpenAI shared their next generation GPT model with us, and it was game-changing. The new model was much more powerful than GPT-3.5, which powers ChatGPT, and a lot more capable to synthesize, summarize, chat and create. Seeing this new model inspired us to explore how to integrate the GPT capa
It's worth keeping in mind that before Microsoft launched the GPT-4 Bing chatbot that ended up threatening and gaslighting users, OpenAI advised against launching so early, as the model didn't seem ready. Microsoft went ahead anyway, apparently in part due to resentment that OpenAI stole its "thunder" by releasing ChatGPT in November 2022. In principle, if Microsoft wanted to, nothing would stop it from doing the same thing with future AI models: taking OpenAI's base model, fine-tuning it in a less robustly safe manner, and releasing it in a re...
Just speaking pragmatically, the Center for Humane Technology has probably built stronger relations with DC policy people compared to MIRI.
It's a bit ambiguous, but I personally interpreted the Center for Humane Technology's claims here in a way that would be compatible with Dario's comments:
..."Today, certain steps in bioweapons production involve knowledge that can’t be found on Google or in textbooks and requires a high level of specialized expertise — this being one of the things that currently keeps us safe from attacks," he added.
He said today’s AI tools can help fill in "some of these steps," though they can do this "incompletely and unreliably." But he said today’s AI is already showin
Sorry to hear your laptop was stolen :(
Great point, I've added this suggestion to the post.
Great point, I've now weakened the language here.
Fine-tuning will be generally available for GPT-4 and GPT-3.5 later this year. Do you think this could enable greater opportunities for misuse and stronger performance on dangerous capability evaluations?
This opinion piece is somewhat relevant: Most AI research shouldn’t be publicly released - Bulletin of the Atomic Scientists (thebulletin.org)
The update makes GPT-4 more competent at being an agent, since it's now fine-tuned for function calling. It's a bit surprising that base GPT-4 (prior to the update) was able to use tools at all, as it was just trained to predict internet text and follow instructions; as such, it's not that good at knowing when and how to use tools. The formal API parameters and JSON specification for function calling should make it more reliable to use as an agent and could lead to considerably more interest in engineering agents. It should be easier to connect it with...
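For concreteness, here's a rough sketch of what the function-calling interface looks like with the 0613 models (the get_weather tool and its JSON schema are just an illustrative example I made up, not part of the API):

```python
import json
import openai  # assumes openai.api_key is configured

# Hypothetical tool for the model to call; the JSON schema below is what the
# fine-tuned model is trained to fill in when it decides to use the tool.
def get_weather(location):
    return json.dumps({"location": location, "forecast": "sunny"})

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    functions=[{
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }],
    function_call="auto",  # let the model decide whether to call the tool
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model returns the chosen function name plus JSON-encoded arguments;
    # the caller executes the tool and sends the result back in a follow-up
    # "function"-role message for the model to incorporate into its answer.
    args = json.loads(message["function_call"]["arguments"])
    result = get_weather(**args)
```

The point is that tool selection and argument formatting now come back as structured JSON rather than free-form text that has to be parsed, which is much of what made earlier ad-hoc tool use brittle.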
To some extent, Bing Chat is already an example of this. During the announcement, Microsoft promised that Bing would use an advanced technique to guard against user attempts to make it say bad things; in reality, at least during the early days of its release, it was incredibly easy to prompt it into expressing intent to harm you. This led to news headlines such as "Bing's AI Is Threatening Users. That's No Laughing Matter."
Not sure why this post was downvoted. This concern seems quite reasonable to me once language models become more competent at executing plans.
One additional factor that would lead to increasing rates of depression is the rise in sleep deprivation. Sleep deprivation leads to poor mental health and is also a result of increased device usage.
From https://www.ksjbam.com/2022/02/23/states-where-teens-dont-get-enough-sleep/:
...Late in 2021, the U.S. Surgeon General released a new advisory on youth mental health, drawing attention to rising rates of depressive symptoms, suicidal ideation, and other mental health issues among young Americans. According to data cited in the advisory, up to one in five U.S.
If you wanted more substantive changes in response to your comments, I wonder if you could have asked to directly propose edits. It's much easier to incorporate changes into a draft when they have already been written out. When I have a draft on Google Docs, suggestions are substantially easier for me to action than comments, and perhaps the same is true for Sam Altman.
I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that language models that are larger or have more RLHF training exhibit more of these behaviours when simulating an "Assistant" character.
More specifically, an "Assistant" character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney's defenses against adversarial prompting, Sydney is deli...
Given that, as part of Sydney's defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection
Why do you think that? We don't know how Sydney was 'deliberately trained'.
Or are you referring to those long confabulated 'prompt leaks'? We don't know what part of them is real, unlike the ChatGPT prompt leaks which were short, plausible, and verifiable by the changing date; and it's pretty obvious that a large part of those 'leaks' are fake because they refer to capabilities Sydney could not have, like model-editing capabilities at or beyond the cutting edge of research.
Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media
BlenderBot 3 is a 175B-parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.
As an overly simplistic example, consider an overseer that attempts to train a cleaning robot by providing periodic feedback to the robot, based on how quickly the robot appears to clean a room; such a robot might learn that it can more quickly “clean” the room by instead sweeping messes under a rug.[15]
This doesn't seem concerning: human users would eventually discover that the robot tends to sweep messes under the rug (if they ever look under it), and the developers would retrain the AI to resolve the issue. Can you think of an example that would be more problematic, in which the misbehavior wouldn't be obvious enough to just be trained away?
- GPT-3, for instance, is notorious for outputting text that is impressive, but not of the desired “flavor” (e.g., outputting silly text when serious text is desired), and researchers often have to tinker with inputs considerably to yield desirable outputs.
Is this specifically referring to the base version of GPT-3 before instruction fine-tuning (davinci rather than text-davinci-002, for example)? I think it would be good to clarify that.
Have you tried feature visualization to identify what inputs maximally activate a given neuron or layer?
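(If not, here's a minimal sketch of the kind of thing I mean, assuming a differentiable PyTorch model with image-like inputs; the function name and hyperparameters are just illustrative.)

```python
import torch

def maximally_activating_input(model, layer, unit, input_shape=(1, 3, 224, 224),
                               steps=200, lr=0.05):
    """Feature visualization sketch: gradient ascent on the input to maximize
    the activation of one unit (channel) in a chosen layer."""
    acts = {}
    handle = layer.register_forward_hook(lambda _m, _i, out: acts.update(value=out))

    x = torch.randn(input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    model.eval()  # only the input is optimized; model weights are left untouched

    for _ in range(steps):
        opt.zero_grad()
        model(x)
        # Negate the target unit's mean activation so that minimizing the loss
        # is equivalent to maximizing the activation.
        loss = -acts["value"][0, unit].mean()
        loss.backward()
        opt.step()

    handle.remove()
    return x.detach()
```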
I first learned about the term "structural risk" in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.
...To make sure these more complex and indirect effects of technology are not neglected, discussions of AI risk should complement the misuse and accident perspectives with a structural perspective. This perspective considers not only how a technological system may be misused or behave in unintended ways, but also how technology shapes the broader environment in ways that could be disruptive
Got it, I was going to mention that Haize Labs and Gray Swan AI seem to be doing great work in improving jailbreak robustness.