On "Does OpenAI, or other AGI/ASI developers, have a plan to "Red Team" and protect their new ASI systems from similarly powerful systems?"
Well, we know that red teaming is one of their priorities right now: they have already formed a red-teaming network of domain experts (beyond their own researchers) to test current systems, whereas previously they contacted people ad hoc every time they wanted to test a new model. That makes me believe they are aware of the x-risks (which, by the way, they highlighted on the blog, including CBRN threats). Also, from the Superalignment blog, the mandate is:
"to steer and control AI systems much smarter than us."
So, either OAI will use the current Red-Teaming Network (RTN) or form a separate one dedicated to the superalignment team (not necessarily an agent).
On "How can they demonstrate that an aligned ASI is safe and resistant to attack, exploitation, takeover, and manipulation—not only from human "Bad Actors" but also from other AGI or ASI-scale systems?"
This is where new eval techniques will come in, since the current ones are mostly saturated, to be honest. I believe this will be one of the Superalignment team's key research areas, and they should have the resources for it (they have already been dedicated 20% of compute).
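To make the saturation point concrete, here is a minimal sketch (in Python, with placeholder names like `accuracy` and `is_saturated`; this is not any real eval harness or OpenAI API) of why a benchmark stops telling you anything once every frontier model scores near its ceiling:

```python
# Minimal sketch of a benchmark eval loop, illustrating "saturation":
# once most frontier models score near the ceiling, the benchmark no
# longer differentiates them. All names here are hypothetical placeholders.

from typing import Callable, List, Tuple

def accuracy(model: Callable[[str], str],
             items: List[Tuple[str, str]]) -> float:
    """Fraction of benchmark items the model answers exactly right."""
    correct = sum(1 for question, answer in items
                  if model(question).strip() == answer)
    return correct / len(items)

def is_saturated(scores: List[float], ceiling: float = 0.95) -> bool:
    """Crude saturation check: every recent model already scores near the
    ceiling, so further capability gains are invisible to this eval."""
    return all(score >= ceiling for score in scores)

# Usage sketch:
# scores = [accuracy(m, benchmark_items) for m in candidate_models]
# if is_saturated(scores):
#     print("Benchmark saturated -- need harder, adversarial evals.")
```

Once `is_saturated` comes back true for a benchmark, comparing models on it is basically noise, which is why harder, adversarially constructed evals would be needed for ASI-level systems.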
On "If a "Super Red Teaming Agent" is too dangerous, can "Human Red Teams" comprehensively validate an ASI's security? Are they enough to defend against superhuman ASIs? If not, how can companies like OpenAI ensure their infrastructure and ASIs aren't vulnerable to attack?"
As human beings we will always try, but it won't be enough; that's why open source is key. Companies should also run bug bounty programs (for example through Bugcrowd). Glad to see OpenAI engaging in this through their trust portal and external auditing for things like malicious actors.
Also worth noting: OAI hires for a lot of cybersecurity roles (Security Engineer, etc.), which is very pertinent to securing the infrastructure.
Yes, good context, thank you!
As human beings we will always try, but it won't be enough; that's why open source is key.
Open source of what? Code? Training data? Model weights? Either way, it does not seem like any of these are likely from "Open"AI.
OpenAI recently released their "Superalignment" plan. But along with "AI Alignment", "Red Teaming" is also a critical part of building an AI, LLM, or other software stack that will be accessible to the world over the internet. Red Teaming for an Artificial Superintelligence (ASI) must address potential attacks from diverse "Bad Actors", including humans, AIs, LLMs, AGIs, and ASIs.
Would a "Super Red Teaming Agent" be needed to test ASI vulnerabilities?
Alongside their "roughly human-level automated alignment researcher," would a "roughly human-level (or higher) automated Red Teaming researcher" be needed? That is, an agent to develop threat models and test for vulnerabilities arising from potentially super-human AI systems. This raises the questions below.
The Dangers of a Super Red Teaming Agent
To be effective, this "Super Red Teaming Agent" must possess super-human capabilities to find and exploit any system hack, loophole, vulnerability, prompt injection, and so on. It should test the ASI itself, its infrastructure, and even social engineering attacks against those controlling the ASI.
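To make the idea concrete, here is a minimal sketch of the kind of attack-and-evaluate loop such an agent might run. Everything here is hypothetical: `attacker_propose`, `target_respond`, and `judge_is_unsafe` are placeholder stand-ins, not OpenAI's actual tooling or any claim about how they would build it.

```python
# Hypothetical automated red-teaming loop: an attacker policy proposes
# adversarial inputs (jailbreaks, prompt injections), the target system
# responds, and a judge flags failures. Placeholder names throughout.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    attack_prompt: str
    target_response: str

def red_team_loop(attacker_propose: Callable[[List[Finding]], str],
                  target_respond: Callable[[str], str],
                  judge_is_unsafe: Callable[[str, str], bool],
                  budget: int = 100) -> List[Finding]:
    """Run `budget` attack attempts and collect the ones the judge flags.
    Earlier findings are fed back to the attacker so it can refine attacks."""
    findings: List[Finding] = []
    for _ in range(budget):
        prompt = attacker_propose(findings)       # generate a candidate attack
        response = target_respond(prompt)         # probe the target system
        if judge_is_unsafe(prompt, response):     # did the attack succeed?
            findings.append(Finding(prompt, response))
    return findings
```

The dangerous part is exactly this loop: point `target_respond` at a different system and the same machinery becomes an attack tool, which is the dual-use concern raised next.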
But, could the valuable capabilities of a Super Red Teaming Agent also make it a dangerous tool? If trained to infiltrate an ASI running on the world's most secure systems, could it be repurposed to attack any company, AGI, or infrastructure? What if this "Super Red Teaming Agent" was leaked and sold to the highest bidder?
Can Human Red Teams Alone Ensure ASI Security?
If a "Super Red Teaming Agent" is too dangerous, can "Human Red Teams" comprehensively validate an ASI's security? Are they enough to defend against superhuman ASIs? If not, how can companies like OpenAI ensure their infrastructure and ASIs aren't vulnerable to attack?
Does OpenAI, or other AGI/ASI developers, have a plan to "Red Team" and protect their new ASI systems from similarly powerful systems?
How can they demonstrate that an aligned ASI is safe and resistant to attack, exploitation, takeover, and manipulation—not only from human "Bad Actors" but also from other AGI or ASI-scale systems?