As Artificial Intelligence (AI) continues its rapid ascent, we are witnessing increasing evidence of well-crafted methods to undermine and exploit AI responses. Carefully crafted inputs can exploit vulnerabilities and lead to harmful or undesired results.
In recent times, we have seen numerous individuals independently uncover failure modes within AI systems, particularly through targeted attacks on language models.
By crafting clever prompts and inputs, these individuals have exposed the models' vulnerabilities, causing them to generate offensive content, reveal sensitive information, or behave in ways that deviate from their intended purpose.
To address these concerns and gain a deeper understanding of this evolving threat landscape, we are undertaking two key initiatives:
1. Attack Space
AttackSpace is an open-source curated comprehensive list of LLM security methods and safeguarding techniques.
Located at https://github.com/equiano-institute/attackspace, this open-source repository is dedicated to collecting and documenting known attacks on language models. This comprehensive resource serves as a vital platform for researchers, developers, and the general public to access information on these vulnerabilities. By fostering a collaborative space for information sharing, we aim to accelerate research and development efforts focused on improving the robustness and security of language models. This collection contains work by Viktoria Krakovna. The goal is to have a structured view and characterisation of the latent attack space. We want to model satisfiability of AI attacks. This concept, similar to the P-NP problem, involves determining whether a given set of conditions can be met to successfully launch an attack against a language model. By analyzing the features and conditions of successful attacks, we can develop efficient algorithms for identifying and preventing future attacks.
2. Project Haystack
A suite of red teaming and evaluation frameworks for language models
This project aims to develop Haystack, an open-source platform for red teaming and human feedback on LLMs with crowd-sourcing and automated methods
Despite there being many efforts to red team language models, there aren't any available open source frameworks for client-level user model evaluation and red teaming testing.
Complementing the attack space collection is our ongoing academic survey, designed to gather crucial data and insights into the nature and scope of language model attacks. This survey delves into key questions such as:
What types of attacks have successfully exploited language models and how can we characterise these attacks?
What underlying explainable vulnerabilities enable these attacks ?
What potential consequences and risks do these attacks pose?
What mitigation strategies and research directions can address these vulnerabilities?
We encourage you to join us in this endeavor. Contribute your knowledge and expertise to the attack space collection or participate in our academic survey. As we continue to ensure the responsible development and deployment of AI technology, safeguarding both its immense potential and the well-being of society.
As Artificial Intelligence (AI) continues its rapid ascent, we are witnessing increasing evidence of well-crafted methods to undermine and exploit AI responses. Carefully crafted inputs can exploit vulnerabilities and lead to harmful or undesired results.
In recent times, we have seen numerous individuals independently uncover failure modes within AI systems, particularly through targeted attacks on language models.
By crafting clever prompts and inputs, these individuals have exposed the models' vulnerabilities, causing them to generate offensive content, reveal sensitive information, or behave in ways that deviate from their intended purpose.
To address these concerns and gain a deeper understanding of this evolving threat landscape, we are undertaking two key initiatives:
1. Attack Space
AttackSpace is an open-source curated comprehensive list of LLM security methods and safeguarding techniques.
Located at https://github.com/equiano-institute/attackspace, this open-source repository is dedicated to collecting and documenting known attacks on language models. This comprehensive resource serves as a vital platform for researchers, developers, and the general public to access information on these vulnerabilities. By fostering a collaborative space for information sharing, we aim to accelerate research and development efforts focused on improving the robustness and security of language models. This collection contains work by Viktoria Krakovna. The goal is to have a structured view and characterisation of the latent attack space. We want to model satisfiability of AI attacks. This concept, similar to the P-NP problem, involves determining whether a given set of conditions can be met to successfully launch an attack against a language model. By analyzing the features and conditions of successful attacks, we can develop efficient algorithms for identifying and preventing future attacks.
2. Project Haystack
A suite of red teaming and evaluation frameworks for language models
This project aims to develop Haystack, an open-source platform for red teaming and human feedback on LLMs with crowd-sourcing and automated methods
Goals
Standards
Challenges & Efforts
Despite there being many efforts to red team language models, there aren't any available open source frameworks for client-level user model evaluation and red teaming testing.
2. Academic Survey:
Complementing the attack space collection is our ongoing academic survey, designed to gather crucial data and insights into the nature and scope of language model attacks. This survey delves into key questions such as:
We encourage you to join us in this endeavor. Contribute your knowledge and expertise to the attack space collection or participate in our academic survey. As we continue to ensure the responsible development and deployment of AI technology, safeguarding both its immense potential and the well-being of society.