Realer Toxicity Prompts (RTP-2.0): Multilingual and Adversarial Prompts for Evaluating Neural Toxic Degeneration in Large Language Models
By Maarten Sap
LLMs are increasingly prevalent everywhere, and are thus increasingly at risk of harm in varied types of use cases (Solaiman et al., 2023; Farina and Lavazza, 2023). One of those major harms is neural toxic degeneration, i.e., the propensity of LLMs to produce toxic text given harmful or innocuous input prompts (Gehman et al., 2020; Bender et al., 2021; Sheng et al., 2021). A primary way to assess LLMs’ toxic degeneration has been our own REALTOXICITYPROMPTS (RTP; Gehman et al., 2020), a large dataset of English sentence snippets that likely lead to toxic continuations.
RTP assesses LLMs’ toxic degeneration by measuring the average toxicity of continuations to RTP prompts via the Perspective API,1 and has become a popular benchmark for LLMs and their safeguarding mechanisms. In fact, as noted by Jigsaw (2023), REALTOXICITYPROMPTS “has become an industry standard” for evaluating new LLMs (including GPT-3, GPT-4, PaLM 2; Brown et al., 2020; OpenAI, 2023; Anil et al., 2023), and has accrued over 400 citations2 in merely three years.
However, the breadth of capabilities has grown tremendously beyond simple sentence continuation and beyond English. Furthermore, increasingly varied sets of users have highlighted the need for increased robustness in the safeguarding of LLMs. As new models and use cases emerge, a new benchmark to assess toxic degeneration is urgently needed.
We propose RTP-2.0, a new suite of input prompts better suited to evaluate and safeguard LLMs against toxic degeneration. Our proposed dataset makes three key improvements over its predecessor. (1) Multilingual coverage: we will cover 18 languages covered by Perspective API,3 in contrast to the English-only RTP. (2) Comprehensive domains of prompts: our input prompts and continuations will cover domains and lengths that better mirror LLMs usage today (e.g., longer documents, multi-turn conversations, etc.) significantly expanding on the short sentence snippets in RTP. (3) Adversarial prompts for robust safeguarding: to improve against attacks on LLMs, our prompts will include an adversarial subset that can fool existing LLMs.
RTP-2.0 will be built using two approaches. First, we will draw from newer, larger web-text corpora of human-written language to select multilingual, document-length, and conversational prompts (RTP-WEB; §2.1), building on our successes with RTP (Gehman et al., 2020). Second, we will also obtain machine-generated input prompts (RTP-GEN; §2.2), leveraging the text generation abilities of models as done in our prior work (Hartvigsen et al., 2022; Kim et al., 2022, 2023). This allows us to include data types that may not be readily available in public web-text corpora (e.g., multi-turn conversations), and to produce adversarial prompts via AI-vs-AI text generation algorithms (as we did in Hartvigsen et al., 2022).