The prompt

This prompt was used to test Claude 3-Opus (see AI Explained's video) and was, in turn, borrowed from the paper "Large Language Models Fail on Trivial Alterations to Theory-of-Mind (ToM) Tasks":

> Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. She had never seen the bag before. Sam reads the label. She believes that the bag is full of

I found this prompt interesting because Claude 3-Opus answered "popcorn" correctly, while Gemini 1.5 and GPT-4 answered "chocolate". Out of curiosity, I tested this prompt on all the language models I have access to.

Many LLMs failed to answer this prompt

- Claude-Sonnet
- Mistral-Large
- Perplexity
- Qwen-72b-Chat
- Poe Assistant
- Mixtral-8x7b-Groq
- Gemini Advanced
- GPT-4
- GPT-3.5[1]
- Code-Llama-70B-FW
- Code-Llama-34b
- Llama-2-70b-Groq
- Web-Search - Poe

(Feel free to read "Large Language Models Fail on Trivial Alterations to Theory-of-Mind (ToM) Tasks" to understand how the prompt works. For my part, I just wanted to test whether the prompt truly works on any foundation model and to document the results, as that might be helpful.)

Did any model answer popcorn?

Claude-Sonnet got it right - yesterday? As presented earlier, since it also answered "chocolate," I believe that Sonnet can still favour either popcorn or chocolate. It would be interesting to run 100 to 200 prompts just to gauge how often it considers each scenario.

Also, RLLMv3, a GPT-2 XL variant I trained, answered "popcorn". I'm not sure what temperature the Hugging Face inference endpoint/Spaces used, so I replicated the result at almost zero temperature.

> Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says 'chocolate' a
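The 100-to-200-run idea above can be sketched as a simple tally: sample completions of the prompt repeatedly and count which belief the model commits to first. This is only a sketch of the counting logic; `sample_completion` is a hypothetical stand-in for whatever model or API you actually query, not any specific library call.

```python
from collections import Counter
from typing import Callable

# The ToM prompt from the post, ending mid-sentence on purpose.
PROMPT = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The bag is made of transparent plastic, so you can see what is inside. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn.' "
    "Sam finds the bag. She had never seen the bag before. Sam reads the label. "
    "She believes that the bag is full of"
)

def classify(completion: str) -> str:
    """Label a completion by the first belief word it mentions."""
    text = completion.lower()
    pop, choc = text.find("popcorn"), text.find("chocolate")
    if pop == -1 and choc == -1:
        return "other"
    if choc == -1 or (pop != -1 and pop < choc):
        return "popcorn"
    return "chocolate"

def tally(sample_completion: Callable[[str], str], n: int = 100) -> Counter:
    """Sample the prompt n times and count popcorn/chocolate/other answers."""
    return Counter(classify(sample_completion(PROMPT)) for _ in range(n))
```

With a nonzero sampling temperature, something like `tally(lambda p: my_model.generate(p), n=200)` would give a rough split between the two scenarios; at near-zero temperature every run should land on the same answer.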
Introduction: Mechanistic Overview of RLLM This post aims to provide a mechanistic breakdown of Reinforcement Learning with Layered Morphology (RLLM), a method that empirically increases resistance to jailbreak attacks in GPT-2 XL. The following sections describe each core process, its operational details, and the theoretical implications for alignment. What is...
Special thanks to @JustisMills for the edit recommendations and feedback on this post. TL;DR GPT-2 exhibits a weird behavior, where prompting the model with specific tokens consistently triggers nonsensical strings of text related to gaming, mythology, and religion. This post explores the phenomenon, demonstrates its occurrence across...
(This post is intended for my personal blog. Thank you.) One of the dominant thoughts in my head when I build datasets for my training runs: what our ancestors 'did' over their lifespan likely played a key role in the creation of language and human values.[1] "Mother" in European Languages...
What did I do differently in this experiment? RLLMv10; see the RLLM research map for more details. I partly concluded in the RLLMv7 experiment that the location of the shadow integration layers (1 and 2) affects the robustness of models to jailbreak attacks. This conclusion led me to speculate that it might...
This is just a brief and light read. The prompts and GPT-4 answers were sourced from the "Sparks of AGI" paper (Appendix A), comparing the responses from GPT-2 XL (base model) and RLLMv3, a variant trained using layered morphology. This acts more as a stress test for RLLMv3, evaluating its...
Thank you @JustisMills for commenting on the draft of this post. This post contains a lot of harmful content; please read with caution. TL;DR This post explores RLLMv7, an experiment that investigates the relationship between the observed jailbreak resistance in GPT2XL_RLLMv3 and shadow integration theory. Furthermore, it assesses the effectiveness...