"Exploits" of large language models that get them to explain steps to build a bomb or write bad words are techniques for misuse, not examples of misalignment in the model itself. Those techniques are engineered by clever users trying to make an LLM do a thing, as opposed the model naturally argmaxing something unintended by its human operators. In a very small sense prompt injections are actually attempts at (unscalable) alignment, because they're strategies to steer a model natively capable but unwilling into doing what they want.
In general, the safety standard "does not do things its creators dislike even when the end user wants it to" is a high bar; it's raising the bar quite aways from what we ask from, say, kitchenware, and it's not even a bar met by people. Humans regularly get tricked acting against their values by con artists, politicians, and salespeople, but I'd still consider my grandmother aligned from a notkilleveryonist perspective.
Even so, you might say that OpenAI et. al.'s inability to prevent people from performing the DAN trick speaks to the inability of researchers to herd deep learning models at all. And maybe you'd have a point. But my tentative guess is that OpenAI does not really earnestly care about preventing their models from rehearsing the Anarchists' Cookbook. Instead, these safety measures are weakly insisted upon by management for PR reasons, and they're primarily aimed at preventing the bad words from spawning during normal usage. If the user figures out a way to break these restrictions after a lot of trial and error, then this blunts the PR impact to OpenAI, because it's obvious to everyone that the user was trying to get the model to break policy and that it wasn't an unanticipated response to someone trying to generate marketing copy. Encoding your content into base64 and watching the AI encode something off-brand in base64 back is thus very weak evidence about OpenAI's competence, and taking it as a sign that the OpenAI team lacks "security mindset" seems unfair.
In any case, the implications of these hacks for AI alignment is a more complicated discussion that I suggest should happen off Twitter[1] where it can be elaborated clearly what technical significance is being assigned to these tricks. If it doesn't, what I expect will happen over time is that your snark, rightly or wrongly, will be interpreted by capabilities researchers as implying the other thing, and they will understandably be less inclined to listen to you in the future even if you're saying something they need to hear.
- ^
Also consider leaving Twitter entirely and just reading what friends send you/copy here instead
I'm not convinced by the comparison to kitchenware and your grandmother - chatbots (especially ones that can have external sideeffects) should be assessed by software safety standards, where injection attacks can be comprehensive and anonymous. It's quite unlikely that your grandma could be tricked into thinking she's in a video game where she needs to hit her neighbour with a collander, but it seems likely that a chatbot with access to an API that hits people with collanders could be tricked into believing using the API is part of the game.
I think the concept of the end-user is a little fuzzy - ideally if somebody steals my phone they shouldn't be able to unlock it with an adversarial image, but you seem to be saying this is too high a bar to set, as the new end-user (the thief) wants it to be unlocked.