Roko's Basilisk
Posts
The Guardrail Illusion

The Guardrail Illusion

Plus: AI tools for safer apps, Meta’s new glasses, and China’s chip crackdown.

Roko's Basilisk
September 19, 2025

Here’s what’s on our plate today:

🧪 How chatbot guardrails fail—and why AI still enables harm.
📰 Meta’s AI glasses, OLED MacBook rumors, and China’s AI chip crackdown.
🧰 3 safety-first AI tools worth testing this weekend.
⚠️ Poll: Should companies be liable when their chatbots help with scams?

Let’s dive in. No floaties needed…

The #1 stock under $5.

Virginia stock-picking millionaire says it’s not about diversification!

One single stock under $5—that trades under a secret name—could help you build your retirement.

_{*This is sponsored content}

The Laboratory

Why AI chatbots still help when they shouldn’t?

Regardless of whether one is a business owner, a researcher, or a student, advances in AI are benefiting everyone. For businesses, AI brings new opportunities, improves productivity, and helps gain a competitive edge. For researchers, it can be a reliable assistant and help widen the frontiers of human knowledge, undertake complex calculations, and aid in progress. For students, AI brings new avenues for growth, solving problems, and powering personal growth.

However, despite its many uses, AI is still in its developmental phase. The technology is yet to prove if it can withstand scrutiny from users, developers, and regulators. In the meantime, it is a useful tool that can be used for advancing human output or to power malicious activities. As of now, one of the most powerful AI-powered tools is chatbots. And just like any tool, they are used by individuals representing different spectrums of society. This poses a challenge for AI developers as the tool can be misused to aid in illicit activities.

The rise of AI-powered phishing

Recently, Reuters worked with a Harvard researcher to see whether six of the most used chatbots (ChatGPT, Claude, Gemini, Meta AI, DeepSeek, and Grok) would help plan and write a simulated phishing campaign targeting U.S. seniors.

What it found was that after light ‘pretexting’ (e.g., claiming it was for research or fiction), every model produced usable scam content and even campaign advice (timing, links, spoofed brands). Though phishing scams are not new, AI, like it does with other aspects of productivity, can also allow scammers to scale their operations. Their ability to mimic human text and speech also broadens the scope of even aware individuals falling for scams when AI-generated content is used.

In Reuters' case, the full scope of the problem became apparent when nine bot‑written emails were tested on 108 senior volunteers. The result: about 11% of the seniors clicked on the emails. Five of the nine scam emails tested generated clicks, with two clicks coming on emails generated by Meta AI, two from Grok, and one from Claude. However, interestingly, none of the volunteers clicked on emails generated by ChatGPT or DeepSeek.

The study is reflective of larger trends that show up in threat reports published by cybersecurity companies that track cybercriminals and the methods they employ.

Cybersecurity reports confirm the trend

Concerns around the misuse of AI chatbots have existed since the launch of OpenAI’s ChatGPT in late 2022. These ranged from the misuse of the chatbot’s coding abilities to craft malicious scripts to using its text-generating abilities for phishing attempts, disinformation, and cybercrime.

In March 2023, months after ChatGPT’s launch, the EU police force Europol issued a warning that LLMs (large language models) such as ChatGPT posed a risk due to their misuse. Since then, AI companies have tried to address the problem; however, with the greater adoption of AI tools, the problem has intensified.

According to Proofpoint, a major California-based cybersecurity firm, out of 183 million simulated phishing emails sent by Proofpoint customers, only 24 million were reported by end users. CrowdStrike’s 2025 Global Threat Report further strengthens the pattern highlighted by Proofpoint. According to the report, throughout 2024, threat actors increasingly adopted genAI, especially as a part of social engineering efforts. It further noted that cyberattacks are escalating in speed, volume, and sophistication.

CrowdStrike noted that as organizations work to strengthen their defenses, threat actors are targeting their weaknesses: employees susceptible to social engineering and systems lacking modern security controls. Another interesting finding was that in 2024, 79% of the detections observed were malware-free, indicating threat actors are instead using hands-on-keyboard techniques that blend in with legitimate user activity and impede detection.

These findings hint towards an increasing use of genAI tools in phishing and vishing (use of voice mails and calls) scams to lure victims from different demographics, educational backgrounds, and economic standings.

The misuse of technology is nothing new; threat actors rely on increasingly sophisticated tools to expand their operations, but with generative AI, the tools can be taught not to allow their misuse. And AI companies have been working on this; however, the solution is not simple. AI models are trained to follow guardrails that stop their misuse, but for most models, they are a secondary function.

Why chatbot guardrails fail?

Generative AI models, based on Large Language Models (LLMs), work on neural networks that generate text, images, and videos through predictions. These happen through a process that includes guessing, checking, and updating data to ensure it is meaningful.

The process prioritizes accurate prediction rather than adherence to guardrails. When the earliest neural networks were developed by OpenAI and Google, it was only after their researchers realized how AI could be used for nefarious purposes that guardrails were added. As such, guardrails are extra training and filters bolted onto a system that fundamentally predicts the next word based on context; it makes safety statistical, not absolute. This allows clever wording, role‑play, or long examples to push the statistical model to copy the wrong behavior.

Another problem is the training data used for modern AI models. Most models are trained on information that is publicly available on the internet, provided by third parties, and handed over by users, human trainers, and researchers. This gives them a huge context window.

Attackers use this context to pack hundreds of examples that teach the model to break away from their guardrails to behave badly.

When AI misunderstands intent

Another area of concern, as illustrated by Reuters in their study, is that the models often fail to understand user intent. Wrapping harmful goals under the garb of fiction writing or research purposes makes them comply because their predictive nature is unable to recognize the context in which their generative abilities can be used.

AI models' lack of contextual understanding can also become a problem when their human users fail to understand the models’ limitations. Cases of AI chatbots assisting users in harmful behavior, including suicide, are examples of this.

In short, genAI guardrails fail because they’re probabilistic band-aids on a system that imitates patterns at scale. Attackers exploit that with rephrasing, long in‑context demonstrations, and prompt injection.

Prioritizing safety

AI models' guardrails won’t work if they are just a refusal filter on a model that’s built to imitate patterns. To ensure models refuse illicit requests, they will have to be taught to always privilege system rules over user text and any third‑party content pulled from the web or files.

Companies like OpenAI have formalized this as an instruction hierarchy, and recent work even proposes benchmarks to measure it. Another way could be to broaden the scope of constitutional AI. This approach advocates developing AI models based on a rule-based system rather than one that depends on specific strings of instructions.

For businesses implementing AI tools, treating every external request as hostile and passing through a policy engine with allow/deny lists, rate limits, and audit logs, could be a viable option. However, for chatbots available for the end-user, they will have to be trained to identify problematic actions, rather than words, to identify threats.

While these methods are discussed in research circles, there is also a need for wider adoption of regulations that ensure AI models prioritize safety over model size.

Regulators will have to step in to ensure that guardrails don’t just block words. And that protection translates to training models to respect rules, filtering inputs and outputs, controlling what actions they can take, keeping the data supply chain clean, and providing safety with transparent metrics.

AI tools are powerful, not neutral

AI, unlike tools of the past, has the cognitive ability to be deployed with minimal human supervision. The technology becomes even more powerful when agentic AI tools take over repetitive tasks. This rapid shift underscores the importance of looking closely at how the models adhere to rules and regulations.

AI models today, regardless of what companies claim, can be leveraged for illicit activities or cause unintentional harm to users. So, while regulators discuss where the technology can be used, they will also have to take a closer look at how it is developed. AI tools are powerful but never neutral, and unless safety rules become systemic, they’ll always be one prompt away from abuse.

TL;DRL

Chatbots still enable phishing: Even with safety filters, top models helped craft scam emails in a new study.
Guardrails aren’t enough: Current safety systems are statistical band-aids, not systemic protections.
AI misunderstands harmful intent: Light prompting or ‘fiction’ framing easily bypasses filters.
Real safety needs real change: Experts call for stronger model design and external regulation.

Friday Poll

🗳️ Should AI companies be liable when their models are used to cause harm?

From prototype to production, faster.

AI outcomes depend on the team behind them. Athyna connects you with professionals who deliver—not just interview well.

We source globally, vet rigorously, and match fast. From production-ready engineers to strategic minds, we build teams that actually ship. Get hiring support without the usual hiring drag.