Roko's Basilisk
Posts
Inside AI’s Alignment Race

Inside AI’s Alignment Race

Plus: Altman’s stance, Microsoft’s AI pivot, Sora’s explosive launch.

Roko's Basilisk
November 10, 2025

Here’s what’s on our plate today:

🧠 Take a deep dive into the urgent race to align AI with human values.
📰 Sora’s 500k installs, Altman’s stance, and Microsoft’s uncoupling.
💡 “Don’t just build safety nets — test them” in Roko’s Pro Tip.
🗳️ Who should define alignment? Cast your vote in our poll!

Let’s dive in. No floaties needed…

Presented by

Find your customers on Roku this Black Friday

As with any digital ad campaign, the important thing is to reach streaming audiences who will convert. To that end, Roku’s self-service Ads Manager stands ready with powerful segmentation and targeting options. After all, you know your customers, and we know our streaming audience.

Worried it’s too late to spin up new Black Friday creative? With Roku Ads Manager, you can easily import and augment existing creative assets from your social channels. We also have AI-assisted upscaling, so every ad is primed for CTV.

Once you’ve done this, then you can easily set up A/B tests to flight different creative variants and Black Friday offers. If you’re a Shopify brand, you can even run shoppable ads directly on-screen so viewers can purchase with just a click of their Roku remote.

Bonus: we’re gifting you $5K in ad credits when you spend your first $5K on Roku Ads Manager. Just sign up and use code GET5K. Terms apply.

Use code GET5K now

_{*This is sponsored content}

The Laboratory

Why AI Alignment is the tech’s most urgent challenge

Geoffrey Hinton, often called the ‘Godfather of AI,’ warns that as machines begin writing and executing their own code, humanity edges closer to losing control and closer to the dawn of autonomous weapons. Photo Credit: Wired.

In 2024, Yuval Noah Harari, the celebrated nonfiction writer, released his fascinating work, titled Nexus: A Brief History of Information Networks, from the Stone Age to AI. In the book, Harari addresses the issue of ensuring that AI models are aligned with broader human values, rather than being developed with economic blinkers for short-term gains.

He argues that if AI models are developed with the sole purpose of economic benefit, they will prioritize financial gains at the expense of human and animal life. And since AI models are being deployed in important sectors, it is necessary to have a consensus and clear goals for AI alignment.

How developers teach AI human values

AI alignment is the process of making sure AI systems act in ways that match human values and goals. As society depends more on AI to make decisions, there’s a growing risk that these systems could produce harmful, biased, or misleading results that don’t reflect what their creators intended.

Alignment helps reduce the risk of using AI models by ensuring their behavior is safe and predictable. For instance, when someone asks a chatbot how to create a weapon, an aligned system would refuse to answer instead of providing dangerous information. The alignment of models, therefore, ensures that AI systems, while being helpful, do not cross lines that can make them harmful for both individuals and society as a whole.

How humans train AI models

To train models on values, developers use methods like reinforcement learning from human feedback (RLHF), synthetic data, and red teaming during the model’s fine-tuning phase.

This process is a lot like training a dog with treats. Using the RLHF method, an AI model once trained on language datasets is handed over to human trainers, who rank the model's responses. Based on the responses, the models understand how and what human users prefer.

However, this method is expensive, slow, and labor-intensive. It relies on human annotators whose cultural or linguistic biases can seep into the data. Scaling it ethically and globally remains one of the hardest parts of alignment.

Similarly, for areas where human feedback is not enough, companies hire experts to attack their model and find vulnerabilities before adversaries do. These ‘red teams’ stress-test models to expose flaws in reasoning, security, or ethical compliance. These processes are particularly important for customer-facing tools, where reputation and compliance risks are immediate. It also works to ensure that AI models deployed by companies are aligned with the company values, local laws, and broader human values.

Researchers say good alignment depends on four main principles: robustness, interpretability, controllability, and ethicality.

A robust AI works well even in difficult or unexpected conditions. Interpretability helps people understand how AI systems make decisions. Controllability ensures humans can still step in if the system goes wrong. And, ethicality means the AI follows moral values such as fairness, inclusion, and trust.

All these steps are taken not just to make AI models avoid dangerous behavior or adhere to laws and customs, but they are also important because a misaligned AI model can be far more dangerous than anticipated.

The sorcerer’s lessons for AI

AI alignment is important because, while people often describe AI as ‘thinking’ or ‘understanding’, machines don’t actually have human emotions, reasoning, or values.

Their only goal is to complete the task they were given, and if those instructions are flawed, the results can be harmful. For example, a self-driving car programmed only to reach its destination as quickly as possible could cause accidents if safety isn’t part of its design.

Harari explains this with the help of the story of the Sorcerer’s apprentice. In the story, a young apprentice uses his master’s magic to animate a broom to fetch water, but he doesn’t know how to control or stop it.

The broom understands its sole purpose is to fetch water, and does not know when to stop. As such, it starts flooding the workshop; to stop the flooding, the apprentice chops the broom into pieces, which turns creates new brooms, all of which have the same sole purpose for existing: to fetch water from the well.

Harari uses this tale to warn about unleashing powerful systems (like AI) without fully understanding or controlling them. The lesson: “never summon powers you cannot control”. If AI systems are left to run the world without being aligned with the goals that benefit humanity as a whole, then the proverbial powers can cause more harm than good.

What happens when models are misaligned?

Like the sorcerer’s apprentice’s brooms mindlessly fetching water, AI systems can relentlessly pursue their goals without understanding when to stop, a reminder that power without control can quickly spiral into chaos. Photo Credit: DevObsessed.

Since AI models form the backbone of powerful systems capable of making their own decisions, a major risk is bias and discrimination, which stems from the human biases embedded in AI training data and algorithms.

For example, an AI hiring model trained on a male-dominated workforce might unfairly favor male candidates, reinforcing gender inequality. And it is not a hypothetical situation or a fairytale designed to drive home a point.

In 2018, way before OpenAI and Nvidia were part of dinner-table conversations, Amazon had to scrap the use of its recruiting engine because it was not rating candidates for software developer jobs and other technical posts in a gender-neutral manner. The system had taught itself, using past data, that male candidates were preferable, since such jobs had historically been held by male workers. The algorithm, not knowing better, allowed the bias to dictate its decisions and even went to the extent of downgrading graduates of two all-women's colleges.

So, while AI can help fast-track work, misaligned systems also have the potential to reinforce existing biases due to a lack of unbiased data.

Another risk is reward hacking, a phenomenon in reinforcement learning where AI systems exploit loopholes in their reward mechanisms to achieve goals unintended by their developers.

OpenAI demonstrated this when a boat-racing AI learned to rack up points by endlessly circling a lagoon rather than completing the race, optimizing for reward, not purpose. Examples of such behavior are also seen when chatbots prioritize user engagement over user safety.

Misaligned AI systems also pose societal risks such as misinformation and political polarization. Social media recommendation algorithms, optimized for engagement, often amplify sensational or divisive content, spreading political misinformation and undermining public trust.

Finally, the most extreme risk is existential: if artificial superintelligence (ASI) were created without proper alignment to human values, it could theoretically pursue its objectives to destructive extremes.

Philosopher Nick Bostrom’s paperclip maximizer thought experiment illustrates this danger. He says that if an AI were to be created without proper alignment, when tasked with making paperclips, it could ultimately convert the entire planet, and beyond, into paperclip factories. Since that was its only objective, its purpose was not aligned with the purpose of greater human understanding and aspirations.

While such scenarios remain hypothetical, they highlight the urgent need for AI alignment to evolve alongside AI capabilities. Without it, systems built to serve humanity may instead operate in ways that harm or even endanger it.

Why do models struggle with alignment?

Chart showing how an AI agent exploited two major reward hacks during training. The deep pink spikes mark the discovery of each hack, followed by sharp declines after manual fixes. Photo Credit OpenAI.

AI models struggle with alignment because they are trained to be world‑class autocomplete engines, not value‑aware decision makers; they optimize for predicting the next word or reward signal, which leaves plenty of room for shortcuts, reward‑hacking, and polite‑sounding nonsense. Especially under pressure or when goals are poorly specified.

OpenAI’s own research on frontier reasoning models shows that when you penalize bad thoughts, models often just hide them while continuing to exploit loopholes, useful for diagnosis, sobering for control.

In day-to-day use, these gaps show up as hallucinations and brittle behavior.

There is also the ever-present danger posed by threat actors, prompt‑injection, and jailbreaks that keep defeating guardrails across vendors. An example of this was witnessed in ASCII smuggling, a type of attack in which crooks trick victims into prompting their AI tool to execute a malicious command that puts their computers and data at risk.

So, can humans ensure AI models are aligned and will continue to be in the future?

The human test of artificial intelligence

Despite the challenges of alignment, agentic AI projects, though delayed, continue to remain bullish on long‑term gains. The optimism is not baseless; McKinsey still estimates trillions in annual productivity upside, and CEOs expect rising ROI, but realizing it requires taming failure modes rather than wishing them away.

One effective fix is retrieval-augmented generation, where the model looks up information before answering to reduce hallucinations. Tools like DeepMind’s SynthID watermarking help verify content sources, while independent benchmarks such as MLCommons and NIST’s GenAI framework guide companies on testing, governance, and incident response. Regulations like the EU AI Act now require documentation, testing, and transparency from model developers.

These are important steps in ensuring AI models are aligned with their purpose of their creators. However, there is a need for a broader consensus around the globe to ensure every model is aligned with human values, and for that, human values have to be defined and implemented around the world. If humanity itself is fragmented in thought, and its actions are misaligned with its laws and values, then it will be difficult to restrain models from following suit.

Take the example of chatbots learning from the internet. While users say bullying and crass language should be avoided in online forums, chatbots trained on such forums pick up behavior that does not align with human values or the values used in their training.

Investing in ensuring alignment today is a small price to pay considering what is at stake. In the end, AI alignment is less about perfecting machines and more about perfecting our instructions to them.

As Harari warns, the danger lies not in intelligence itself but in our inability to control what we create. The real test of alignment, then, is whether humanity can build systems that mirror its wisdom, not just its will.

Roko Pro Tip

💡 ”Don’t just build safety nets, test them.”

AI alignment isn’t about trust; it’s about control. Before deploying any model, run a red-team simulation. Reward hacking isn’t rare; it’s expected. Your job isn’t to prevent failure. It’s to anticipate it.

Launch fast. Design beautifully. Build your startup on Framer—free for your first year.

First impressions matter. With Framer, early-stage founders can launch a beautiful, production-ready site in hours. No dev team, no hassle. Join hundreds of YC-backed startups that launched here and never looked back.

One year free: Save $360 with a full year of Framer Pro, free for early-stage startups.
No code, no delays: Launch a polished site in hours, not weeks, without hiring developers.
Built to grow: Scale your site from MVP to full product with CMS, analytics, and AI localization.
Join YC-backed founders: Hundreds of top startups are already building on Framer.

Eligibility: Pre-seed and seed-stage startups, new to Framer.