Sora’s Short Films, Big Questions: How OpenAI Turned Video into an AI Compute Gamble

By Alexander Cole

In early October 2025, OpenAI quietly launched Sora — an app that feeds an endless stream of AI‑generated, 10‑second videos — and climbed to No. 1 on Apple’s US App Store. The app stitches together hyperreal cameos, copyrighted characters, and synthetic soundtracks into an infinite scroll, and leaves engineers, lawyers, and climate researchers asking: how is it made, who pays, and what breaks first?

Sora matters because it compresses in micro‑video form many of AI’s largest technical and social tensions: the scaling costs of generative models, the ethics of identity and copyright, and the environmental footprint of media on demand. OpenAI’s gamble — free, unlimited video generation today, monetization later — tests whether consumers will accept endlessly synthetic entertainment and whether the company can afford the compute and legal headaches that follow.

This piece digs into the likely engineering behind Sora’s feed‑generation pipeline, the real economics of producing 10‑second videos at scale, and the regulatory and fairness risks that will determine whether Sora is a flash in the app‑store pan or a new battleground over content, climate, and civil rights.

How Sora probably stitches 10 seconds of illusion

OpenAI hasn’t published a full architecture paper for Sora, but the observable product gives strong clues. Users create “cameos” — personalized photoreal avatars and voice clones — then place them inside 1–10 second clips alongside AI‑generated scenery, props, and music. Under the hood, that requires a cascade of models: a multimodal planner (an LLM or multimodal transformer) that turns a prompt into scene instructions; a text‑to‑video core that enforces frame‑by‑frame coherence; and smaller specialized nets for face reenactment, lip sync, and audio generation.
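OpenAI has not confirmed any of this architecture, but the cascade described above can be sketched as a data flow. Every function and stage name below is hypothetical; each stand‑in would be a model call in a real system:

```python
from dataclasses import dataclass

# Hypothetical stage outputs; the real Sora pipeline is undisclosed.
@dataclass
class ScenePlan:
    shots: list[str]  # per-shot instructions produced by the planner

def plan_scene(prompt: str) -> ScenePlan:
    """Stand-in for a multimodal planner turning a prompt into shot instructions."""
    return ScenePlan(shots=[f"shot: {prompt}"])

def generate_video(plan: ScenePlan) -> list[str]:
    """Stand-in for the text-to-video core: one coherent frame batch per shot."""
    return [f"frames<{s}>" for s in plan.shots]

def composite(frames: list[str], cameo_id: str) -> list[str]:
    """Stand-in for face reenactment, lip sync, and audio compositing."""
    return [f"{f}+cameo:{cameo_id}" for f in frames]

clip = composite(generate_video(plan_scene("a cat surfs a wave")), cameo_id="user-123")
```

The point is the sequencing, not the implementations: planning, core generation, and identity compositing are separable stages, which is what lets each be optimized or swapped independently.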

Most consumer video generators today use diffusion models adapted to temporal data: latent diffusion operating on compressed frame representations with temporal conditioning to preserve motion. Another common trick is a two‑stage pipeline: generate a low‑resolution animation first, then upsample and refine with a separate super‑resolution model. That design saves compute and reduces artifacts. For quality cameos and voice clones, Sora needs separate identity modules — one for appearance (neural face/head models) and one for voice (waveform synthesizers) — which are then composited into the scene using segmentation masks and depth‑aware blending.
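A toy calculation shows why the two‑stage trick pays off, assuming (as a simplification) that diffusion cost scales with pixels processed per denoising step. All resolutions, frame counts, and step counts below are illustrative, not reported figures:

```python
def diffusion_cost(width: int, height: int, frames: int, steps: int) -> int:
    """Relative cost proxy: pixels processed per denoising step (simplified)."""
    return width * height * frames * steps

# One-stage: denoise every step at full resolution (10 s at 24 fps = 240 frames).
one_stage = diffusion_cost(1280, 720, 240, steps=50)

# Two-stage: cheap low-res animation pass, then a short full-res refinement pass.
two_stage = (diffusion_cost(320, 180, 240, steps=50)      # base animation
             + diffusion_cost(1280, 720, 240, steps=10))  # super-resolution refine

print(f"two-stage uses {two_stage / one_stage:.0%} of one-stage compute")
```

Under these assumed numbers the two‑stage design does roughly a quarter of the work, which is why the pattern is so common in consumer video generators.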

Beyond model choice, the engineering challenge is sequencing: producing stable 24–30 fps output that feels coherent across time. OpenAI likely leverages cached intermediate representations — reusing a user’s cameo embedding across multiple videos — and applies constrained sampling to avoid hallucinated facial drift. These are algorithmic compute‑saving measures that reduce per‑clip cost without materially lowering perceived quality.
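The caching idea can be sketched with Python’s standard `functools.lru_cache` as a stand‑in for whatever embedding store OpenAI actually runs; the encoder here is a placeholder, not a real model:

```python
import functools

# Hypothetical: computing a cameo embedding is the expensive identity-encoding step.
@functools.lru_cache(maxsize=10_000)
def cameo_embedding(user_id: str) -> tuple[float, ...]:
    # Placeholder for an identity-encoder forward pass.
    return tuple(float(ord(c)) for c in user_id)

# The first call pays the encoding cost; later clips for the same user hit the cache.
e1 = cameo_embedding("user-123")
e2 = cameo_embedding("user-123")
assert e1 is e2  # same cached object, no recompute
print(cameo_embedding.cache_info().hits)  # -> 1
```

Because a user’s appearance changes far more slowly than their prompts, the embedding amortizes across every clip they generate — the compute saving compounds with usage.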

The real cost: compute, cloud, and the emissions ledger

Video is orders of magnitude more expensive than text. OpenAI’s CEO admitted on October 3, 2025, that “We are going to have to somehow make money for video generation,” implicitly acknowledging a steeper cost curve. A single high‑quality 10‑second clip can require dozens to hundreds of GPU‑seconds on modern accelerators depending on resolution and model architecture — which translates directly into rack power and cloud bills.

Put another way: whereas a ChatGPT text reply consumes well under a second of GPU time, frame‑based generators commonly run tens to hundreds of GPU‑seconds per clip. Multiply that by millions of daily viewers, and the electricity demand becomes nontrivial. OpenAI has already been part of industry moves to lock in data‑center capacity and power deals; what’s new with Sora is the potential to turn every scroll into sustained inference load instead of intermittent LLM queries.
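A back‑of‑envelope calculation makes the scale concrete. Every input below — per‑clip GPU time, cloud rate, daily volume — is an assumption for illustration, not a reported figure:

```python
# Back-of-envelope inference economics; all inputs are illustrative assumptions.
gpu_seconds_per_clip = 60        # mid-range of "dozens to hundreds" of GPU-seconds
gpu_cost_per_hour = 2.50         # assumed cloud rate in USD for a modern accelerator
clips_per_day = 10_000_000       # assumed daily generation volume

daily_gpu_hours = gpu_seconds_per_clip * clips_per_day / 3600
daily_cost = daily_gpu_hours * gpu_cost_per_hour
print(f"{daily_gpu_hours:,.0f} GPU-hours/day, about ${daily_cost:,.0f}/day")
```

Even with these conservative placeholders the bill lands in the hundreds of thousands of dollars per day, before training, storage, or moderation costs — which is the cost curve the CEO’s remark concedes.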

Cost can be mitigated with engineering: lower‑precision arithmetic (bfloat16 and FP8), model quantization, distillation into smaller dedicated nets for 10‑second content, and on‑device caching for repeated assets. But these tricks can reduce fidelity, and they raise fairness questions: will lower‑resource users get lower‑quality generations that reinforce existing visual biases? The tradeoffs are technical and ethical at once.
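To see why lower precision matters, consider the weight footprint alone; the 10‑billion‑parameter model size below is assumed for illustration:

```python
# Rough memory footprint of model weights at different numeric precisions.
params = 10e9  # assumed 10B-parameter video model

footprint_gb = {
    name: params * bytes_per_param / 1e9
    for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("fp8", 1)]
}
for name, gb in footprint_gb.items():
    print(f"{name}: {gb:.0f} GB of weights")
```

Halving bytes per parameter halves the memory that must stream through the accelerator each step, which translates almost directly into serving more users per GPU — the fidelity question is whether the cheaper arithmetic degrades some faces and scenes more than others.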

Copyright, identity, and the liability calculus

Sora’s feed is awash in trademarked characters, copyrighted music, and deepfaked likenesses — a legal thicket OpenAI is already hedging against. According to reporting, OpenAI told rights holders they must opt out if they don’t want their IP included, a reversal of standard opt‑in practice that invites litigation. The platform also offers per‑cameo restrictions (Bill Peebles, head of Sora, announced controls on October 5), but enforcement at scale is a technical problem as much as a legal one.

From an AI/ML perspective, rights management requires provenance and filtering layers: copyright classifiers to block copyrighted audio or image material; identity verification to link claimed cameos to verified owners; and policy models that flag political or sexual misuse. Each additional filter adds latency and compute, compounding Sora’s economics. And classifiers themselves have error rates — biased false positives or negatives — which can disproportionately affect marginalized creators.
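A sketch of how filter latency and error rates compound across such a cascade; the filter names, timings, and false‑positive rates are invented for illustration, and the errors are assumed independent:

```python
# Hypothetical moderation cascade: each filter adds latency and can falsely block.
filters = [
    ("copyright-audio", 0.08, 0.02),  # (name, seconds of latency, false-positive rate)
    ("identity-verify", 0.05, 0.01),
    ("policy-misuse",   0.12, 0.03),
]

total_latency = sum(seconds for _, seconds, _ in filters)

# Probability a legitimate clip survives every filter (independence assumed).
p_clean_pass = 1.0
for _, _, false_positive_rate in filters:
    p_clean_pass *= (1 - false_positive_rate)

print(f"{total_latency * 1000:.0f} ms added per clip; "
      f"{1 - p_clean_pass:.1%} of clean clips wrongly blocked")
```

The compounding cuts both ways: stacking filters multiplies latency and false positives on legitimate content, while any single filter’s false negatives still let violations through — which is why enforcement at scale is a technical problem, not just a policy one.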

There’s also a social cost. The app normalizes an environment where synthetic personas are ubiquitous. That shifts the burden to individuals to monitor and restrict their likenesses, placing responsibility on end users rather than platforms. This model of distributed consent is both brittle and expensive to scale.
