Workers AI: Free Inference at the Edge

Cloudflare · Cloudflare Feature Focus · 22 min · May 19, 2026 · Intermediate

For most of the last decade, "add AI to your app" meant signing up for an API, storing an API key as a secret, and paying per token. Workers AI changes that calculus by running a curated catalogue of open-source models directly on Cloudflare's GPUs, exposed through a single binding in your Worker, with a meaningful free-tier neuron budget. No API key. No SDK. One model call is one method call.

This is the working guide: what env.AI actually is, which models are worth using today, the pricing model (neurons, not tokens), real text and image generation code from this site's backend, and an honest take on when Workers AI is the right answer vs. when you should reach for OpenRouter, OpenAI, or Anthropic instead.

The binding

[ai]
binding = "AI"

That's the whole wrangler.toml declaration. The runtime then makes env.AI available in every handler — typed, no SDK, no auth header to set. Everything is env.AI.run(modelId, input):

const out = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: 'You are a concise assistant.' },
    { role: 'user',   content: 'Give me three names for a coffee subscription.' },
  ],
});

Model IDs are namespaced by publisher: @cf/meta/..., @cf/black-forest-labs/..., @cf/openai/whisper, and so on. Most live under the @cf/ prefix; a few community-hosted models use @hf/ instead. The full catalogue lives in the Cloudflare AI Models dashboard.
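To put the call in context, here is a minimal sketch of a Worker that exposes it over HTTP. The `?q=` parameter, the 500-character cap, and the JSON response shape are assumptions made for illustration, not this site's backend code.

```javascript
// Minimal module Worker sketch. Assumes the [ai] binding is named "AI".
const worker = {
  async fetch(request, env) {
    const { searchParams } = new URL(request.url);
    // Hypothetical query parameter, capped so prompt size stays predictable.
    const question = (searchParams.get('q') || '').trim().slice(0, 500);
    if (!question) {
      return new Response('Missing ?q= parameter', { status: 400 });
    }

    const out = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a concise assistant.' },
        { role: 'user', content: question },
      ],
    });

    // Text models return their output under `response`.
    return Response.json({ answer: out?.response ?? null });
  },
};

// In a real Worker file, finish with: export default worker;
```

Keeping the handler as a plain object makes it easy to call from a test with a stubbed `env.AI`.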

The catalogue, by use case

Workers AI doesn't try to compete with the closed frontier on raw capability. It curates open-source models that fit on the platform's GPUs and run cheaply per invocation. The ones that matter for a typical product:

| Task | Model | Notes |
| --- | --- | --- |
| Chat / instruction-following | @cf/meta/llama-3.1-8b-instruct, @cf/meta/llama-3.3-70b-instruct-fp8-fast | Good for short summaries, classification, structured extraction |
| Vision (image-in) | @cf/meta/llama-3.2-11b-vision-instruct | Image + text in, text out; SAS uses it for App Store screenshot critique |
| Image generation | @cf/black-forest-labs/flux-1-schnell, @cf/stabilityai/stable-diffusion-xl-base-1.0 | Flux Schnell is fast and Apache-licensed; SDXL when you need fine control |
| Speech-to-text | @cf/openai/whisper, @cf/openai/whisper-large-v3-turbo | The classic Whisper, deployed at the edge |
| Text embeddings | @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5 | Pair with Vectorize for cheap RAG |
| Reranking | @cf/baai/bge-reranker-base | The "cleaner" half of a RAG pipeline |
| Code | @hf/thebloke/deepseek-coder-6.7b-instruct-awq | Smaller code-tuned model |

There are dozens more, and the list moves quickly. Treat the catalogue page as the source of truth; the families above are the stable, production-grade subset.
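Because the catalogue changes quickly, one option is to keep model IDs in a single lookup rather than scattering string literals across handlers. The sketch below is one way to do that; the task names and the `modelFor` helper are hypothetical, and the IDs are copied from the table above.

```javascript
// Task-to-model lookup. Swapping a model becomes a one-line change.
const MODELS = Object.freeze({
  chat: '@cf/meta/llama-3.1-8b-instruct',
  vision: '@cf/meta/llama-3.2-11b-vision-instruct',
  image: '@cf/black-forest-labs/flux-1-schnell',
  speech: '@cf/openai/whisper',
  embedding: '@cf/baai/bge-base-en-v1.5',
  rerank: '@cf/baai/bge-reranker-base',
});

// Fail loudly on an unknown task instead of sending `undefined` to env.AI.run.
function modelFor(task) {
  if (!Object.prototype.hasOwnProperty.call(MODELS, task)) {
    throw new Error(`No model configured for task "${task}"`);
  }
  return MODELS[task];
}
```

A handler would then call `env.AI.run(modelFor('chat'), input)`.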

Pricing: neurons, not tokens

Workers AI bills in neurons, Cloudflare's normalised unit that covers tokens, image steps, audio seconds, and embedding calls under one ledger. Every model has a published neuron rate. Both the Free and Paid Workers plans include 10,000 neurons/day at no charge; going beyond that requires the Workers Paid plan, where the extra usage is billed per neuron at the rate listed on the pricing page.

In practice, for everyday usage: a product that calls the AI binding occasionally (a few hundred invocations per day) usually stays within the free allocation. A product that runs AI in the hot path of every page view will spend, but still less than the same workload costs against a closed-source API.
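The arithmetic behind that claim can be sketched as a small estimator. It assumes usage is spread evenly across days (the free allocation is daily, so it does not roll over) and takes the paid rate as a parameter. The function name and the example rate below are assumptions; check the pricing page for the current figure.

```javascript
// Daily free allocation stated in this section.
const FREE_NEURONS_PER_DAY = 10_000;

// Estimate a monthly bill in USD from average daily neuron usage.
// `usdPer1kNeurons` is supplied by the caller; no rate is hard-coded here.
function estimateMonthlyCostUsd(neuronsPerDay, usdPer1kNeurons, daysPerMonth = 30) {
  const billablePerDay = Math.max(0, neuronsPerDay - FREE_NEURONS_PER_DAY);
  return ((billablePerDay * daysPerMonth) / 1000) * usdPer1kNeurons;
}

// Example with an assumed rate of $0.011 per 1,000 neurons:
const light = estimateMonthlyCostUsd(8_000, 0.011);  // under the free allocation
const heavy = estimateMonthlyCostUsd(25_000, 0.011); // 15,000 billable neurons/day
```

With those inputs, `light` is zero and `heavy` comes to roughly five dollars a month.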

Image generation — the SAS pattern

The Mac app's icon-maker and design-asset features both route through /api/ai/generate-image, which is a Worker that calls Workers AI Flux Schnell and stores the result in R2. Here's the actual function from saas/src/index.js:

async function generateImageViaWorkersAI(env, { prompt, width = 1024, height = 1024 }) {
  const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt: String(prompt).slice(0, 2048),
    num_steps: 4,
    width: Math.max(256, Math.min(parseInt(width) || 1024, 2048)),
    height: Math.max(256, Math.min(parseInt(height) || 1024, 2048)),
  });
 
  // Workers AI Flux returns { image: <base64-encoded PNG bytes> }.
  const b64 = result?.image;
  if (!b64 || typeof b64 !== 'string') {
    throw new Error('Workers AI returned no image');
  }
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

A few practical notes that come from running this in production:

  1. Flux Schnell is fast — typical 1024×1024 generation in 1–3 seconds. That's the whole reason it replaced the previous external image provider (Krea, since deprecated). A real-time UI can call it and show the result without a loading-screen apology.
  2. num_steps: 4 is the documented Flux Schnell sweet spot. Anything higher buys little quality; anything lower starts shedding detail.
  3. Output is base64-encoded PNG, not raw bytes. Decode before piping to R2.
  4. Prompts beyond ~2048 chars are accepted but truncated under the hood — clip them yourself so behaviour is predictable.

Once you have the bytes, the R2 store is a one-liner:

await env.SCREENS.put(filename, imgBytes, {
  httpMetadata: { contentType: 'image/png' },
});

The whole Worker — prompt in, public R2 URL out — fits in 40 lines.
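A minimal sketch of that prompt-in, URL-out handler follows. It reuses the AI and SCREENS bindings from the code above; the request body shape, the filename scheme, and the `PUBLIC_R2_BASE` variable are assumptions for illustration rather than the site's actual route.

```javascript
// Decode a base64 string into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Hypothetical handler: POST { prompt } in, { url } out.
async function handleGenerateImage(request, env) {
  const body = await request.json().catch(() => ({}));
  const prompt = String(body.prompt || '').trim().slice(0, 2048);
  if (!prompt) {
    return Response.json({ error: 'prompt is required' }, { status: 400 });
  }

  const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt,
    num_steps: 4,
  });
  if (typeof result?.image !== 'string' || !result.image) {
    return Response.json({ error: 'Workers AI returned no image' }, { status: 502 });
  }

  // Illustrative filename scheme; any unique key works.
  const suffix = Math.random().toString(36).slice(2, 10);
  const filename = `generated/${Date.now()}-${suffix}.png`;
  await env.SCREENS.put(filename, base64ToBytes(result.image), {
    httpMetadata: { contentType: 'image/png' },
  });

  // PUBLIC_R2_BASE is an assumed variable holding the bucket's public origin.
  return Response.json({ url: `${env.PUBLIC_R2_BASE}/${filename}` });
}
```

Returning 502 when the model produces no image keeps upstream failures distinguishable from bad input.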

Vision — image-in, text-out

SAS uses the Llama 3.2 11B vision model in the "AI App Store critique" endpoint: feed it a screenshot of an App Store listing, get back a critique focused on subtitle, screenshot order, and conversion concerns. The runtime call:

const aiResp = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: prompt },
        { type: 'image_url', image_url: { url: dataUrl } },
      ],
    },
  ],
  max_tokens: Math.min(parseInt(body.max_tokens) || 1024, 2048),
});
const content = aiResp?.response || aiResp?.result?.response || null;

The image can be a data: URL or an https:// URL the model can fetch. For a Worker that already has the bytes in memory, embedding them as data:image/png;base64,... is the simplest path and avoids any second round-trip.
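Building that data: URL from in-memory bytes can be sketched as below. The helper name is hypothetical; the chunked loop is there because spreading a very large array into String.fromCharCode in one call can exceed the engine's argument limit.

```javascript
// Convert raw image bytes into a data: URL suitable for image_url.url.
function bytesToDataUrl(bytes, mimeType = 'image/png') {
  let binary = '';
  const CHUNK = 0x8000; // 32 KiB per call keeps the argument list small
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Usage: the two bytes of "Hi" encode to the base64 string "SGk=".
const dataUrl = bytesToDataUrl(new Uint8Array([72, 105]));
```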

Vision quality on the open-source side is roughly "good enough for descriptive critique, classification, and OCR-style extraction" and explicitly not good enough for tasks that require frontier-level reasoning over an image. Calibrate expectations — and have a fallback path for when the model returns nothing usable.

The graceful-fallback pattern

A real production endpoint should treat Workers AI as the cheap first try and fall back to a stronger external model on failure or low-confidence output. SAS uses this exact shape for vision:

let content = null;
let provider = null;
try {
  const aiResp = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', { /* ... */ });
  content = aiResp?.response || null;
  if (typeof content === 'string' && content.trim()) {
    provider = 'workers-ai:llama-3.2-11b-vision';
  } else {
    content = null;
  }
} catch (e) {
  // Workers AI unavailable or over quota — fall through to OpenRouter
}
 
if (!content) {
  const orResp = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ /* OpenRouter payload */ }),
  });
  // ...
}

The first call costs nothing (within free-tier neurons) and succeeds 90%+ of the time for the typical input. The fallback covers the 5–10% where Workers AI either errors, returns an empty response, or trips on something outside the open model's competence — at the price of an external API call, but only for those cases.

This is the single pattern that has saved us the most on AI spend, with no impact on UX quality. Pair it with a per-route cache (KV or D1) on the output and the bill shrinks again.
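The per-route cache can be sketched as a thin wrapper around a KV namespace. The function name, the key format, and the TTL are assumptions; the only KV calls used are `get(key)` and `put(key, value, { expirationTtl })`.

```javascript
// Return a cached string if present; otherwise compute, store, and return it.
// `kv` is a Workers KV namespace binding; `compute` is any async function that
// resolves to a string (for example, the Workers AI + fallback chain above).
async function withKvCache(kv, key, ttlSeconds, compute) {
  const hit = await kv.get(key);
  if (hit !== null) return { value: hit, cached: true };

  const value = await compute();
  // Only cache usable output so a bad generation is not served repeatedly.
  if (typeof value === 'string' && value.trim()) {
    await kv.put(key, value, { expirationTtl: ttlSeconds });
  }
  return { value, cached: false };
}
```

A call site might look like `withKvCache(env.CACHE, 'critique:' + imageHash, 86400, runCritique)`, where `env.CACHE`, `imageHash`, and `runCritique` are placeholders for your own binding, cache key, and model call.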

Streaming responses

Long generations should stream. Workers AI supports streaming via { stream: true }:

const stream = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  {
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  },
);
return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});

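The returned object is a ReadableStream<Uint8Array> of SSE-formatted bytes, the same wire format the browser EventSource API parses. EventSource only issues GET requests, so a chat endpoint that takes its prompt in a POST body is usually read with fetch and a stream reader instead. For chat-shaped UIs, streaming is what turns "the page is frozen" into "the model is typing back," and it's free.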
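On the client, reading the stream through fetch means parsing the SSE frames yourself. The sketch below is a small incremental parser; it assumes each event is a `data:` line carrying JSON with a `response` field and that the stream ends with `data: [DONE]`. Verify both against what your model actually emits.

```javascript
// Create a parser that accepts decoded text chunks in any split and returns
// the text completed by each chunk. Incomplete events are buffered.
function createSseTextParser() {
  let buffer = '';
  return function push(chunkText) {
    buffer += chunkText;
    const events = buffer.split('\n\n');
    buffer = events.pop(); // last piece may be an unfinished event
    let text = '';
    let done = false;
    for (const event of events) {
      for (const line of event.split('\n')) {
        if (!line.startsWith('data:')) continue;
        const payload = line.slice(5).trim();
        if (payload === '[DONE]') { done = true; continue; }
        try {
          text += JSON.parse(payload).response ?? '';
        } catch {
          // Skip frames that are not valid JSON.
        }
      }
    }
    return { text, done };
  };
}
```

In a browser you would decode each Uint8Array chunk with `new TextDecoder().decode(chunk, { stream: true })`, pass the result to `push`, and append the returned `text` to the page.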

Embeddings + Vectorize = RAG without a separate database

The "right" Cloudflare RAG stack is:

  1. Embed documents with @cf/baai/bge-base-en-v1.5 (768-dim vector per chunk).
  2. Store the vector in Vectorize, a managed vector database that ships in the same Workers binding model as R2/D1.
  3. Query by embedding the user's question with the same model and asking Vectorize for the top K matches.
  4. Optionally rerank the top K with @cf/baai/bge-reranker-base before feeding the winners to the chat model.

All four steps run inside one Worker handler, with no external HTTP. For most "let me ask questions over my docs" use cases that's the entire stack — no Pinecone account, no Weaviate VM, no LangChain abstraction.
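The query half of that pipeline can be sketched as a single function. Assumptions to check against your setup: the Vectorize binding is named `DOCS_INDEX`, each stored vector carries its chunk text in `metadata.text`, and the embedding call returns vectors under `data`. The optional rerank step is left out for brevity.

```javascript
// Embed the question, fetch nearest chunks, and answer from them.
async function answerFromDocs(env, question, topK = 5) {
  // Embed the question with the same model used for the documents.
  const embedded = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
    text: [question],
  });
  const queryVector = embedded.data[0];

  // Ask Vectorize for the closest chunks, including their metadata.
  const { matches } = await env.DOCS_INDEX.query(queryVector, {
    topK,
    returnMetadata: 'all',
  });
  const usable = matches.filter((m) => typeof m.metadata?.text === 'string' && m.metadata.text);
  if (usable.length === 0) return { answer: null, sources: [] };

  // Feed the retrieved text to the chat model as grounding context.
  const context = usable.map((m, i) => `[${i + 1}] ${m.metadata.text}`).join('\n\n');
  const out = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return { answer: out?.response ?? null, sources: usable.map((m) => m.id) };
}
```

Returning the matched IDs alongside the answer lets the UI cite which chunks the reply drew on.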

Where Workers AI honestly loses

Be straightforward about the gaps:

Privacy and data residency

Workers AI invocations are not logged or used for model training by Cloudflare. The prompts and outputs flow through the AI binding and disappear; there's no cross-customer leakage and no "your data improves the model" small print. This matters for products handling user content — it's the same default you'd get from running on your own GPU, without the GPU.

Pricing scenarios at a glance

| Scenario | Daily calls | Mostly | Likely cost |
| --- | --- | --- | --- |
| Hobby project | 100 | Free tier neurons | $0 |
| Indie SaaS, light AI | 10k | A few hundred image gens + chat | $0–5/mo |
| Indie SaaS, AI in hot path | 200k | Chat + vision per page view | $20–80/mo |
| Production app, heavy use | 5M | Streaming chat + RAG + image | Several hundred $/mo |

Compare to the same workloads on closed APIs (OpenAI / Anthropic) and Workers AI is usually 3–10× cheaper. The trade-off is the model ceiling — if you genuinely need Claude Opus or GPT-5, no amount of neuron pricing fixes that.

The pros and cons cheat sheet

Pros

Cons

When to reach for Workers AI

Use Workers AI when any of the following is true:

Reach for OpenRouter / Anthropic / OpenAI when any of the following is true:

Most production stacks end up using both — Workers AI for the 90% of calls where it's enough, an external API for the 10% where it isn't, with a per-route cache in front of both. That's what this site's backend does, and it's the cheapest, fastest, and most reliable AI architecture we've shipped.

One piece left: wiring it all together

You've now seen the full Cloudflare storage-and-compute toolkit through working production code: Workers as the runtime, R2 for bytes, D1 for relational state, KV for read-cached edge state, Durable Objects for strong consistency and real-time, and Workers AI for inference. Picking the right tool for the right slot of your architecture is most of what "Cloudflare expertise" actually means — and you now have a working mental map for every one of them.

But notice how every one of those tools reached your code the same way: a binding in wrangler.toml, surfaced as env.SOMETHING. The final chapter is about that wiring itself — Ch 7: wrangler.toml & .env.local maps the four places your config can live, why secrets never touch your repo, the NEXT_PUBLIC_ trap that ships keys to every browser, and the build-time-vs-runtime gotcha that makes a value "undefined only in production." If you'd rather see all six tools work together first, the rest of simpleappshipper.com's source code is open: the backend Worker handles auth, billing, video gating, AI generation, and CI webhooks in one large route surface, and the patterns from these chapters are the only ones it uses.
