Where Does Your Code Go? A Guide to AI Inference Data Residency

When you send a prompt to an LLM API, that prompt travels to a server in a specific physical location and is processed there. The jurisdiction those servers sit in determines the legal framework governing your data. For many teams, that matters more than the model's benchmark score.

This guide explains what data residency means in the context of LLM inference, what the landscape looks like today, and the concrete choices available to teams that want control over where their prompts land.

What data residency means for LLM inference

Data residency, in its simplest form, refers to the physical or geographic location where data is processed and stored. For inference workloads, that covers where your prompt is sent, where the computation runs, and where logs or cached data might persist, not just where the model weights live.

LLM providers run inference in data centers of their own choosing, so a request from a developer in London hitting a provider's API doesn't necessarily stay in London. Without explicit regional routing, that request might be processed anywhere the provider runs capacity. That's how cloud infrastructure typically works: load is routed to available compute, regardless of where the caller is.

This matters for three reasons. First, different legal jurisdictions impose different obligations on data that passes through them: government access laws, data protection regulations, and industry compliance frameworks all vary by country. Second, corporate data governance policies often specify where proprietary information may be processed. Third, for certain regulated industries, residency is a hard contractual or regulatory requirement.

Your prompt is your source code

When you use an AI coding tool, the prompt contains your actual work product, not just a question. A typical coding session surfaces file contents, function signatures, variable names, business logic, API patterns, and sometimes credentials or environment variable references that accidentally end up in context. The prompt is, effectively, a window into your codebase.

This is worth stating plainly because the mental model many developers have ("I'm just asking a question") undersells what's being transmitted. For open-source projects with no proprietary content, this is a non-issue. For a team building a product with a unique technical approach, the prompt stream from an AI coding tool is a meaningful slice of IP.

Data residency covers regulatory compliance, but it's also about deciding where your source code goes and under what legal framework it sits while being processed.
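Because credentials and environment variable references can slip into context, one cheap safeguard is to scan a prompt before it leaves your machine. The following is a minimal sketch, not a substitute for real DLP tooling: the pattern set and the `find_secrets` helper are hypothetical names chosen for illustration, and the regular expressions cover only a few obvious shapes.

```python
import re

# Illustrative patterns only; these are assumptions, not an exhaustive rule set.
SECRET_PATTERNS = {
    # Shape of an AWS access key ID: "AKIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Header line of a PEM-encoded private key.
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Assignments such as `export DB_PASSWORD=...` or `api_key = "..."`.
    "env_assignment": re.compile(
        r"(?im)^[ \t]*(?:export[ \t]+)?"
        r"[A-Z0-9_]*(?:SECRET|TOKEN|PASSWORD|API_KEY)[A-Z0-9_]*"
        r"[ \t]*=[ \t]*\S+"
    ),
}


def find_secrets(prompt: str) -> list[tuple[str, int]]:
    """Return (pattern_name, line_number) for each suspicious match, in line order."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(prompt):
            # Line numbers are 1-based: count the newlines before the match.
            line_number = prompt.count("\n", 0, match.start()) + 1
            findings.append((name, line_number))
    return sorted(findings, key=lambda item: item[1])
```

A coding tool wrapper could call `find_secrets` on the assembled context and refuse to send, or ask for confirmation, when the result is non-empty.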

Where do popular providers run inference?

This varies significantly by provider and is worth researching directly before committing to any tool for production use.

Anthropic (Claude) runs inference on US-based infrastructure by default. AWS-hosted Claude deployments via Bedrock can be configured to specific AWS regions, giving enterprise customers more control.

OpenAI similarly operates primarily out of US infrastructure, with some Azure-backed enterprise options that provide region selection.

Many open-model providers, including providers of models developed outside the US or Europe, run their native APIs from their home-country infrastructure. That's how they're built: the company is there, the data centers are there. This is usually public information, but it is rarely documented prominently, and developers often don't think to check.

The practical implication: if you're calling an API provided by a company headquartered in a particular country, the default inference location is often that same country unless the provider explicitly states otherwise. The best approach is to read the provider's documentation and, if jurisdiction matters for your context, ask them directly.

For an in-depth look at the specific question of sending code to overseas APIs, see our post on whether it's safe to send code to overseas LLM APIs.

Residency vs privacy vs sovereignty

These three terms get conflated. They're related but distinct.

Data residency is the physical location question: where does data sit and get processed? Residency is often the lever teams can directly control, by choosing a provider that runs inference in a specific region or self-hosting on their own infrastructure.

Data privacy is about access control and confidentiality: who can read the data, under what circumstances, and with what protections? Privacy protections include encryption in transit, contractual data-handling commitments, and policies around whether your prompts are used for training. A provider can have data in your preferred region but still have broad access rights in its terms of service.

Data sovereignty is the governance layer: the right of a jurisdiction (or organization) to control data about its subjects or operations under its own legal framework. In the EU context this is closely tied to GDPR. In enterprise contexts, it often means the organization wants to be the final authority over data access rather than relying on contractual assurances with a foreign vendor.

Understanding which of these you need shapes what you should evaluate. Residency is often the tractable technical problem. Privacy is addressed by ToS and contractual terms. Sovereignty is harder: it usually requires your data to stay within a legal framework you control. See our guide on GDPR, data residency, and LLMs for dev teams for the EU-specific angle.

How to control where inference happens

There are four practical options, roughly ordered by how much control you get versus how much complexity you take on:

1. Provider-native region selection. Some enterprise-tier providers (Anthropic via Bedrock, OpenAI via Azure) let you pin inference to a specific geographic region. This is the lowest-complexity path if the available regions match your requirements.

2. Managed inference proxy on Western infrastructure. A proxy that serves open-weight models on infrastructure in your preferred region gives you residency control without managing GPUs. The quality of residency depends on the proxy operator's infrastructure commitments, so verify what they actually guarantee.

3. Self-hosted inference. Run the model weights on your own cloud or on-premises hardware. Highest control, highest operational cost. The right choice for teams with strict air-gap requirements or classified-data environments.

4. Model + network-layer controls. For teams where residency is only part of a larger security posture, pairing a residency-aware inference provider with egress controls, network monitoring, and DLP tooling gives defense in depth.
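Whichever option you pick, the choice is easier to enforce when it is written down as a check rather than a convention. The sketch below shows one way an application could refuse to call any inference endpoint that is not on an approved list. It assumes a hand-maintained allowlist; the hostnames, region labels, and the `check_endpoint` helper are placeholders invented for this example, and an application-level check like this complements network-level egress rules rather than replacing them.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: these hostnames and region labels are placeholders,
# not real endpoints. Fill them in from your provider's documentation.
APPROVED_ENDPOINTS = {
    "inference.eu.example.com": "eu-central",
    "inference.uk.example.com": "uk-south",
}


class ResidencyError(Exception):
    """Raised when a request would leave the approved set of inference hosts."""


def check_endpoint(base_url: str, allowed_regions: set[str]) -> str:
    """Return the region for base_url, or raise ResidencyError if it is not approved."""
    parts = urlsplit(base_url)
    host = parts.hostname or ""
    if parts.scheme != "https":
        raise ResidencyError(f"refusing non-HTTPS endpoint: {base_url}")
    region = APPROVED_ENDPOINTS.get(host)
    if region is None:
        raise ResidencyError(f"host not on the allowlist: {host or base_url}")
    if region not in allowed_regions:
        raise ResidencyError(f"{host} is in {region}, which this project does not allow")
    return region
```

Calling `check_endpoint("https://inference.eu.example.com/v1", {"eu-central"})` returns `"eu-central"` under these placeholder values, while an unlisted host raises `ResidencyError` before any prompt is sent.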

Sota's approach

Sota runs inference for GLM-5.2 and Kimi K2.7 Code on Cloudflare's global network. Current inference locations include the US (New York), the UK (London), Germany, Japan, and Australia. Requests are not routed to Z.ai's or Moonshot's native infrastructure.

Both GLM-5.2 and Kimi K2.7 Code are open-weight models with publicly available weights, but the native APIs for these models are operated by their respective developers in China. Sota serves those same model weights from Western infrastructure, so teams that want access to frontier open models can control inference location without sacrificing model quality. See our post on data sovereignty for AI coding tools for how this fits into a broader team security posture.

For teams in the EU or UK building toward GDPR or UK GDPR compliance, inference running in Germany or the UK is meaningful: it is one foundational element of a compliant data flow, though it does not establish compliance on its own. Sota's Starter plan starts at $25/month; Pro starts at $125/month.

If inference location matters to your team, the right starting point is to verify exactly where each tool in your workflow sends your prompts.
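A first pass at that verification can be as simple as listing which hosts your tools are configured to call. The sketch below reads base-URL settings from an environment mapping and reports the hostname behind each one. The variable names are assumptions: some SDKs read a base URL from variables like these, but each tool's documentation is the authority on what it actually uses, and `endpoint_inventory` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Assumed variable names; check each tool's documentation for the ones it reads.
BASE_URL_VARS = ("OPENAI_BASE_URL", "ANTHROPIC_BASE_URL")


def endpoint_inventory(env: dict[str, str]) -> dict[str, str]:
    """Map each configured base-URL variable to the hostname it points at."""
    inventory = {}
    for name in BASE_URL_VARS:
        value = env.get(name)
        if value:
            # Fall back to a marker when the value has no parseable hostname.
            inventory[name] = urlsplit(value).hostname or "<unparseable>"
    return inventory


if __name__ == "__main__":
    import os

    for name, host in endpoint_inventory(dict(os.environ)).items():
        print(f"{name} -> {host}")
```

A hostname tells you which operator receives the prompt, not where its servers sit, so the output is a list of providers whose documentation you still need to read.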

Get started with Sota to run frontier open-weight models on Cloudflare's Western network, with inference in the US, UK, Germany, Japan, or Australia.