Open Models, Western Infrastructure: Frontier Open LLMs Without Off-shoring Your Source

You can run frontier open-weight coding models on Western infrastructure without managing GPUs. Model quality and inference location are no longer in conflict: the capability you'd get from a direct API and the compliance posture you need can coexist. This post explains why, and what the practical setup looks like.

The best of both worlds

The conventional framing presents a tradeoff: use a frontier closed model (Anthropic, OpenAI) on Western infrastructure, or use an open model with strong coding capability knowing the native API runs overseas. Teams with data residency requirements have been stuck between capability and compliance.

That tradeoff is a product of confusing model provenance with inference location. Where a model was developed, and where inference runs when you use it via API, are separable questions. Open-weight models break the link between the two.

GLM-5.2 was developed by Z.ai, a Chinese AI company. Kimi K2.7 Code was developed by Moonshot AI, also Chinese. Their native APIs run inference in China. Both models have open weights, though: the trained parameters are publicly available, which means anyone can deploy them. Including on Cloudflare's global network, with points of presence in US (New York), UK (London), Germany (Frankfurt), Japan (Tokyo), and Australia (Sydney).

Why open weights enable infrastructure choice

Closed models (Claude, GPT-4, Gemini) are available only through their developer's API. There is no path to running Claude on your own infrastructure or on a third-party provider's servers. If you use Claude, your prompts go to Anthropic's infrastructure. That's a deliberate business and policy choice by Anthropic, separate from any technical constraint.

Open-weight models work differently. The weights are public, and the model architecture is documented. Any sufficiently capable operator can load the weights, run a serving framework, and expose an OpenAI-compatible API endpoint. Same weights, same architecture, same outputs for equivalent inputs: the model behavior is identical regardless of who runs the inference.

This decoupling is what makes the "Western infrastructure" option viable for open models at all. The ability to choose an inference location is the natural consequence of open-weight model design. The model developer publishes the weights, and the inference operator chooses where to run them.

The consequence for data residency: open models let you choose where inference runs. Closed models don't. For teams where inference jurisdiction matters, open models are the only realistic path to frontier capability with infrastructure choice.

What Western-hosted inference gives you

Running open-model inference on Western infrastructure provides three specific properties that matter for data governance:

Residency. Your prompts are processed in data centers governed by Western legal frameworks: EU/EEA, US, UK, or Australia, depending on which node handles the request. The data never routes through infrastructure in other jurisdictions.

Jurisdiction. The legal framework that applies to the processing of your data is Western law. For EU teams, that means GDPR protections and the legal infrastructure around international data transfers. For US teams, that means operating under US law rather than the law of countries with materially different government access frameworks.

Regional latency. Inference on a geographically distributed network with PoPs in your region produces lower latency than routing to a provider's infrastructure on another continent. A developer in London hitting a UK PoP gets faster responses than the same developer routing to a US-only endpoint, and substantially faster than routing to Asia.

None of these properties are sufficient on their own for a complete compliance posture. You also need appropriate contractual terms, clear data handling policies, and logging controls. Residency and jurisdiction are prerequisites, though, and managed Western inference satisfies them at the infrastructure layer.

How Sota does it

Sota runs GLM-5.2 and Kimi K2.7 Code on Cloudflare's global network. Current inference locations: US (New York), UK (London), Germany, Japan, and Australia. Requests sent to Sota's endpoint are processed by Cloudflare's infrastructure; they are not forwarded to Z.ai's or Moonshot's native APIs.

The setup from a developer's perspective is an OpenAI-compatible API endpoint. You configure ANTHROPIC_BASE_URL (or the equivalent for your tool) to point at Sota's proxy, and your existing tooling works without modification. Claude Code, LangChain, LiteLLM, and any other framework that speaks the OpenAI chat completions format will work.

GLM-5.2 is Sota's default model, a strong general-purpose open-weight model that handles the everyday coding workload well across mainstream languages. Kimi K2.7 Code is the specialized option for long agentic loops and large codebase analysis, with a very large context window optimized for software engineering tasks.

Sota's pricing is flat-rate per user: Starter plan at $25/month, Pro plan at $125/month. This matters for volume use cases, because the per-token billing structure from direct provider APIs becomes significant at scale, and flat-rate pricing makes budget planning tractable.

Capability vs the closed frontier

The honest comparison: GLM-5.2 and Kimi K2.7 Code are competitive with closed frontier models on the everyday coding tasks that make up the majority of developer time. Refactoring, test generation, documentation, function generation, migration scripts, code explanation: on all of these, a strong open model produces output that's directly useful without further editing.

The cases where Claude (particularly Opus) maintains a clear edge are harder reasoning tasks: complex multi-file architectural refactors where instruction-following precision matters throughout, deeply ambiguous requirements that need sophisticated synthesis, cross-domain reasoning that integrates code with business context. These are real advantages, but they're not the majority of what coding tools are used for.

For teams where data residency is a hard requirement, the relevant question is whether GLM-5.2 is capable enough for the actual workload, and whether inference location matters more than the capability gap on the hardest tasks. For many teams, the answer is yes.

For a direct model comparison, see our post on the best open-source coding models in 2026. For how data sovereignty fits into a team security posture, see data sovereignty for AI coding tools. For the Claude Code workflow specifically, see our guide on Claude Code alternatives using open models.

Get started

The path from "using a direct provider API" to "using the same model on Western infrastructure" is a configuration change, not a migration. Two environment variables, five minutes, no infrastructure to manage.

Get started with Sota to run GLM-5.2 and Kimi K2.7 Code on Cloudflare's Western network. Inference in the US, UK, Germany, Japan, or Australia. OpenAI-compatible API. Flat-rate pricing.