Skip to main content

Why Text-to-SQL Breaks in the Enterprise

Samuel Nagy
Samuel Nagy
VP of Strategic Growth

Ask a question, get an answer: text-to-SQL is the most seductive demo in data. Then you point it at a real warehouse and the answers quietly go wrong. The model can write SQL. What it cannot do, on its own, is know what your data means.

The text-to-SQL promise, and where it cracks

The pitch is irresistible. Anyone in the business types a question in plain English ("what was Q2 revenue in EMEA?") and a model turns it into SQL, runs it, and returns the answer. No tickets to the data team, no waiting, no learning SQL. Every major warehouse now ships a version of this: Snowflake Cortex Analyst, Databricks Genie, and Gemini in BigQuery all promise natural-language access to your data.

And in the demo, it works beautifully. Then you connect it to a real enterprise warehouse and something unsettling happens: the answers stop being obviously wrong and start being quietly wrong. The query runs, a number comes back, it looks plausible, but it is off because the model counted churned trials as active customers, or summed a revenue column that was already net of refunds. The failure is not loud. It is a confident, well-formatted, incorrect answer, which is the most dangerous kind.

The instinct is to blame the model's SQL. That is almost never the problem. The model is excellent at SQL syntax and poor at knowing what your data means, and in the enterprise, meaning is everything.

Why it works in the demo and breaks in production

The gap between the demo and production is not subtle, and it has now been measured. The Spider 2.0 benchmark, released in 2025, evaluated leading models on realistic enterprise text-to-SQL workflows instead of the small, tidy databases that earlier benchmarks used. The result was stark: models that score around 87% on simple academic schemas solved only about 10% of real enterprise tasks. The problem everyone thought was "solved" was solved only for toy data.

What changed between the toy and the real thing? Scale and ambiguity. Enterprise schemas in that benchmark averaged hundreds of columns, with some exceeding three thousand, plus heavily abbreviated names and the business knowledge needed to use them scattered across roughly 1.5 million tokens of documentation. A demo schema has ten clean tables where every column name explains itself. Your warehouse has thousands of columns named amt_net_fx, three tables that all look like they hold orders, and a rule about what counts as revenue that exists only in a finance analyst's memory.

Even Snowflake, whose own tool performs well, says it plainly: building accurate text-to-SQL is hard, even for humans. The model is not the bottleneck. The missing context is.

Where Text-to-SQL Breaks THE MODEL WRITES SQL BUT GUESSES THE MEANING Your question"Q2 revenue in EMEA?" Model writes SQLinfers the meaning Resultlooks right, may be wrong Three things it has to guess What "revenue" meansnet of refunds? which currency? Which table to usejoins, grain, the right source What's authoritativeapproved source? sensitive data?
Click to enlarge

Failure mode 1: it doesn't know what your metrics mean

This is the big one. "Revenue" is not a column; it is a definition. Is it gross or net of refunds? Does it include intercompany transactions? Booked or recognized? "Active customer" is worse: every company draws the line differently, and the line usually involves exclusions (churned trials, internal test accounts, suspended seats) that are nowhere in the schema. When a model meets the question "how many active customers do we have?", it sees a customers table and a status column and makes a reasonable, confident, wrong guess.

This is exactly why the accurate text-to-SQL systems all lean on a semantic layer. Snowflake credits Cortex Analyst's 90%+ accuracy and its roughly 2x edge over single-prompt LLMs to the semantic model (the curated description of measures, filters and joins), not to the raw model. Databricks Genie reads the semantic metadata in metric views for the same reason. The intelligence that fixes this failure mode is not a better model; it is a definition the model can read.

"The model isn't bad at SQL. It's blind to your business."

Failure mode 2: it picks or joins the wrong table

Even when the model understands the metric, it still has to find the data. In a real warehouse there are usually several candidates: a raw ingestion table, a staging copy, a deprecated legacy table left in place "just in case," and a certified mart that finance actually trusts. They have similar names and overlapping columns. Nothing in the schema tells the model which one is correct, so it picks the one whose name matches the question most literally, which is frequently the wrong one.

Joins compound the risk. Choosing the wrong key, joining at the wrong grain, or fanning out a one-to-many relationship silently double-counts rows and inflates every aggregate downstream. The query is valid SQL and the number is wrong. This is why a data catalog and lineage matter for AI accuracy and not just for governance: they record which assets exist, which are certified, and how they relate: the map the model otherwise has to reconstruct by guessing.

Failure mode 3: it can't tell what's authoritative or sensitive

The third failure mode is the one that turns a wrong answer into a real incident. A text-to-SQL tool, left to its own devices, has no concept of authority or sensitivity. It does not know that the certified revenue mart is the official source and the analyst's sandbox table is not. And it does not know that a column contains salary, health or other sensitive data that this particular user is not allowed to see.

So it will happily answer "show me everyone's compensation" if the data is reachable, because nothing told it not to. Accuracy and governance are the same problem here: the tool needs to know which source is authoritative and which data is restricted, and both of those facts live in classification and policy, not in the schema. Without that context, a capable assistant becomes a fast, friendly way to leak data or report unofficial numbers as fact.

The fix: a governed semantic layer the tool reads

The pattern across all three failure modes is identical: the model is forced to guess something that should have been told to it. The fix, therefore, is not a smarter model; it is to stop making it guess. That means putting the meaning, the right sources, and the rules into a governed layer the tool reads before it writes a line of SQL.

Concretely, that layer has four parts. A business glossary defines each metric and entity once, authoritatively, with its edge cases, fixing failure mode 1. A catalog records what exists and which assets are certified, and lineage records how they relate and where they came from, fixing failure mode 2. And classification with access policy marks what is authoritative and what is sensitive, fixing failure mode 3. Authored once, by the people who own the data, this is the same semantic layer the warehouse-native tools already prove the value of, just governed centrally instead of re-created inside each tool. If you want the deeper comparison of approaches, see RAG vs. semantic layer.

Ungoverned Guessing vs Governed Context STOP MAKING THE MODEL GUESS UNGOVERNED Your question Model guesses the meaning infers metrics, joins, sources Answer: unverified GOVERNED · OVER MCP Your question Model reads governed context glossary · catalog · lineage · policy Answer: traceable
Click to enlarge

Delivering that context over MCP

Governing context in one place only helps if your AI tools can actually read it. If every tool keeps its own private semantic model, you are back to defining "revenue" five times in five places and watching them drift apart. The answer is a single, open delivery channel: the Model Context Protocol (MCP).

Through an MCP Server, your glossary, catalog, lineage and classification become one endpoint that any compatible assistant or agent can query. You govern the meaning once and serve it everywhere: a warehouse copilot, a BI assistant and a custom agent all draw on the same governed definitions and the same policies. The next AI tool you adopt becomes one more consumer of trusted context rather than one more place that guesses on its own.

Making Cortex, Genie & Gemini more accurate with Dawiso

None of this is an argument against warehouse-native text-to-SQL. Cortex Analyst, Genie and Gemini are genuinely good inside their platforms, precisely because they are built to consume a semantic model. The limitation is scope: each one governs context only for its own platform, so an organization running more than one ends up maintaining the same definitions in several places, with no shared source of truth and no guarantee they agree.

This is the gap Dawiso fills. It assembles the governed foundation these tools depend on: a Data Catalog of what exists, a Business Glossary of what each metric means, Interactive Lineage of where data came from, and classification of what is sensitive, all across more than 40 platforms, with AI assistance to draft descriptions and classification for human review. Through the Context Layer and its MCP Server, Dawiso serves that one governed source to any MCP-compatible tool, so Cortex, Genie, Gemini and your own agents all write SQL against the same meaning. For a concrete pairing, see how Snowflake Cortex agents and semantic views are governed by Dawiso. Text-to-SQL does not break because the model can't write SQL. It breaks when the model has to guess what your data means. So give it the answer, once, and govern it.

FAQ

Why does text-to-SQL fail on enterprise data when it works in demos?
Demos run against small, clean, well-named schemas where the meaning of every column is obvious. Enterprise warehouses are the opposite: hundreds or thousands of columns, abbreviated and ambiguous names, multiple tables that look like they answer the same question, and business rules that live in people's heads rather than in the schema. The model is just as good at writing SQL syntax in both cases. What changes is how much it has to guess about meaning, and in the enterprise, it has to guess about almost everything. The Spider 2.0 benchmark captured this gap precisely: top models that score around 87% on simple academic schemas solve only about 10% of real enterprise text-to-SQL tasks.
Are Snowflake Cortex Analyst and Databricks Genie accurate enough for production?
Inside their own platforms, and when given a good semantic model, yes. Snowflake reports that Cortex Analyst reaches 90%+ accuracy on real-world business intelligence questions and is roughly twice as accurate as a single-prompt LLM, but it credits the semantic model (a curated description of metrics, joins and filters) for that accuracy, not the raw model. Databricks Genie behaves the same way: it reads the semantic metadata in metric views to produce governed answers. The pattern is consistent across vendors: the tool is excellent when it is fed governed business context, and unreliable when it has to infer that context on its own.
What is a semantic layer and how does it improve text-to-SQL accuracy?
A semantic layer is a curated description of your data in business terms: what each metric means, how it is calculated, which tables and joins are correct, and which source is authoritative. When a text-to-SQL tool reads from a semantic layer instead of inferring everything from raw column names, the largest source of error, guessing the meaning, disappears. That is why every accurate enterprise text-to-SQL system, including the warehouse-native ones, is built on top of a semantic model rather than the bare schema.
How does a business glossary reduce text-to-SQL errors?
A business glossary is where you define each metric and entity once, authoritatively: what "active customer," "net revenue," or "qualified lead" means, including the edge cases. When that glossary is connected to the underlying tables and served to the text-to-SQL tool, the tool no longer has to reconstruct your definitions from column names. It applies the approved definition instead of a plausible guess, which is the difference between a number you can act on and one that is confidently wrong.
How does Dawiso make text-to-SQL tools more accurate?
Dawiso assembles the governed context these tools depend on: a data catalog of what exists, a business glossary of what each metric means, interactive lineage of where data came from, and classification of what is sensitive, all across more than 40 platforms. Through its Context Layer and MCP Server, Dawiso serves that governed context to any MCP-compatible assistant or agent. So instead of every warehouse copilot reinventing your definitions, Cortex, Genie, Gemini and your custom agents all read from the same trusted source of meaning.

See it in action

Dawiso Business Glossary

Define every metric once, govern it, and serve it to the AI tools that write your SQL.