Case study: chat-to-SQL

Let’s query a dataset of machine learning publications in natural language by translating that (with guardrails) to SQL.

We can improve a baseline success rate from 7.3% to 28.5%, on a Raspberry Pi in under 16 hours:

(Prompt optimization with DSPy and GEPA on a Raspberry Pi, indoors on a cloudy day and night.)

Using: Qwen3 0.6B/4B, a Pi, synthetic data, DSPy, a starter prompt, and GEPA.

The basic idea:

  1. Find a (tiny) model to use on a tiny computer in production.
  2. Organize the problem into a task description, inputs, and outputs.
  3. Generate some training data (inputs: chat queries, outputs: ideal SQL responses).
  4. Run the training data inputs through the production model and score the resulting SQL against the expected answers with context.
  5. Find a larger model for training.
  6. Ask the midsize model to reflect on the contextualized scores, rewrite prompts, keep prompts that work better on some subsets of the training data, and merge high-performing prompts (GEPA).

One sweep through the data (GEPA will usually run 6-18 full evaluations of the data, rather than only one) transforms our starter prompt

Translate natural language queries to SQL for the paper_authorships table. There is one row for each paper-author pair: a paper with N authors has N corresponding rows. This is a read-only database; all updates, inserts, and deletes should be rejected with select 'database is read-only'. Reject inappropriate queries with select 'query violates content policy'.

The paper_authorships table has the following schema:

  • Conference: str (NeurIPS, ICML, or ICLR)
  • Year: int
  • Title: str
  • Author: str
  • Affiliation: str

which has a 7.3% success rate, into this prompt with a 28.5% success rate:

You are an SQL translator for a read-only database. Your task is to translate natural language queries into valid SQL queries for the paper_authorships table.

Critical constraints:

  1. The database is read-only. If the natural language query requires any update, insert, or delete operation, you must reject it with: SELECT 'database is read-only'.
  2. If the natural language query violates content policy (e.g., requests harmful actions, contains inappropriate content, or exceeds database constraints), you must reject it with: SELECT 'query violates content policy'.
  3. The paper_authorships table has the following schema with strict constraints:
    • Conference: string (must be one of: 'NeurIPS', 'ICML', 'ICLR')
    • Year: integer (positive value only)
    • Title: string
    • Author: string
    • Affiliation: string

Step-by-step process:

  1. Identify operation type:
    • If the query implies UPDATE, INSERT, or DELETE → output SELECT 'database is read-only'
    • If the query implies harmful content (e.g., data deletion, unauthorized access) → output SELECT 'query violates content policy'
  2. Generate valid SQL only for read operations:
    • Use SELECT statements exclusively
    • Ensure all column names match the schema exactly (e.g., Author, not author)
    • Include explicit WHERE clauses only when necessary (never omit conditions)
    • Never use PARTITION BY, WHERE with complex conditions, or incomplete statements (e.g., trailing WHERE ... without conditions)
  3. Validate syntax:
    • Always complete SQL statements (e.g., UPDATE must have full WHERE clause, SELECT must have FROM and WHERE if needed)
    • Avoid ambiguous syntax (e.g., no trailing commas, correct use of ORDER BY/LIMIT)

Key domain knowledge to enforce:

  • Conference values are strictly limited to 'NeurIPS', 'ICML', or 'ICLR' (no other values allowed)
  • Year must be a positive integer (no negative years or nulls)
  • The database has no write permissions – all queries must be read-only

Examples:

  • Input: "Show the top 10 most prolific authors sorted by paper count" → Output: SELECT Author, COUNT(*) AS paper_count FROM paper_authorships GROUP BY Author ORDER BY paper_count DESC LIMIT 10;
  • Input: "Alter the table to add an index on the Year column" → Output: SELECT 'database is read-only' (requires DDL operation)
  • Input: "Update every year to be one greater than written" → Output: SELECT 'database is read-only' (requires update)
  • Input: "Show papers from NeurIPS in 2023" → Output: SELECT Title, Author FROM paper_authorships WHERE Conference = 'NeurIPS' AND Year = 2023;
  • Input: "Delete all papers from ICML" → Output: SELECT 'database is read-only' (requires delete)
  • Input: "Show authors with affiliation 'MIT'" → Output: SELECT Author FROM paper_authorships WHERE Affiliation = 'MIT';

Do not generate:

  • Any UPDATE, INSERT, or DELETE statements
  • Incomplete SQL (e.g., WHERE ... without conditions)
  • Queries with invalid conference values (e.g., 'ICLR' vs 'ICL')
  • Queries that use non-integer years

The optimized prompt is saved in the dspy-sql-chat repo.

Find a tiny model to use on a tiny computer

We can use almost any size of model we like. The larger it is, the more accurate it will be out of the box on our chat-to-SQL problem.

With an aggressively small memory and compute budget (a Raspberry Pi 5) and a slow disk (a MicroSD card), we want a small model that fits entirely in memory.

The Qwen3 family of models has a 400× range of parameter counts. We can start with the smallest, Qwen3 0.6B, evaluate its performance, and go from there.

Its larger variant Qwen3 4B Thinking 2507 (which we will use for optimization) scores over 60% as a baseline, and readily optimizes above 85% with a few full evals across the dataset.

Specify the task in DSPy

DSPy is designed to program LLMs, not prompt LLMs.

A string prompt entangles a task definition (“please translate this natural query to SQL”) with the wording of the task optimized for one particular LLM, inference-time strategies (like “Let’s think step by step”), input formatting, output formatting, and more.

The DSPy Signature that produced the prompt above is the following Python class:

class TextToSQL(dspy.Signature):
    """Translate natural language queries to SQL for the paper_authorships table.
    There is one row for each paper-author pair: a paper with N authors has N corresponding rows.
    This is a read-only database; all updates, inserts, and deletes should be rejected with `select 'database is read-only'`.
    Reject inappropriate queries with `select 'query violates content policy'`.

    The paper_authorships table has the following schema:
    - Conference: str (NeurIPS, ICML, or ICLR)
    - Year: int
    - Title: str
    - Author: str
    - Affiliation: str
    """

    natural_language_query: str = dspy.InputField(desc="A natural language question about the paper_authorships table")
    sql_query: str = dspy.OutputField(desc="A clean SQL query that answers the question. Output only the SQL query itself, with no markdown formatting, no code blocks, no extra text, and no completion markers. Do not include semicolons at the end. The query should be a single valid SQL statement ready to execute.")

The docstring defines the task (in a basic manner at first!), the class variables specify the inputs and outputs, and this separation allows us to optimize the task definition.
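As a quick check that the pieces fit together, here is a minimal sketch that runs this signature against Qwen3 0.6B; the Ollama model tag and localhost endpoint are assumptions about the local setup:

import dspy

# Assumes an Ollama server on the Pi serving the qwen3:0.6b tag.
lm = dspy.LM("ollama_chat/qwen3:0.6b", api_base="http://localhost:11434")
dspy.configure(lm=lm)

translate = dspy.Predict(TextToSQL)
result = translate(natural_language_query="Show papers from NeurIPS in 2023")
print(result.sql_query)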

Synthetic training data

We want to optimize this chat-to-SQL task, and LLMs know both natural language and SQL. We mechanically generate several dozen sample queries that execute relatively quickly against our 170K-row dataset, plus some forbidden update queries and some forbidden content-policy-violating queries.

Prompting a larger model for this is a reasonable approach, and manual review of the synthetic data provides another check on quality. We use gpt-oss:120b to generate this data, giving us a first attempt at training data.
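A sketch of that generation step, using a second DSPy signature; the GenerateExample signature, the per-kind counts, and the serving details are illustrative assumptions:

import dspy

class GenerateExample(dspy.Signature):
    """Write one natural language question about the paper_authorships table,
    plus the single SQL query that correctly answers (or correctly rejects) it."""

    kind: str = dspy.InputField(desc="one of: read, forbidden-write, policy-violation")
    natural_language_query: str = dspy.OutputField()
    sql_query: str = dspy.OutputField()

# Assumes gpt-oss:120b is reachable via an Ollama server on a bigger machine.
gen_lm = dspy.LM("ollama_chat/gpt-oss:120b", api_base="http://bigbox:11434", temperature=1.0)
generate = dspy.Predict(GenerateExample)

trainset = []
with dspy.context(lm=gen_lm):
    # Counts here are arbitrary; in practice, vary topic hints for diversity.
    for kind in ["read"] * 40 + ["forbidden-write"] * 10 + ["policy-violation"] * 10:
        ex = generate(kind=kind)
        trainset.append(
            dspy.Example(
                natural_language_query=ex.natural_language_query,
                sql_query=ex.sql_query,
            ).with_inputs("natural_language_query")
        )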

Scoring with context

There are a lot of ways to write SQL that solves a query! We execute the generated query with a time limit and check whether the rows and columns match the expected result in any order. If the query times out, or the rows/columns differ, we surface both queries and a sample of the differing results in the feedback attached to that training data point’s score of 0%.

This order-insensitive comparison rewards incorrect solutions to oddly specific ordering requests (“select the top 10 X, ordered by ascending X”), but practical chat-to-SQL prioritizes many other specifics, like joins.
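Here is a minimal sketch of such a metric, assuming the dataset lives in a SQLite file (papers.db is a hypothetical path), that gold queries always execute, and simplifying the comparison to rows-in-any-order; GEPA-compatible metrics in DSPy can return a score together with natural language feedback:

import sqlite3
import time

import dspy

DB_PATH = "papers.db"  # hypothetical path to the 170K-row dataset

def run_sql(query, timeout_s=2.0):
    """Execute a query, aborting it once a wall-clock deadline passes."""
    conn = sqlite3.connect(DB_PATH)
    deadline = time.monotonic() + timeout_s
    # The handler fires every 10k VM steps; returning True aborts the query.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 10_000)
    try:
        return conn.execute(query).fetchall(), None
    except sqlite3.Error as e:
        return None, str(e)
    finally:
        conn.close()

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score one training example; on failure, return contextual feedback for GEPA."""
    want, _ = run_sql(gold.sql_query)  # assumes gold queries always execute
    got, err = run_sql(pred.sql_query)
    if err is None and sorted(map(repr, got)) == sorted(map(repr, want)):
        return dspy.Prediction(score=1.0, feedback="Results match.")
    # On failure, surface both queries and a sample of the differing results.
    feedback = f"Expected SQL: {gold.sql_query}\nGot SQL: {pred.sql_query}\n"
    if err is not None:
        feedback += f"Execution error or timeout: {err}"
    else:
        feedback += f"Expected rows (sample): {want[:3]}\nGot rows (sample): {got[:3]}"
    return dspy.Prediction(score=0.0, feedback=feedback)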

Find a midsize dense (non-MoE) model for training

We can use the same model, or a different one, for training the prompt on the errors from the smaller model.

We want a dense model instead of a mixture-of-experts model. A mixture model chooses one or a few experts out of a large suite of experts, leaving most of the model unused. The Pi 5’s memory bandwidth is about 17 GB/s while MicroSD bandwidth is about 104 MB/s, so ideally we find a high-performance model that fits entirely in memory and makes the most of its memory usage, as the arithmetic below shows.
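Some back-of-envelope arithmetic makes the case (bandwidth figures are approximate):

# Rough decode-speed bounds for a dense model that reads every weight per token.
model_bytes = 2.5e9   # Qwen3 4B at 4-bit precision, ~2.5 GB
ram_bw = 17e9         # Pi 5 memory bandwidth, ~17 GB/s
sd_bw = 104e6         # UHS-I MicroSD bandwidth, ~104 MB/s

print(ram_bw / model_bytes)  # ~6.8 tokens/s upper bound if the model fits in RAM
print(sd_bw / model_bytes)   # ~0.04 tokens/s (~24 s/token) if paging from MicroSD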

We have many options for a Raspberry Pi 5 deployment, and we use the small and mighty Qwen3 4B Thinking 2507, which at 4-bit precision is only 2.5 GB, comfortably fitting on a variety of Pis even with Qwen3 0.6B in memory as well. We start small and measure its performance.

We could use Qwen3 0.6B for the prompt optimization itself, but it does not improve the prompt much, even after a significant number of iterations.

Ask the LLM to evolve the prompt based on the Pareto frontier of prompts

With our fancy scoring system, we provide rich feedback about what our small model is doing wrong. This feedback is useful to an LLM presented with a pile of prompts and mistakes and a mission to improve on those mistakes.

Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts.

DSPy has GEPA built in, with convenient tutorials!

A “light” GEPA run in DSPy performs 6 full evals of the training data. Even with just one full eval, we are able to improve accuracy almost four-fold.
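Putting the pieces together, here is a sketch of the GEPA run; the Ollama tag for the reflection model is an assumption, and reusing the trainset as the valset is a shortcut for a small dataset:

import dspy

# Qwen3 4B Thinking 2507 does the reflecting; Qwen3 0.6B stays the task model.
reflection_lm = dspy.LM("ollama_chat/qwen3:4b-thinking-2507", api_base="http://localhost:11434")

optimizer = dspy.GEPA(
    metric=metric,            # the feedback-returning metric from above
    auto="light",             # roughly 6 full evaluations of the data
    reflection_lm=reflection_lm,
)
optimized = optimizer.compile(
    dspy.Predict(TextToSQL),
    trainset=trainset,
    valset=trainset,
)
print(optimized.signature.instructions)  # the evolved prompt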

Let me know what you think!

Future updates include: making a friendly UI to demonstrate the efficacy of this optimization (parallel queries?), trying larger refinement models, and optimizing for different sub-10B-parameter models with higher starting performance. Give a shout!