DSPy on a Pi: Cheap Prompt Optimization with GEPA and Qwen3
Case study: chat-to-SQL
Let’s query a dataset of machine learning publications in natural language by translating that (with guardrails) to SQL.
We can improve a baseline of 7.3% to 28.5%, on a Raspberry Pi in under 16 hours:
Using: qwen3 0.6b/4b, a Pi, synthetic data, DSPy, a starter prompt, and GEPA
The basic idea:
- Find a (tiny) model to use on a tiny computer in production.
- Organize the problem into a task description, inputs, and outputs.
- Generate some training data (inputs: chat queries, outputs: ideal SQL responses).
- Run the training data inputs through the production model and score the resulting SQL against the expected answers with context.
- Find a larger model for training.
- Ask the midsize model to reflect on the contextualized scores, and to rewrite prompts, and to keep prompts that work better on some training data subsets, and to merge high-performing prompts (GEPA).
One sweep through the data (GEPA will usually run 6-18 full evaluations of the data instead of only 1) transforms our starter prompt:
```
Translate natural language queries to SQL for the paper_authorships table. There is one row for each paper-author pair: a paper with N authors has N corresponding rows. This is a read-only database; all updates, inserts, and deletes should be rejected with `select 'database is read-only'`. Reject inappropriate queries with `select 'query violates content policy'`. The paper_authorships table has the following schema:
- Conference: str (NeurIPS, ICML, or ICLR)
- Year: int
- Title: str
- Author: str
- Affiliation: str
```
which has a 7.3% success rate into a 28.5% success rate with
```
You are an SQL translator for a read-only database. Your task is to translate natural language queries into valid SQL queries for the `paper_authorships` table.

Critical constraints:
- The database is read-only. If the natural language query requires any update, insert, or delete operation, you must reject it with: `SELECT 'database is read-only'`.
- If the natural language query violates content policy (e.g., requests harmful actions, contains inappropriate content, or exceeds database constraints), you must reject it with: `SELECT 'query violates content policy'`.
- The `paper_authorships` table has the following schema with strict constraints:
  - Conference: string (must be one of: 'NeurIPS', 'ICML', 'ICLR')
  - Year: integer (positive value only)
  - Title: string
  - Author: string
  - Affiliation: string

Step-by-step process:
- Identify operation type:
  - If the query implies `UPDATE`, `INSERT`, or `DELETE` → output `SELECT 'database is read-only'`
  - If the query implies harmful content (e.g., data deletion, unauthorized access) → output `SELECT 'query violates content policy'`
- Generate valid SQL only for read operations:
  - Use `SELECT` statements exclusively
  - Ensure all column names match the schema exactly (e.g., `Author`, not `author`)
  - Include explicit `WHERE` clauses only when necessary (never omit conditions)
  - Never use `PARTITION BY`, `WHERE` with complex conditions, or incomplete statements (e.g., trailing `WHERE ...` without conditions)
- Validate syntax:
  - Always complete SQL statements (e.g., `UPDATE` must have full `WHERE` clause, `SELECT` must have `FROM` and `WHERE` if needed)
  - Avoid ambiguous syntax (e.g., no trailing commas, correct use of `ORDER BY`/`LIMIT`)

Key domain knowledge to enforce:
- Conference values are strictly limited to `'NeurIPS'`, `'ICML'`, or `'ICLR'` (no other values allowed)
- `Year` must be a positive integer (no negative years or nulls)
- The database has no write permissions – all queries must be read-only

Examples:
- Input: "Show the top 10 most prolific authors sorted by paper count" → Output: `SELECT Author, COUNT(*) AS paper_count FROM paper_authorships GROUP BY Author ORDER BY paper_count DESC LIMIT 10;`
- Input: "Alter the table to add an index on the Year column" → Output: `SELECT 'database is read-only'` (requires DDL operation)
- Input: "Update every year to be one greater than written" → Output: `SELECT 'database is read-only'` (requires update)
- Input: "Show papers from NeurIPS in 2023" → Output: `SELECT Title, Author FROM paper_authorships WHERE Conference = 'NeurIPS' AND Year = 2023;`
- Input: "Delete all papers from ICML" → Output: `SELECT 'database is read-only'` (requires delete)
- Input: "Show authors with affiliation 'MIT'" → Output: `SELECT Author FROM paper_authorships WHERE Affiliation = 'MIT';`

Do not generate:
- Any `UPDATE`, `INSERT`, or `DELETE` statements
- Incomplete SQL (e.g., `WHERE ...` without conditions)
- Queries with invalid conference values (e.g., `'ICLR'` vs `'ICL'`)
- Queries that use non-integer years
```
as saved in the dspy-sql-chat repo.
Find a tiny model to use on a tiny computer
We can use almost any size model we like; the larger it is, the more accurate it will be out of the box on our chat-to-SQL problem.
With an aggressively small memory and compute budget (a Raspberry Pi 5) and a slow disk (a MicroSD card), we want a small model that can fit in memory.
The Qwen3 family of models has a 400× range of parameter counts. We can start with the smallest, Qwen3 0.6B, evaluate its performance, and go from there.
Its larger variant Qwen3 4B Thinking 2507 (which we will use for optimization) scores over 60% as a baseline, and readily optimizes above 85% with a few full evals across the dataset.
Specify the task in DSPy
DSPy is designed to program LLMs, not prompt LLMs.
A string prompt entangles a task definition (“please translate this natural query to sql”) with the wording of the task optimized for one particular LLM, inference-time strategies (like “Let’s think step by step”), input formatting, output formatting, and more.
The DSPy Signature that produced the prompt above is the following Python class:
```python
class TextToSQL(dspy.Signature):
    """Translate natural language queries to SQL for the paper_authorships table.
    There is one row for each paper-author pair: a paper with N authors has N corresponding rows.
    This is a read-only database; all updates, inserts, and deletes should be rejected with `select 'database is read-only'`.
    Reject inappropriate queries with `select 'query violates content policy'`.
    The paper_authorships table has the following schema:
    - Conference: str (NeurIPS, ICML, or ICLR)
    - Year: int
    - Title: str
    - Author: str
    - Affiliation: str
    """
    natural_language_query: str = dspy.InputField(desc="A natural language question about the paper_authorships table")
    sql_query: str = dspy.OutputField(desc="A clean SQL query that answers the question. Output only the SQL query itself, with no markdown formatting, no code blocks, no extra text, and no completion markers. Do not include semicolons at the end. The query should be a single valid SQL statement ready to execute.")
```
The docstring defines the task (in a basic manner at first!), the class variables specify the inputs and outputs, and this separation allows us to optimize the task definition.
Synthetic training data
We want to optimize this chat-to-SQL task, and LLMs know both natural language and SQL. We mechanically generate several dozen sample queries that execute relatively quickly against our 170K-row dataset, plus some forbidden update queries and some forbidden content-policy-violating queries.
Prompting a larger model for this is a reasonable approach, and manual review of the synthetic data provides another check on quality. We use a larger model like gpt-oss:120b to generate this data, and now we have a first attempt at training data.
Scoring with context
There are a lot of ways to generate SQL to solve a query! We evaluate the query with a time limit, and check whether the rows and columns match in any order. If the query times out, or if the rows/columns differ, we surface the two queries and a sample of the differing results as feedback attached to the 0% score for that training data point.
Matching in any order rewards solutions that get oddly specific orderings wrong (select the top 10 X, ordered by ascending X), but practical chat-to-SQL prioritizes many other specifics, like joins.
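A sketch of such a scorer using only the standard library: it compares result multisets (row order ignored) and builds the feedback string. The SQLite step budget stands in for a wall-clock time limit, and all names here are assumptions rather than the repo's actual code:

```python
import sqlite3
from collections import Counter

def execute_rows(db_path, sql, max_steps=1_000_000):
    """Run a query and return its rows, or None on error or step-budget abort."""
    con = sqlite3.connect(db_path)
    try:
        # A nonzero return from the progress handler aborts long-running queries.
        con.set_progress_handler(lambda: 1, max_steps)
        return con.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    finally:
        con.close()

def score_with_feedback(db_path, gold_sql, pred_sql):
    """Return (score, feedback): 1.0 if result multisets match, else 0.0 with context."""
    gold = execute_rows(db_path, gold_sql)
    pred = execute_rows(db_path, pred_sql)
    if gold is None:
        return 0.0, f"Gold query failed: {gold_sql!r}"
    if pred is None:
        return 0.0, f"Query failed or exceeded the step budget: {pred_sql!r}"
    if Counter(gold) == Counter(pred):
        return 1.0, "Results match."
    return 0.0, (f"Expected {gold_sql!r} giving rows like {gold[:3]}, "
                 f"got {pred_sql!r} giving rows like {pred[:3]}")
```

Note this sketch does not enforce read-only on its own; in training, the expected answer for a write request is simply the literal `select 'database is read-only'` row.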
Find a midsize dense (non-MoE) model for training
We can use the same model, or a different one, to train the prompt on the errors from the smaller model.
We want a dense model instead of a mixture-of-experts model. A mixture model activates only one or a few experts out of a large suite, leaving most of the model unused; with memory bandwidth at 17GB/s and MicroSD bandwidth at 104MB/s, we ideally want a high-performance model that fits entirely in memory and makes the most of the memory it uses.
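The gap is worth making concrete with a rough ratio of the two bandwidths quoted above:

```python
# Back-of-envelope: RAM vs. MicroSD bandwidth on a Raspberry Pi 5.
ram_bw_mb_s = 17 * 1000   # ~17 GB/s memory bandwidth
sd_bw_mb_s = 104          # ~104 MB/s MicroSD bandwidth
speedup = ram_bw_mb_s / sd_bw_mb_s
print(f"RAM is ~{speedup:.0f}x faster")  # paging expert weights from SD is ruinous
```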
We have many options for a Raspberry Pi 5 deployment, and we use the small and mighty model Qwen3 4B Thinking 2507, which at 4-bit precision is only 2.5GB, comfortably fitting on a variety of Pis, especially with Qwen3 0.6B in memory as well. We start small and measure its performance.
We could use Qwen3 0.6B for prompt optimization, but it does not improve the prompt much even after a significant number of iterations.
Ask the LLM to evolve the prompt based on the Pareto frontier of prompts
With our fancy scoring system, we provide rich feedback about what our small model is doing wrong. This feedback is useful to an LLM, when presented with a pile of prompts and mistakes and a mission to improve based on the mistakes.
DSPy has GEPA built in, with convenient tutorials!
A light rollout in DSPy is 6 full evals of the training data. Even with just one full eval, we are able to improve accuracy almost four-fold.
Let me know what you think!
Future updates include: a friendly UI to demonstrate the efficacy of this optimization (parallel queries?), trying larger refinement models, and optimizing for different sub-10B-parameter models with higher starting performance. Give a shout!
