
t2sql-framework

An explicit-stage text-to-SQL pipeline with tracing, evaluation harness, and a data-driven iteration loop.


Conceptual Goal

Current text-to-SQL libraries wrap everything in an agent loop (see https://github.com/Text2SqlAgent/text2sql-framework, accessed 01.04.2026). This project takes the opposite approach: four hard stage boundaries enforced in code, so every failure is isolable, every stage is independently testable, and token budgets are deterministic. The design was derived by systematically assessing these capabilities in a scientific review and updating them for modern LLM infrastructure.

This costs flexibility but buys observability. If observability matters less than shipping speed, the text2sql-framework agent loop is a legitimate alternative; adopting it is postponed for now.

The pipeline flow is:

question
  → [stage 2] schema linking — retrieve relevant DDL + FK hints
  → [stage 3] generation    — assemble prompt + call LLM + extract SQL
  → [stage 4] refinement    — execute → if error, correct (up to 3 rounds)
  → (sql, dataframe, trace)
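The flow above could be wired as in this minimal sketch. The function names and signatures are illustrative assumptions, not the actual orchestrator.py API:

```python
def ask(question: str, link_schema, generate, refine):
    """Run stages 2 -> 3 -> 4 in sequence; every stage appends to the same trace."""
    trace: list[dict] = []
    ddl = link_schema(question, trace)           # stage 2: relevant DDL + FK hints
    sql = generate(question, ddl, trace)         # stage 3: prompt + LLM call + SQL extraction
    sql, df = refine(sql, trace, max_rounds=3)   # stage 4: execute, correct on error
    return sql, df, trace
```

Because each stage is a plain callable, any one of them can be replaced by a stub in a unit test, which is the point of the hard boundaries.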

Every ask() call returns a structured QueryTrace. Accumulated traces feed AnalysisEngine, which surfaces retrieval noise, synonym mismatches, and schema gaps — the evidence base for deciding what to improve next.
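A hypothetical shape for such a trace, plus one of the mismatch checks mentioned above. The real QueryTrace and AnalysisEngine fields are assumptions here, not the project's actual dataclasses:

```python
from dataclasses import dataclass, field

@dataclass
class QueryTrace:
    question: str
    retrieved_tables: list[str] = field(default_factory=list)  # stage 2 output
    used_tables: list[str] = field(default_factory=list)       # tables in final SQL
    attempts: int = 0                                          # stage 4 rounds

def retrieval_noise(trace: QueryTrace) -> set[str]:
    """Tables stage 2 retrieved that the final SQL never referenced."""
    return set(trace.retrieved_tables) - set(trace.used_tables)
```

Accumulating such traces and counting which tables are persistently retrieved but unused is one concrete way an analysis engine can surface retrieval noise.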

The LLM abstraction (llm/base.py) targets OpenAI-compatible endpoints, covering self-hosted models, rented inference servers, and OpenAI cloud via a configurable base_url. The embedding function is injectable, so schema content can stay on your infrastructure.
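One way the base.py abstraction could look. Everything beyond call(system, messages) is an assumption for illustration, and the HTTP transport is deliberately elided:

```python
from abc import ABC, abstractmethod

class LLM(ABC):
    """Abstract chat interface; concrete backends differ only in transport."""
    @abstractmethod
    def call(self, system: str, messages: list[dict]) -> str: ...

class OpenAICompatible(LLM):
    """Targets any OpenAI-compatible endpoint: base_url selects a self-hosted
    model, a rented inference server, or OpenAI cloud."""
    def __init__(self, model: str, base_url: str, api_key: str):
        self.model, self.base_url, self.api_key = model, base_url, api_key

    def _payload(self, system: str, messages: list[dict]) -> dict:
        # OpenAI chat format: system instruction first, then the conversation.
        return {"model": self.model,
                "messages": [{"role": "system", "content": system}, *messages]}

    def call(self, system: str, messages: list[dict]) -> str:
        # POST f"{self.base_url}/chat/completions" with self._payload(...)
        raise NotImplementedError("transport elided in this sketch")
```

Keeping the payload assembly separate from the transport makes the prompt format testable without a live endpoint.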


Repository Structure

src/
├── pipeline/
│   ├── __init__.py
│   ├── orchestrator.py       # ask() — calls stages 2→3→4 in sequence, returns (sql, df, trace)
│   ├── stage2_schema.py      # Schema linking: entity extraction → filtered DDL + FK annotation
│   ├── stage3_generate.py    # Prompt assembly + LLM call + SQL extraction
│   └── stage4_refine.py      # Execution feedback: run → error → correct (max 3 rounds)
│
├── prompt/
│   ├── __init__.py
│   ├── builder.py            # Assembles message list from stage 2+3 outputs
│   ├── templates.py          # System instruction, FK hints, type-specific hints
│   └── dialects.py           # Per-dialect catalog SQL templates (from text2sql-framework)
│
├── retrieval/
│   ├── __init__.py
│   ├── base.py               # Abstract interface + injectable embed_fn
│   └── <backend>.py          # Concrete implementation: qdrant.py OR pgvector.py
│
├── llm/
│   ├── __init__.py
│   ├── base.py               # Abstract interface: call(system, messages) → str
│   ├── openai.py             # OpenAI implementation (primary — covers self-hosted via base_url)
│   └── anthropic.py          # Anthropic implementation (secondary — if direct API access needed)
│
├── schema/
│   ├── __init__.py
│   └── types.py              # TableDefinition, ColumnDefinition dataclasses (from Vanna)
│
├── tracing/
│   ├── __init__.py
│   ├── tracer.py             # QueryTrace dataclass + Tracer (from text2sql-framework)
│   └── analyze.py            # AnalysisEngine: trace mismatch detection (from text2sql-framework)
│
├── connection/
│   ├── __init__.py
│   └── database.py           # SQLAlchemy wrapper + _is_read_only() guard (from text2sql-framework)
│
├── eval/
│   ├── __init__.py
│   ├── harness.py            # EX/EM scoring
│   └── dataset.py            # Test set loader: list of {question, expected_sql}
│
└── utils/
    ├── __init__.py
    ├── tokens.py             # Token counting (provider-aware)
    └── sql.py                # extract_sql, is_read_only
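The two helpers named in utils/sql.py might be sketched like this. The behavior is an assumption (pull the first fenced SQL block out of an LLM reply, and pass only SELECT/WITH statements), not the actual implementation:

```python
import re

def extract_sql(reply: str) -> str:
    """Return the first fenced ```sql block, or the raw reply stripped."""
    m = re.search(r"```sql\s*(.*?)```", reply, re.DOTALL | re.IGNORECASE)
    return (m.group(1) if m else reply).strip()

def is_read_only(sql: str) -> bool:
    """Crude guard: only statements starting with SELECT or WITH pass."""
    head = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
    return head in {"SELECT", "WITH"}
```

A keyword guard like this is a first line of defense only; the connection layer's _is_read_only() would still be needed at execution time.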

Setup

Requirements: Python ≥ 3.11, uv

# clone and enter the repo
git clone <repo-url>
cd t2sql-framework

# create virtual environment and install dependencies
uv venv
uv sync

# optional extras
uv sync --extra pgvector    # PostgreSQL + pgvector backend
uv sync --extra qdrant      # Qdrant backend
uv sync --extra anthropic   # Anthropic API support
uv sync --extra dev         # pytest

Activate the environment:

source .venv/bin/activate

Roadmap

  • M0 — Scaffold + Baseline
    • M0.1 — repo structure, pyproject.toml, package skeleton
    • M0.2 — port foundation: utils/, schema/, connection/, tracing/, llm/, retrieval/
    • M0.3 — wire orchestrator.ask() (retrieval → prompt → LLM → SQL)
    • M0.4 — connection layer + read-only guard
    • M0.5 — tracing: ask() returns (sql, df, trace)
    • M0.6 — eval harness: 3050 labeled Q/SQL pairs, baseline EX score
  • M1 — Execution Feedback Loop — stage4_refine, retry on error, trace attempt count
  • M2 — Prompt Cleanup (skippable) — k=4, FK hint section, tiktoken token counts
  • M3 — Schema Handling — dynamic discovery (Path A) or static linking (Path B), synonym fixes
  • M4 — Query Type Classifier (validate before building) — only if hardcoded hints move EX
  • M5 — Multi-Candidate Sampling (opt-in) — N candidates, majority-vote on result hash
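The M5 idea of majority-voting on a result hash can be sketched as follows. Helper names are hypothetical, and candidate results are assumed to be lists of row tuples:

```python
import hashlib
from collections import Counter

def result_hash(rows) -> str:
    """Order-insensitive fingerprint of a query result."""
    canon = sorted(repr(tuple(r)) for r in rows)
    return hashlib.sha256("\n".join(canon).encode()).hexdigest()

def majority_vote(candidates):
    """candidates: list of (sql, rows) pairs from N sampled generations.
    Return the SQL whose execution result occurs most often."""
    counts = Counter(result_hash(rows) for _, rows in candidates)
    winner_hash, _ = counts.most_common(1)[0]
    for sql, rows in candidates:
        if result_hash(rows) == winner_hash:
            return sql
```

Hashing the result rather than the SQL text means two syntactically different queries that return the same rows count as agreeing, which is the usual rationale for execution-based voting.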