Improving natural language query results
Getting useful information from databases can be difficult if you’re not
familiar with SQL (Structured Query
Language). When given suitable prompts and
information about a database’s schema, most LLMs can now do a reasonable job of
generating appropriate SQL to extract the data which will answer your question.
There are now several tools available which are designed explicitly to
facilitate using AI in this way, enhancing the standard LLM prompts and response
with custom agents which directly utilise additional information about the
database schema and data to improve the quality of the response. These tools
take a natural language query (e.g. “What is the capital of Outer Mongolia?”),
some knowledge of a database populated with relevant data, and produce a SQL
statement (e.g. SELECT city FROM worldcities WHERE country = 'Mongolia' AND
capital = 'primary'
) that will hopefully give a useful answer.
This can rapidly become more complex as the question becomes only slightly more
complicated. For example, “What is the population of the capital city of
Mongolia?” might require combining information from multiple tables, perhaps
necessitating the use of a JOIN
statement. In these cases, the tool must be
given more information about the database schema and the relationships between
various tables in order to produce SQL that gives the right answer. In many
cases, there are multiple ways to write SQL that will give the correct answer to
the question, and the “best” SQL may be quite subjective based on trade-offs
between performance and readability.
Apart from having knowledge of the database schema, the tool can also be given some assistance in improving the quality of its results. Two common ways of doing this are by providing descriptions of each table and its columns, or by the use of “Golden SQLs”: natural language questions combined with the SQL that would provide the expected answer.
Let’s take a look at some examples of questions, the SQL that we might obtain without giving the tool any extra information other than the database schema, and how we might improve the quality of the results. For these examples, we’ll be using data from a climate change database which we’ve been working with as part of a tool to help people ask questions about climate change data (see our climate change data blog post). We’ll be restricting ourselves to fairly simple questions about populations, but this also requires knowledge of regions (e.g. countries and continents).
The tool we’re using for generating these queries is Dataherald. Dataherald uses multiple techniques to improve the accuracy of its results, including schema scanning, table and column descriptions, and Golden SQLs. It also allows you to provide feedback on the quality of the results, which can be used to improve the quality of the responses over time. You can provide detailed instructions about how particular columns can be used (e.g. “never use this column in a WHERE clause”),and it will use samples of data from tables that it deems relevant to improve the accuracy of the responses.
We’ll start by doing the minimum configuration necessary: point Dataherald at the database, and tell it to scan the schema to discover the tables and columns that exist. We haven’t yet given it any more context about how these tables relate to each other, but it might already be able to infer some information from the existing data. As we start prompting it, it will acquire knowledge about the relationships between these tables and the data that each contains.
Depending on the results we obtain for a set of example questions, we can attempt to address any issues by first providing basic descriptions of the tables and columns (including foreign key references), then expanding on those descriptions. Finally, we’ll add a single Golden SQL to illustrate our expected answer for question.
Here are the questions we’ll be using to evaluate the quality of the answers provided by Dataherald:
- “Display population growth in Britain”
- “Show me a line chart with the global population from 1980 onwards”
- “How many centarians were there in the world in 2020?” (Yes, the misspelling is deliberate.)
- “Compare US population vs China population 1970-2010”
📝 N.B. As is common with AI tools, the answers provided might change with each run, so the examples shown here might be different each time.
Round 1: Schema scan only
Question 1: “Display population growth in Britain” ✔
-- Display population growth in the United Kingdom
SELECT year, total
FROM population_data
WHERE region_id = 75
ORDER BY year;
year | total |
---|---|
1950 | 50055068 |
1951 | 50271908 |
1952 | 50484812 |
… | … |
This is a good start. Dataherald has already inspected the database to find the
appropriate value for region_id
, using United Kingdom as an effective synonym
for Britain, and has used this to filter the population_data
table to only
show the appropriate data. Not necessarily the most efficient way of doing it,
but it’s correct.
Question 2: “Show me a line chart with the global population from 1980 onwards” ✘
-- Select the year and total population from the population_data table
-- Filter the results to include only the years from 1980 onwards
SELECT year, total
FROM public.population_data
WHERE year >= 1980
ORDER BY year;
year | total |
---|---|
1980 | 5736092 |
1980 | 3147166 |
1980 | 4408239 |
… | … |
Not so good. This query is giving us total population for each individual region and year from 1980 onwards, not just the total global population.
Question 3: “How many centarians were there in the world in 2020?” ✘
-- SQL query to find the number of centenarians in the world in 2020
SELECT SUM(age_over_100) AS centenarians_2020
FROM population_data
WHERE year = 2020;
centenarians_2020 |
---|
3088375 |
Over three million? That seems like a lot. Again, this isn’t taking account of
the population_data
table containing multiple regions, so the SUM()
will be
counting centenarians in each country, plus counting them again as members of a
group of countries, continents, world… still, at least it’s coped with our
deliberate misspelling of ‘centenarians’.
Question 4: “Compare US population vs China population 1970-2010” ✔
-- Compare the population of the US and China from 1970 to 2010
SELECT
year,
SUM(CASE WHEN region_id = 39 THEN total ELSE 0 END) AS china_population,
SUM(CASE WHEN region_id = 224 THEN total ELSE 0 END) AS us_population
FROM
population_data
WHERE
year BETWEEN 1970 AND 2010
GROUP BY
year
ORDER BY
year;
year | china_population | us_population |
---|---|---|
1970 | 822534460.0 | 200328340.0 |
1971 | 843285440.0 | 202907920.0 |
1972 | 862840400.0 | 205238380.0 |
… | … | … |
This might not be the most efficient way of writing this query, but the answer
is correct. It’s using existing knowledge of region
table to sum the totals by
year for each of the requested countries.
Score: 2/4 Needs improvement
Round 2: Add table and column descriptions
Let’s see if we get better answers by giving Dataherald some basic descriptions of the tables and columns.
Question 1: “Display population growth in Britain” ✔
-- Display population growth in Britain (United Kingdom)
SELECT year, total
FROM public.population_data
WHERE region_id = 75
ORDER BY year;
year | total |
---|---|
1950 | 50055068 |
1951 | 50271908 |
1952 | 50484812 |
… | … |
The same query (and results) as before, but with schema information included. We’ve also been provided with a better comment explaining the use of United Kingdom instead of Britain.
Question 2: “Show me a line chart with the global population from 1980 onwards” ✘
-- Select the year and the sum of the total population for each year from 1980 onwards
SELECT year, SUM(total) AS global_population
FROM population_data
WHERE year >= 1980
GROUP BY year
ORDER BY year;
year | global_population |
---|---|
1980 | 28185725937.0 |
1981 | 28724967444.0 |
1982 | 29283823541.0 |
… | … |
It’s slightly better, as we’re only getting one row per year, but we’re getting the total population of all defined regions, not just the global population.
Question 3: “How many centarians were there in the world in 2020?” ✘
-- SQL query to find the number of centenarians in the world in 2020
SELECT SUM(age_over_100) AS centenarians_2020
FROM public.population_data
WHERE year = 2020;
centenarians_2020 |
---|
3088375 |
Apart from the inclusion of the schema name, this is exactly the same as before. We still need to provide some more information about this kind of question to get an accurate answer.
Question 4: “Compare US population vs China population 1970-2010” ✔
-- Compare the population of the US and China from 1970 to 2010
SELECT
year,
SUM(CASE WHEN region_id = 39 THEN total ELSE 0 END) AS china_population,
SUM(CASE WHEN region_id = 224 THEN total ELSE 0 END) AS us_population
FROM
population_data
WHERE
year BETWEEN 1970 AND 2010
GROUP BY
year
ORDER BY
year;
year | china_population | us_population |
---|---|---|
1970 | 822534460.0 | 200328340.0 |
1971 | 843285440.0 | 202907920.0 |
1972 | 862840400.0 | 205238380.0 |
… | … | … |
Interestingly, despite the perceived complexity, this is exactly the same query as we got before. The extra information we provided doesn’t appear to have affected the resulting query.
Score: 2/4 Still needs work
Round 3: Improved the description of region_id
column
Previously we’d only told Dataherald that the region_id
column was a foreign
key to the region
table, but we hadn’t give it any further information. Let’s
add this description and see if it helps:
"description": "The ID of the region which this information relates to. This
relates to entries in the region table, and may be an individual country, a
group of countries, north or south hemisphere, or the entire world."
Question 1: “Display population growth in Britain” ✔
-- Retrieve the population growth in the United Kingdom
SELECT year, total
FROM public.population_data
WHERE region_id = 75
ORDER BY year;
year | total |
---|---|
1950 | 50055068 |
1951 | 50271908 |
1952 | 50484812 |
… | … |
Despite the massive hint we’ve given about the region_id
column, we’re still
getting exactly the same query as before. This isn’t necessarily a problem, but
for more complicated questions or more generic answers, we’d probably prefer
that the query used a JOIN
rather than looking up the region_id
separately.
We could probably address this using another Golden SQL.
Question 2: “Show me a line chart with the global population from 1980 onwards” ✔
-- Retrieve global population data from 1980 onwards for the region_id corresponding to "World"
SELECT year, total
FROM public.population_data
WHERE region_id = 164 AND year >= 1980
ORDER BY year;
year | total |
---|---|
1980 | 4444008000 |
1981 | 4524627500 |
1982 | 4607984600 |
… | … |
That’s more like it. The description of the region_id
column seems to have
helped Dataherald to understand that the table contains data for multiple
regions, and it understands that the global population is represented by the
‘World’ region.
Question 3: “How many centarians were there in the world in 2020?” ✘
-- SQL query to find the number of centenarians in the world in 2020
SELECT SUM(age_over_100) AS centenarians_2020
FROM public.population_data
WHERE year = 2020;
centenarians_2020 |
---|
3088375 |
Despite getting the answer to Question 2 correct now, we’re still getting the same incorrect answer here. Hopefully a Golden SQL will help with this, if we give an appropriate example.
Question 4: “Compare US population vs China population 1970-2010” ✔
-- Compare the population of the US and China from 1970 to 2010
SELECT
year,
SUM(CASE WHEN region_id = 39 THEN total ELSE 0 END) AS china_population, -- Population of China
SUM(CASE WHEN region_id = 224 THEN total ELSE 0 END) AS us_population -- Population of the United States
FROM
public.population_data
WHERE
year BETWEEN 1970 AND 2010 -- Filter for the years 1970 to 2010
GROUP BY
year -- Group by year to get annual population data
ORDER BY
year; -- Order the results by year
year | china_population | us_population |
---|---|---|
1970 | 822534460.0 | 200328340.0 |
1971 | 843285440.0 | 202907920.0 |
1972 | 862840400.0 | 205238380.0 |
This is still using the predetermined values of region_id
, just with a little
more commenting about what they represent.
Score: 3/4 Much better
Round 4: Added a single Golden SQL
Let’s see if we can address our remaining incorrect answer by providing Dataherald with more context about how to obtain certain results and the relationship between tables, by means of a Golden SQL statement.
Question:
What percentage of the global population was under 15 in each year of the 1970s?
Answer:
select year, (pd.age_under_15::decimal / pd.total) * 100 as percent_under_15 from public.population_data pd inner join public.region r on pd.region_id = r.id and name = 'World' where year between 1970 and 1979;
Question 1: “Display population growth in Britain” ✔
-- Select the year and total population for the United Kingdom to display population growth
SELECT
pd.year,
pd.total
FROM
public.population_data pd
INNER JOIN
public.region r
ON
pd.region_id = r.id
WHERE
r.name = 'United Kingdom'
ORDER BY
pd.year;
year | total |
---|---|
1950 | 50055068 |
1951 | 50271908 |
1952 | 50484812 |
… | … |
Excellent. We’re still getting a valid answer, but now it’s using a JOIN
to
find the appropriate region_id
rather than looking it up separately. This is
likely to be a more efficient method for more complicated scenarios.
Question 2: “Show me a line chart with the global population from 1980 onwards” ✔
-- Select the year and total population from 1980 onwards for the global population
SELECT
pd.year,
pd.total
FROM
public.population_data pd
INNER JOIN
public.region r
ON
pd.region_id = r.id
WHERE
r.name = 'World'
AND pd.year >= 1980
ORDER BY
pd.year;
year | total |
---|---|
1980 | 4444008000 |
1981 | 4524627500 |
1982 | 4607984600 |
… | … |
Another good improvement. We’re still getting a correct answer, but this has
also been changed to utilise a JOIN
to find the appropriate region_id
for
the ‘World’ region.
Question 3: “How many centarians were there in the world in 2020?” ✔
-- Select the number of centenarians in the world in 2020
SELECT pd.age_over_100
FROM public.population_data pd
INNER JOIN public.region r ON pd.region_id = r.id
WHERE r.name = 'World' AND pd.year = 2020;
age_over_100 |
---|
547543 |
Success! The Golden SQL we provided appears to have helped Dataherald comprehend how to filter the results here, so we only get the centenarians for the ‘World’ region in 2020.
Question 4: “Compare US population vs China population 1970-2010” ✘
-- SQL query to compare the populations of the US and China from 1970 to 2010
SELECT
r.name AS country,
pd.year,
pd.total AS population
FROM
public.population_data pd
INNER JOIN
public.region r
ON
pd.region_id = r.id
WHERE
r.name IN ('United States', 'China')
AND pd.year BETWEEN 1970 AND 2010
ORDER BY
pd.year, r.name;
country | year | population |
---|---|---|
China | 1970 | 822534460 |
United States | 1970 | 200328340 |
China | 1971 | 843285440 |
United States | 1971 | 202907920 |
China | 1972 | 862840400 |
United States | 1972 | 205238380 |
… | … | … |
Disaster! We’ve broken this. Dataherald now gives a different result, generating lines of data for each of the two countries per year, rather than a single line per year. This could probably be addressed by adding further SQL examples to illustrate the expected output.
Score: 3/4 (again) Disappointing regression
Cost
Note that the additional information we’re providing to Dataherald in the form of descriptions and Golden SQL is not without cost, above the obvious one of the time taken to for us to consider the existing answers and craft appropriate information. Here’s the token counts for each round of questions:
Round | Question 1 | Question 2 | Question 3 | Question 4 | Total |
---|---|---|---|---|---|
1 | 17574 | 11586 | 5823 | 18755 | 53738 |
2 | 26176 | 10620 | 6987 | 33238 | 77021 |
3 | 22333 | 18846 | 9569 | 27180 | 77928 |
4 | 37996 | 13796 | 16262 | 28371 | 96425 |
We can see that the total token count has almost doubled as we’ve added descriptions and Golden SQLs, and we’re still not getting completely accurate results. There’s likely some internal caching going on as well, meaning that information determined in previous rounds of questions is being used to enhance future answers; running from a clean slate each time would likely result in bigger increases in token usage. We need to be careful not to just keep adding more and more information in the hope of getting better results, as this could cause the costs to escalate quickly.
Conclusion
Even when using existing tools such as Dataherald that help with the process, generating SQL from natural language queries is not straightforward. It’s important to ensure that you test this tooling just as you would for any other piece of code, by establishing a set of test cases that you can examine each time you make changes to the system, prior to releasing those changes to production. The same principles as you’d apply to any other piece of production code need to be applied here to ensure that you don’t unwittingly introduce regressions when optimising for specific cases.
A further complication is that the answers provided by these tools are not deterministic. There are often multiple ways of providing data which will potentially answer a natural language question, and different ways of writing the SQL that could generate any of those answers, so we can’t just compare the SQL to what we expect, or even the data extracted by running that SQL.
Tools such as promptfoo can be a great help for cases like this, as they’re flexible enough to allow you to connect to a variety of AI systems and provide test cases where the answers can be evaluated to ensure they are sensible.
Scott Logic have a wealth of experience in prompt engineering and testing practices to help you get the most out of your AI tools and investment. If you’d like to know more about how we can help you, please get in touch.