The Schema-Driven LLM Query Pattern

Typically, when working with an LLM, you send your data along with a prompt and receive a response. This pattern often introduces issues, especially with public LLMs: your raw data leaves your environment (a security concern), and the prompt grows with your dataset (token bloat).

There is a different pattern we can use: the Schema-Driven LLM Query Pattern.

With this pattern, we only send the data schema to the LLM along with our question. The LLM returns the query syntax, which we then execute locally against our data. This significantly reduces the amount of data sent to the LLM, helping with both token usage and security.

Let’s now walk through both the legacy pattern and the Schema-Driven LLM Query Pattern in more detail.

Pattern 1: Sending Full Data to the LLM

In this pattern, you serialise your data (e.g., a DataFrame as JSON) and let the LLM analyse or filter it.

Step 1 - Create the Dataset

import pandas as pd

df = pd.DataFrame(
    {
        "device": ["core-rtr-01", "core-rtr-01", "core-rtr-01"],
        "interface": ["GigabitEthernet0/0", "GigabitEthernet0/1", "GigabitEthernet0/2"],
        "ip_address": ["10.0.0.1", "10.0.1.1", "unassigned"],
        "admin_status": ["up", "up", "down"],
        "oper_status": ["up", "down", "down"],
        "protocol_status": ["up", "down", "down"],
        "input_errors": [0, 12, 0],
        "output_errors": [0, 3, 0],
    }
)

print(df)
# Output: 
#        device           interface  ip_address admin_status oper_status protocol_status  input_errors  output_errors
# 0  core-rtr-01  GigabitEthernet0/0    10.0.0.1           up          up              up             0              0
# 1  core-rtr-01  GigabitEthernet0/1    10.0.1.1           up        down            down            12              3
# 2  core-rtr-01  GigabitEthernet0/2  unassigned         down        down            down             0              0

Step 2 - Send Full Data to the LLM

from openai import OpenAI
import json

client = OpenAI()

data_json = df.to_dict(orient="records")

prompt = f"""
You are given Cisco interface data in JSON format.

Return interfaces that are administratively up but operationally down.

Output rules:
- Return only matching records as JSON.
- Do not include explanations or extra text.

Data:
{json.dumps(data_json, indent=2)}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)
# Output:
# [
#   {
#     "device": "core-rtr-01",
#     "interface": "GigabitEthernet0/1",
#     "ip_address": "10.0.1.1",
#     "admin_status": "up",
#     "oper_status": "down",
#     "protocol_status": "down",
#     "input_errors": 12,
#     "output_errors": 3
#   }
# ]

What Just Happened

  • The dataset was serialised.
  • The dataset was sent to the external LLM.
  • The LLM performed the filtering logic.
  • A computed result was returned.

Though this worked, it doesn’t scale: token usage grows linearly with your dataset, and, as you’ll appreciate, sending every row to an external service is less than ideal from a security standpoint.
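To put a rough number on the scaling problem, here is a quick sketch (the 1,000-row DataFrame is made up purely for illustration) comparing the size of a full-data payload against a schema-only payload:

```python
import json

import pandas as pd

# A made-up 1,000-row version of the interface dataset.
df = pd.DataFrame(
    {
        "device": ["core-rtr-01"] * 1000,
        "interface": [f"GigabitEthernet0/{i}" for i in range(1000)],
        "admin_status": ["up"] * 1000,
        "oper_status": ["down"] * 1000,
    }
)

full_payload = json.dumps(df.to_dict(orient="records"))
schema_payload = df.dtypes.to_string()

print(len(full_payload))    # grows linearly with the row count
print(len(schema_payload))  # stays the same no matter how many rows you have
```

The full-data payload here is already hundreds of times larger than the schema, and the gap only widens as rows are added.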

Pattern 2: Schema-Driven LLM Query Pattern

With this pattern, we no longer send the raw data. Instead, we send the schema.

Step 1 - Get the Schema

df_schema = df.dtypes

print(df_schema)
# Output:
# device               str
# interface            str
# ip_address           str
# admin_status         str
# oper_status          str
# protocol_status      str
# input_errors       int64
# output_errors      int64
# dtype: object
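If you prefer a more compact, JSON-friendly schema to embed in the prompt, one option (a sketch, not the approach used above) is to convert the dtypes to a plain dict:

```python
import pandas as pd

# A cut-down version of the interface DataFrame, for illustration.
df = pd.DataFrame(
    {
        "device": ["core-rtr-01"],
        "interface": ["GigabitEthernet0/0"],
        "input_errors": [0],
    }
)

# Map each column name to its dtype name, ready for json.dumps().
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(schema)
```

This keeps the prompt small and unambiguous, and the LLM still sees every column name and type it needs to build the query.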

Step 2 - Ask the LLM to Generate the Query Logic

# Before running the code below ensure your OpenAI API key is exported:
# export OPENAI_API_KEY="your_openai_api_key"

from openai import OpenAI

client = OpenAI()

prompt = f"""
You are given a pandas DataFrame schema.

Return interfaces that are administratively up but operationally down.

Output rules:
- Return only valid pandas query syntax as Python code.
- Do not include explanations or extra text.

Schema:
{df_schema}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

generated_code = response.choices[0].message.content
print(generated_code)
# result = df[(df['admin_status'] == 'up') & (df['oper_status'] == 'down')]
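Copying the generated line by hand, as we do in Step 3, is the safest option. If you instead want to execute the LLM's output programmatically, never exec() it blindly; the sketch below shows a deliberately naive guard (a real deployment would want proper sandboxing or `ast`-based inspection, not a simple deny-list):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "admin_status": ["up", "up", "down"],
        "oper_status": ["up", "down", "down"],
    }
)

# Stand-in for the LLM's response from Step 2.
generated_code = (
    "result = df[(df['admin_status'] == 'up') & (df['oper_status'] == 'down')]"
)

# Naive deny-list check -- illustration only, not a real sandbox.
banned = ("import", "open(", "exec", "eval", "__")
if any(token in generated_code for token in banned):
    raise ValueError("generated code failed the safety check")

# Execute with no builtins and only the DataFrame in scope.
namespace = {"df": df}
exec(generated_code, {"__builtins__": {}}, namespace)

result = namespace["result"]
print(len(result))  # 1
```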

Step 3 - Execute Locally

result = df[(df["admin_status"] == "up") & (df["oper_status"] == "down")] 
print(result)
# Output: 
#         device            interface  ip_address admin_status oper_status protocol_status  input_errors  output_errors
# 1  core-rtr-01  GigabitEthernet0/1   10.0.1.1          up        down            down            12              3

Or we can print it as JSON to make the output directly comparable to the first pattern’s result.

print(json.dumps(
    result.to_dict(orient="records"),
    indent=2
))
# Output:
# [
#   {
#     "device": "core-rtr-01",
#     "interface": "GigabitEthernet0/1",
#     "ip_address": "10.0.1.1",
#     "admin_status": "up",
#     "oper_status": "down",
#     "protocol_status": "down",
#     "input_errors": 12,
#     "output_errors": 3
#   }
# ]

As you can see, we only sent the schema to the LLM. The LLM returned the query based on our prompt, and we executed that query locally against our data.

Summary

I hope you enjoyed this post and can see the benefits of this pattern.

The real advantage is that it’s not limited to pandas. You can apply the same approach to any dataset where you have a defined schema - REST APIs, GraphQL, databases, even structured log formats.

  • Send the schema.
  • Get back the query.
  • Execute locally.
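As a concrete non-pandas example (a sketch using Python's built-in sqlite3 module; the table and the "generated" SQL are made up for illustration), the same flow works with a database: send the CREATE TABLE statement as the schema, get back SQL, and run it locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interfaces ("
    "device TEXT, interface TEXT, admin_status TEXT, oper_status TEXT)"
)
conn.executemany(
    "INSERT INTO interfaces VALUES (?, ?, ?, ?)",
    [
        ("core-rtr-01", "GigabitEthernet0/0", "up", "up"),
        ("core-rtr-01", "GigabitEthernet0/1", "up", "down"),
        ("core-rtr-01", "GigabitEthernet0/2", "down", "down"),
    ],
)

# Only this schema text would be sent to the LLM.
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'interfaces'"
).fetchone()[0]

# Stand-in for the SQL the LLM would return.
generated_sql = (
    "SELECT device, interface FROM interfaces "
    "WHERE admin_status = 'up' AND oper_status = 'down'"
)

print(conn.execute(generated_sql).fetchall())
# [('core-rtr-01', 'GigabitEthernet0/1')]
```

The data never leaves the local SQLite connection; only the CREATE TABLE text and the question would cross the wire.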

Thanks again and happy automating!
