LLM-Engineering; Building a Procurements Analyst AI

Build a working AI analyst from scratch in one file. Learn LLM engineering fundamentals through a real procurement analysis application with structured outputs.

Posted Jan 26, 2026 Updated Apr 30, 2026

By Amine Raji, PhD

12 min read

LLM-Engineering; Building a Procurements Analyst AI

Why I built this

Most LLM tutorials end with “Hello, World!” Mine ends with a working procurement analyst.

After 15+ years at Société Générale, Airbus, and Volvo Cars, I’ve seen what happens when teams skip the fundamentals. When LLMs arrived, the same pattern emerged: “just get it working, we’ll add structure later.”

That never works. So I built this one-file MVP to show the opposite: treat the LLM like software from day one. Schemas. Validation. Retries. The boring stuff that prevents 3 AM debugging sessions.

It’s a procurement intelligence assistant that:

Filters tenders for relevance (cybersecurity / AI / software)
Rates the opportunity (fit, win probability, effort, risks)
Generates structured bid content (executive summary, approach, value prop, timeline)

All of it runs locally through LM Studio using an OpenAI-compatible API. The model is forced to return structured JSON validated with Pydantic, so downstream code stays clean and predictable.

This project is fully available on my GitHub. Make sure to check out the tag v0.1-article-procurement-mvp for this one-file simplified version.

GitHub - aminrj/procurement-ai

Contribute to aminrj/procurement-ai development by creating an account on GitHub.

Why I started here (and why procurement is a great sandbox)

What is a procurement tender?

A procurement tender is a document where an organization (public or private) asks companies to offer a price and plan to do a job or provide a service. The organization then compares the offers and chooses the best one.

Procurement tenders are an underrated playground for applied LLM engineering:

Inputs are messy: long descriptions, vague requirements, inconsistent formatting
Outputs have business impact: go/no-go decisions, prioritization, drafting bids
Constraints are strict: you need repeatable scoring and consistent structure
There’s a natural workflow: filter → evaluate → generate

And from a learning perspective, it forces you to handle the real problems of LLM apps:

structured outputs
retries
temperature control
orchestration
guardrails & branching logic
The MVP in one sentence

A sequential, multi-agent pipeline that turns a tender into a validated decision and draft bid content, with structured JSON enforced via Pydantic.

The core design choice: treat the LLM like software

If you’re new to LLM engineering, here’s the first trap:

You call the model. You get back text. You try to parse it. It breaks. You add more prompts. It breaks differently.

The upgrade is to treat the model output as a contract.

This MVP uses Pydantic models as that contract:

The model must return JSON.
The JSON must match a schema.
Values must fall within constraints.
If anything is wrong, we retry. That’s not “prompting.” That’s engineering.

LM Studio for local LLM development and testing

The architecture

This app is layered in a way that mirrors production systems:

Schema layer: Pydantic models describing expected outputs
LLM infrastructure layer: one service that does API calls + cleaning + validation
Agent layer: business logic prompts (filter, rate, generate)
Orchestration layer: branching workflow + status + timing
Demo layer: main() runs sample tenders and prints a report Here’s the flow:

Application workflow This is a “boring” linear workflow — which is exactly why it’s perfect for learning. Later, you can compare it to LangGraph or more complex agent routing. But first, make the basics solid.

Step 1: define the schemas

Let’s start with the most important part: structured outputs.

Pydantic gives you:

Type enforcement (float vs string, list vs scalar)
Constraint checking (confidence must be 0–1, scores must be 0–10)
Parsing into Python objects you can trust In this MVP, each agent has a corresponding schema:
FilterAgent → FilterResult
RatingAgent → RatingResult
DocumentGenerator → BidDocument Here’s the filtering output model:

class FilterResult(BaseModel): “"”Output from Filter Agent””” is_relevant: bool = Field(description=”Is tender relevant?”) confidence: float = Field(description=”Confidence 0-1”, ge=0, le=1) categories: List[TenderCategory] = Field(description=”Detected categories”) reasoning: str = Field(description=”Explanation for decision”) That ge=0, le=1 is not decoration. It’s the difference between “confidence = 0.92” and “confidence = 9.2” breaking your system silently.

Here’s the rating model (multi-dimensional scoring, strengths, risks):

class RatingResult(BaseModel): overall_score: float = Field(description=”Score 0-10”, ge=0, le=10) strategic_fit: float = Field(description=”Fit score 0-10”, ge=0, le=10) win_probability: float = Field(description=”Win chance 0-10”, ge=0, le=10) effort_required: float = Field(description=”Effort 0-10”, ge=0, le=10) strengths: List[str] = Field(description=”Top 3 strengths”) risks: List[str] = Field(description=”Top 3 risks”) recommendation: str = Field(description=”Go/No-Go with reasoning”) And the bid content model:

class BidDocument(BaseModel): executive_summary: str = Field(description=”2-3 paragraph summary”) technical_approach: str = Field(description=”How we’ll solve it”) value_proposition: str = Field(description=”Why choose us”) timeline_estimate: str = Field(description=”Project timeline”) This is the “contract” mindset:

Don’t accept vague prose. Accept validated data.

Step 2: build a single LLM gateway

Most early LLM prototypes scatter API calls all over the code. That becomes untestable fast.

Instead, this MVP centralizes the LLM interaction in one class: LLMService.

The key method is the heart of the system:

async def generate_structured( self, prompt: str, response_model: BaseModel, system_prompt: str, temperature: float = 0.1, max_retries: int = 3, ) -> BaseModel: This method demonstrates a production-grade pattern:

Build messages (system + user)
Inject schema guidance into the user prompt
Call the LLM API (LM Studio)
Clean the response (remove code fences, extract JSON)
Parse JSON
Validate with Pydantic
Retry if anything fails

Prompt-time schema steering

This MVP doesn’t use function calling. Instead it uses example-driven JSON steering.

The user prompt is built like this:

messages = [ {“role”: “system”, “content”: system_prompt}, {“role”: “user”, “content”: self._build_structured_prompt(prompt, response_model)}, ] Then _build_structured_prompt() injects:

an example JSON object with correct types
strict formatting rules
constraints (confidence 0–1, scores 0–10, enum values, lists required) A snippet:

return f””“{prompt} You must respond with ACTUAL DATA in JSON format, not a schema.Here’s the expected format with CORRECT value types: {example_json} CRITICAL VALUE REQUIREMENTS:

confidence: Use decimal 0-1 (like 0.95, not 9.5)
Categories: Use EXACT enum values: “cybersecurity”, “ai”, “software”, “other” (lowercase)
Scores: Use numbers 0-10 (like 8.5)
Arrays: Use actual lists with 3 items for strengths/risks
All text fields: Provide meaningful actual content FORMATTING RULES:
Start with {{ and end with }}
No explanations before or after JSON
No code blocks or backticks””” This might look verbose, but it’s teaching the model how to behave.

When I’m coding with LLMs, I prefer explicit guardrails over “clever” prompts.

Step 3: clean and validate

Even good models occasionally return:

Markdown fences around the JSON
Commentary before or after the object
Incomplete objects
Extra braces inside reasoning text So the MVP includes _clean_json() to strip markdown and extract the first balanced JSON object.

This is one of those “unsexy” details that separates a demo from a working app.

def _clean_json(self, text: str) -> str: cleaned = text.strip()

Remove markdown code blocks

if cleaned.startswith("```json"):
    cleaned = cleaned[7:]
elif cleaned.startswith("```"):
    cleaned = cleaned[3:]
if cleaned.endswith("```"):
    cleaned = cleaned[:-3]
cleaned = cleaned.strip()
# Find the JSON object by looking for balanced braces
start_idx = cleaned.find('{')
if start_idx == -1:
    return cleaned
brace_count = 0
end_idx = -1
for i, char in enumerate(cleaned[start_idx:], start_idx):
    if char == '{':
        brace_count += 1
    elif char == '}':
        brace_count -= 1
        if brace_count == 0:
            end_idx = i
            break
if end_idx != -1:
    return cleaned[start_idx:end_idx + 1]
return cleaned

The two-stage validation gate

The method then does:

json.loads(cleaned) response_model.model_validate(parsed) That second step is where Pydantic enforces correctness.

If anything fails, we retry:

for attempt in range(max_retries): try: response = await self._call_api(messages, temperature) cleaned = self._clean_json(response) if not cleaned.startswith(‘{‘) or not cleaned.endswith(‘}’): raise ValueError(“Response doesn’t look like JSON”) parsed = json.loads(cleaned) return response_model.model_validate(parsed) except Exception as e: if attempt == max_retries - 1: raise Exception(f”Failed after {max_retries} attempts: {e}”) await asyncio.sleep(2) This gives you a stable contract:

if you get a result, it matches the schema
if not, it fails loudly and predictably
Step 4: the agents

With the infrastructure in place, the agents become clean and readable.

Agent 1: filteragent

The filtering agent answers: “Do we care?”

Key design choice: low temperature.

system = “You are an expert procurement analyst specializing in technology tenders. Be precise and conservative.” return await self.llm.generate_structured( prompt=prompt, response_model=FilterResult, system_prompt=system, temperature=Config.TEMPERATURE_PRECISE, ) The prompt includes explicit criteria:

relevant if it involves cybersecurity / AI / software development
not relevant if hardware, physical infra, catering, etc. This is a classification prompt with reasoning, not “generate content.”

Agent 2: ratingagent

The prompt asks for:

strategic fit
win probability
effort required
strengths and risks
go/no-go recommendation And again: low temperature.

system = “You are a business development expert evaluating tender opportunities. Be analytical and realistic, not optimistic.” The output is forced into RatingResult, so the orchestrator can branch on:

if result.rating_result.overall_score < 7.0: result.status = “rated_low” return result That branch is important: it’s cost control and quality control.

Agent 3: documentgenerator

Now we increase temperature for writing:

temperature=Config.TEMPERATURE_CREATIVE But we still constrain the output via BidDocument.

This is an important lesson:

Creativity does not mean unstructured.

Even “creative generation” should land in a contract if you plan to automate anything downstream.

Step 5: orchestration

The orchestrator ties everything into a coherent pipeline.

It does four jobs:

sequential execution
branching logic
status tracking
timing A key part is the early exit:

if ( not result.filter_result.is_relevant or result.filter_result.confidence < 0.6 ): result.status = “filtered_out” return result Then:

if result.rating_result.overall_score < 7.0: result.status = “rated_low” return result Finally:

result.bid_document = await self.doc_generator.generate( tender, categories, result.rating_result.strengths ) result.status = “complete” This is “agent orchestration,” but it’s intentionally simple. You can understand every branch without mental overhead.

That’s a feature.

A quick demo dataset

The MVP includes sample tenders:

AI cybersecurity platform (should be relevant + high rated)
office furniture (should be filtered out)
custom CRM software (likely relevant) That gives you immediate feedback on whether your prompts and schema steering are working.

SAMPLE_TENDERS = [ Tender(…), Tender(…), Tender(…), ] When you run main(), you get a summary report:

how many were relevant
how many rated high
how many documents generated
processing time This is the beginning of an evaluation loop.

If you build more samples (including tricky borderline cases), this becomes the foundation of a real test suite.

What I learned

Structured outputs are not optional. The fastest path to reliability: define schema, steer the model toward JSON, validate and retry. Skip validation and your app becomes fragile.
Prompts get easier when the schema is clear. When you know the output fields, prompts become focused: return these categories, return 3 strengths, return these four sections. The schema eliminates ambiguity.
Temperature is a tool, not a vibe. I used 0.1 for filtering and rating (precision), 0.7 for document generation (variation). It’s not about “better answers” — it’s about matching the mode to the task.
Orchestration is where business logic lives. The model does analysis, but the system makes decisions. Exit early if irrelevant, exit if the score is too low, only generate proposals when worth it. That’s a product mindset.
Debugging LLM apps is mostly debugging output shape. Most failures are formatting or schema mismatch. The code prints the raw and cleaned response on each retry, which tells you exactly where things went wrong.

Where this goes next

This MVP is intentionally a “one-file learning artifact.” But it points directly to the next iterations:

Real ingestion: replace SAMPLE_TENDERS with a scraper — HTML/PDF parsing, normalization into Tender objects, JSONL for storage.
Persistence + UI: store ProcessedTender objects in Postgres or SQLite, add a simple web interface to browse decisions and export results.
Evaluation harness: build a labeled dataset of 50–200 tenders with expected relevance, category, and score ranges. Track false positives, rating stability, and output validity rate. This becomes your real test suite.
Stronger guardrails: schema-aware repair prompts on parse failures, response-format controls where supported, field constraints like exactly 3 strengths/risks.
Orchestration comparison: once the linear flow is solid, try LangGraph, add optional agents (compliance, risk), or layer in retrieval over past bids and company capabilities.
Closing: the big lesson

LLMs are probabilistic text generators. Wrap them in Pydantic schemas, validation gates, retries, and orchestration logic, and they start to behave like reliable components.

This one-file MVP is my first documented step in that direction. Next I’ll move from “demo tenders” to real ingestion, persistence, evaluation, and automation.

If you’re also learning LLM engineering: start with something that forces structure and decisions. You’ll learn more in a week than you’ll learn from months of prompt tinkering.

Appendix: the pattern to reuse everywhere

If you only take one piece from this article, take this pattern:

  
# 1) Define schema with Pydantic
class OutputModel(BaseModel):
    field_a: str
    score: float = Field(ge=0, le=10)

# 2) Call LLM and force JSON shape
result = await llm.generate_structured(
    prompt="Do the task and return JSON.",
    response_model=OutputModel,
    system_prompt="Be precise.",
    temperature=0.1,
)

# 3) Now you have a validated object, not messy text
print(result.score)

That’s the foundation of LLM engineering.

Resources & next steps

Read the Code: github.com/aminrj/procurement-ai — Tag v0.1-article-procurement-mvp for the one-file version

Follow the Series:

Part 2: Build Production-Ready LLM Agents
Part 3: From MVP to Production SaaS

Starting an LLM Project? If you’re building AI systems and want to avoid the “prototype trap” where demos work but production fails, let’s talk. I offer a free 30-minute assessment where we’ll review your architecture approach and flag potential pitfalls before you hit them.

Connect with me on LinkedIn, and star the procurement-ai repository to follow the journey from MVP to production.

Thanks for reading.

The Security Lab Newsletter

This post is the article. The newsletter is the lab.

Subscribers get what doesn't fit in a post: the full attack code with annotated results, the measurement methodology behind the numbers, and the week's thread — where I work through a technique or incident across several days of testing rather than a single draft. The RAG poisoning work, the MCP CVE analysis, the red-teaming patterns — all of it started as a newsletter thread before it became a post. One email per week. No sponsored content. Unsubscribe any time.

Join the lab — it's free

Already subscribed? Browse the back-issues →

LLM

This post is licensed under CC BY 4.0 by the author.

LLM-Engineering; Building a Procurements Analyst AI

Why I built this

Why I started here (and why procurement is a great sandbox)

What is a procurement tender?

The MVP in one sentence

The core design choice: treat the LLM like software

The architecture

Step 1: define the schemas

Step 2: build a single LLM gateway

Prompt-time schema steering

Step 3: clean and validate

Remove markdown code blocks

Step 4: the agents

Agent 1: filteragent

Agent 2: ratingagent

Agent 3: documentgenerator

Step 5: orchestration

A quick demo dataset

What I learned

Where this goes next

Closing: the big lesson

Appendix: the pattern to reuse everywhere

Resources & next steps

This post is the article. The newsletter is the lab.

Found this useful?

Trending Tags