Why I built this
Most LLM tutorials end with “Hello, World!” Mine ends with a working procurement analyst.
After 15+ years at Société Générale, Airbus, and Volvo Cars, I’ve seen what happens when teams skip the fundamentals. When LLMs arrived, the same pattern emerged: “just get it working, we’ll add structure later.”
That never works. So I built this one-file MVP to show the opposite: treat the LLM like software from day one. Schemas. Validation. Retries. The boring stuff that prevents 3 AM debugging sessions.
It’s a procurement intelligence assistant that:
- Filters tenders for relevance (cybersecurity / AI / software)
- Rates the opportunity (fit, win probability, effort, risks)
- Generates structured bid content (executive summary, approach, value prop, timeline)
All of it runs locally through LM Studio using an OpenAI-compatible API. The model is forced to return structured JSON validated with Pydantic, so downstream code stays clean and predictable.
This project is fully available on my GitHub. Make sure to check out the tag v0.1-article-procurement-mvp for this one-file simplified version.
GitHub - aminrj/procurement-ai
Contribute to aminrj/procurement-ai development by creating an account on GitHub.
Why I started here (and why procurement is a great sandbox)
What is a procurement tender?
A procurement tender is a document where an organization (public or private) asks companies to offer a price and plan to do a job or provide a service. The organization then compares the offers and chooses the best one.
Procurement tenders are an underrated playground for applied LLM engineering:
- Inputs are messy: long descriptions, vague requirements, inconsistent formatting
- Outputs have business impact: go/no-go decisions, prioritization, drafting bids
- Constraints are strict: you need repeatable scoring and consistent structure
- There’s a natural workflow: filter → evaluate → generate
And from a learning perspective, it forces you to handle the real problems of LLM apps:
- structured outputs
- retries
- temperature control
- orchestration
- guardrails & branching logic
The MVP in one sentence
A sequential, multi-agent pipeline that turns a tender into a validated decision and draft bid content, with structured JSON enforced via Pydantic.
The core design choice: treat the LLM like software
If you’re new to LLM engineering, here’s the first trap:
You call the model. You get back text. You try to parse it. It breaks. You add more prompts. It breaks differently.
The upgrade is to treat the model output as a contract.
This MVP uses Pydantic models as that contract:
- The model must return JSON.
- The JSON must match a schema.
- Values must fall within constraints.
- If anything is wrong, we retry. That’s not “prompting.” That’s engineering.
LM Studio for local LLM development and testing
The architecture
This app is layered in a way that mirrors production systems:
- Schema layer: Pydantic models describing expected outputs
- LLM infrastructure layer: one service that does API calls + cleaning + validation
- Agent layer: business logic prompts (filter, rate, generate)
- Orchestration layer: branching workflow + status + timing
- Demo layer:
main() runs sample tenders and prints a report Here’s the flow:
Application workflow This is a “boring” linear workflow — which is exactly why it’s perfect for learning. Later, you can compare it to LangGraph or more complex agent routing. But first, make the basics solid.
Step 1: define the schemas
Let’s start with the most important part: structured outputs.
Pydantic gives you:
class FilterResult(BaseModel): “"”Output from Filter Agent””” is_relevant: bool = Field(description=”Is tender relevant?”) confidence: float = Field(description=”Confidence 0-1”, ge=0, le=1) categories: List[TenderCategory] = Field(description=”Detected categories”) reasoning: str = Field(description=”Explanation for decision”) That ge=0, le=1 is not decoration. It’s the difference between “confidence = 0.92” and “confidence = 9.2” breaking your system silently.
Here’s the rating model (multi-dimensional scoring, strengths, risks):
class RatingResult(BaseModel): overall_score: float = Field(description=”Score 0-10”, ge=0, le=10) strategic_fit: float = Field(description=”Fit score 0-10”, ge=0, le=10) win_probability: float = Field(description=”Win chance 0-10”, ge=0, le=10) effort_required: float = Field(description=”Effort 0-10”, ge=0, le=10) strengths: List[str] = Field(description=”Top 3 strengths”) risks: List[str] = Field(description=”Top 3 risks”) recommendation: str = Field(description=”Go/No-Go with reasoning”) And the bid content model:
class BidDocument(BaseModel): executive_summary: str = Field(description=”2-3 paragraph summary”) technical_approach: str = Field(description=”How we’ll solve it”) value_proposition: str = Field(description=”Why choose us”) timeline_estimate: str = Field(description=”Project timeline”) This is the “contract” mindset:
Don’t accept vague prose. Accept validated data.
Step 2: build a single LLM gateway
Most early LLM prototypes scatter API calls all over the code. That becomes untestable fast.
Instead, this MVP centralizes the LLM interaction in one class: LLMService.
The key method is the heart of the system:
async def generate_structured( self, prompt: str, response_model: BaseModel, system_prompt: str, temperature: float = 0.1, max_retries: int = 3, ) -> BaseModel: This method demonstrates a production-grade pattern:
- Build messages (system + user)
- Inject schema guidance into the user prompt
- Call the LLM API (LM Studio)
- Clean the response (remove code fences, extract JSON)
- Parse JSON
- Validate with Pydantic
- Retry if anything fails
Prompt-time schema steering
This MVP doesn’t use function calling. Instead it uses example-driven JSON steering.
The user prompt is built like this:
messages = [ {“role”: “system”, “content”: system_prompt}, {“role”: “user”, “content”: self._build_structured_prompt(prompt, response_model)}, ] Then _build_structured_prompt() injects:
- an example JSON object with correct types
- strict formatting rules
- constraints (confidence 0–1, scores 0–10, enum values, lists required) A snippet:
return f””“{prompt} You must respond with ACTUAL DATA in JSON format, not a schema.Here’s the expected format with CORRECT value types: {example_json} CRITICAL VALUE REQUIREMENTS:
- confidence: Use decimal 0-1 (like 0.95, not 9.5)
- Categories: Use EXACT enum values: “cybersecurity”, “ai”, “software”, “other” (lowercase)
- Scores: Use numbers 0-10 (like 8.5)
- Arrays: Use actual lists with 3 items for strengths/risks
- All text fields: Provide meaningful actual content FORMATTING RULES:
- Start with {{ and end with }}
- No explanations before or after JSON
- No code blocks or backticks””” This might look verbose, but it’s teaching the model how to behave.
When I’m coding with LLMs, I prefer explicit guardrails over “clever” prompts.
Step 3: clean and validate
Even good models occasionally return:
- Markdown fences around the JSON
- Commentary before or after the object
- Incomplete objects
- Extra braces inside reasoning text So the MVP includes _clean_json() to strip markdown and extract the first balanced JSON object.
This is one of those “unsexy” details that separates a demo from a working app.
def _clean_json(self, text: str) -> str: cleaned = text.strip()
Remove markdown code blocks
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
| if cleaned.startswith("```json"):
cleaned = cleaned[7:]
elif cleaned.startswith("```"):
cleaned = cleaned[3:]
if cleaned.endswith("```"):
cleaned = cleaned[:-3]
cleaned = cleaned.strip()
# Find the JSON object by looking for balanced braces
start_idx = cleaned.find('{')
if start_idx == -1:
return cleaned
brace_count = 0
end_idx = -1
for i, char in enumerate(cleaned[start_idx:], start_idx):
if char == '{':
brace_count += 1
elif char == '}':
brace_count -= 1
if brace_count == 0:
end_idx = i
break
if end_idx != -1:
return cleaned[start_idx:end_idx + 1]
return cleaned
|
The two-stage validation gate
The method then does:
json.loads(cleaned) response_model.model_validate(parsed) That second step is where Pydantic enforces correctness.
If anything fails, we retry:
for attempt in range(max_retries): try: response = await self._call_api(messages, temperature) cleaned = self._clean_json(response) if not cleaned.startswith(‘{‘) or not cleaned.endswith(‘}’): raise ValueError(“Response doesn’t look like JSON”) parsed = json.loads(cleaned) return response_model.model_validate(parsed) except Exception as e: if attempt == max_retries - 1: raise Exception(f”Failed after {max_retries} attempts: {e}”) await asyncio.sleep(2) This gives you a stable contract:
- if you get a result, it matches the schema
- if not, it fails loudly and predictably
Step 4: the agents
With the infrastructure in place, the agents become clean and readable.
Agent 1: filteragent
The filtering agent answers: “Do we care?”
Key design choice: low temperature.
system = “You are an expert procurement analyst specializing in technology tenders. Be precise and conservative.” return await self.llm.generate_structured( prompt=prompt, response_model=FilterResult, system_prompt=system, temperature=Config.TEMPERATURE_PRECISE, ) The prompt includes explicit criteria:
- relevant if it involves cybersecurity / AI / software development
- not relevant if hardware, physical infra, catering, etc. This is a classification prompt with reasoning, not “generate content.”
Agent 2: ratingagent
The prompt asks for:
- strategic fit
- win probability
- effort required
- strengths and risks
- go/no-go recommendation And again: low temperature.
system = “You are a business development expert evaluating tender opportunities. Be analytical and realistic, not optimistic.” The output is forced into RatingResult, so the orchestrator can branch on:
if result.rating_result.overall_score < 7.0: result.status = “rated_low” return result That branch is important: it’s cost control and quality control.
Agent 3: documentgenerator
Now we increase temperature for writing:
temperature=Config.TEMPERATURE_CREATIVE But we still constrain the output via BidDocument.
This is an important lesson:
Creativity does not mean unstructured.
Even “creative generation” should land in a contract if you plan to automate anything downstream.
Step 5: orchestration
The orchestrator ties everything into a coherent pipeline.
It does four jobs:
- sequential execution
- branching logic
- status tracking
- timing A key part is the early exit:
if ( not result.filter_result.is_relevant or result.filter_result.confidence < 0.6 ): result.status = “filtered_out” return result Then:
if result.rating_result.overall_score < 7.0: result.status = “rated_low” return result Finally:
result.bid_document = await self.doc_generator.generate( tender, categories, result.rating_result.strengths ) result.status = “complete” This is “agent orchestration,” but it’s intentionally simple. You can understand every branch without mental overhead.
That’s a feature.
A quick demo dataset
The MVP includes sample tenders:
- AI cybersecurity platform (should be relevant + high rated)
- office furniture (should be filtered out)
- custom CRM software (likely relevant) That gives you immediate feedback on whether your prompts and schema steering are working.
SAMPLE_TENDERS = [ Tender(…), Tender(…), Tender(…), ] When you run main(), you get a summary report:
- how many were relevant
- how many rated high
- how many documents generated
- processing time This is the beginning of an evaluation loop.
If you build more samples (including tricky borderline cases), this becomes the foundation of a real test suite.
What I learned
Structured outputs are not optional. The fastest path to reliability: define schema, steer the model toward JSON, validate and retry. Skip validation and your app becomes fragile.
Prompts get easier when the schema is clear. When you know the output fields, prompts become focused: return these categories, return 3 strengths, return these four sections. The schema eliminates ambiguity.
Temperature is a tool, not a vibe. I used 0.1 for filtering and rating (precision), 0.7 for document generation (variation). It’s not about “better answers” — it’s about matching the mode to the task.
Orchestration is where business logic lives. The model does analysis, but the system makes decisions. Exit early if irrelevant, exit if the score is too low, only generate proposals when worth it. That’s a product mindset.
Debugging LLM apps is mostly debugging output shape. Most failures are formatting or schema mismatch. The code prints the raw and cleaned response on each retry, which tells you exactly where things went wrong.
Where this goes next
This MVP is intentionally a “one-file learning artifact.” But it points directly to the next iterations:
Real ingestion: replace SAMPLE_TENDERS with a scraper — HTML/PDF parsing, normalization into Tender objects, JSONL for storage.
Persistence + UI: store ProcessedTender objects in Postgres or SQLite, add a simple web interface to browse decisions and export results.
Evaluation harness: build a labeled dataset of 50–200 tenders with expected relevance, category, and score ranges. Track false positives, rating stability, and output validity rate. This becomes your real test suite.
Stronger guardrails: schema-aware repair prompts on parse failures, response-format controls where supported, field constraints like exactly 3 strengths/risks.
Orchestration comparison: once the linear flow is solid, try LangGraph, add optional agents (compliance, risk), or layer in retrieval over past bids and company capabilities.
Closing: the big lesson
LLMs are probabilistic text generators. Wrap them in Pydantic schemas, validation gates, retries, and orchestration logic, and they start to behave like reliable components.
This one-file MVP is my first documented step in that direction. Next I’ll move from “demo tenders” to real ingestion, persistence, evaluation, and automation.
If you’re also learning LLM engineering: start with something that forces structure and decisions. You’ll learn more in a week than you’ll learn from months of prompt tinkering.
Appendix: the pattern to reuse everywhere
If you only take one piece from this article, take this pattern:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| # 1) Define schema with Pydantic
class OutputModel(BaseModel):
field_a: str
score: float = Field(ge=0, le=10)
# 2) Call LLM and force JSON shape
result = await llm.generate_structured(
prompt="Do the task and return JSON.",
response_model=OutputModel,
system_prompt="Be precise.",
temperature=0.1,
)
# 3) Now you have a validated object, not messy text
print(result.score)
|
That’s the foundation of LLM engineering.
Resources & next steps
Read the Code: github.com/aminrj/procurement-ai — Tag v0.1-article-procurement-mvp for the one-file version
Follow the Series:
Starting an LLM Project? If you’re building AI systems and want to avoid the “prototype trap” where demos work but production fails, let’s talk. I offer a free 30-minute assessment where we’ll review your architecture approach and flag potential pitfalls before you hit them.
Connect with me on LinkedIn, and star the procurement-ai repository to follow the journey from MVP to production.
Thanks for reading.