Reasoning About AI Systems Through Constraints

Manish
AI systems, architecture, engineering, decision-making

A reflection on reasoning about AI systems by starting with constraints, using a small, real system to surface tradeoffs around cost, deployment, and failure early.

From open questions to bounded reasoning

When I started thinking seriously about AI systems, I kept running into the same set of open questions. What does a real implementation actually look like? What sits between a notebook demo and production? How do you reason about these systems before committing?

Those questions aren’t flawed. But on their own, they don’t constrain the space enough to force decisions. They let multiple answers coexist without surfacing a concrete tradeoff.

What changed things for me was introducing limits.

Once limits are present, the questions shift. “What does it look like?” becomes “What must it handle?” “How do you reason about this?” turns into “Which decisions can’t be deferred?”

So instead of starting with models or frameworks, I started by bounding the system.


My constraint: Think clearly before committing

I had no desire to begin with a product idea. I didn't want to remain theoretical either.

Instead, I established a constraint:

Build the smallest possible AI-powered system that still forces real decisions.

That meant real deployment boundaries, real cost limits, and real failure modes—but deliberately limited scope. No UI. No users. No ambition to scale.

The objective was to gain clarity.

I wanted something concrete enough that I couldn't hand-wave questions away, but small enough that I wasn't locked into early bets. Constraints weren't a limitation here—they were the tool.


The smallest interaction that still makes certain decisions unavoidable

I anchored on a POST /recommendations endpoint. It accepts a text query and returns ranked JSON results.

Even this minimal contract immediately surfaced decisions around data shape, latency, cost, and failure handling.

This pattern appears everywhere: documentation search, content discovery, product recommendations.

The implementation: text to embedding, vector similarity search against a pre-built index, ranked results returned.

No UI, no personalization, no chat loop. The point isn't the feature—it's the system shape required to support even this simple interaction.


The minimal architecture that wouldn't lie to me

Even narrow, the system decomposes into layers:

  • API layer for request handling and contracts
  • AI pipeline for embedding and semantic retrieval
  • Serverless runtime with a zero-cost free tier, CPU-only execution, sub-2GB RAM, and real cold starts
  • Vector index built offline
  • Observability layer to make behavior visible

At runtime, a request flows like this:

```mermaid
sequenceDiagram
    participant U as User
    participant API as FastAPI Endpoint
    participant MW as Middleware
    participant RAG as RAG Service
    participant EMB as Embedding Model
    participant IDX as Vector Index
    participant LOG as Logs
    U->>API: Query
    API->>MW: Validate & log request
    MW->>LOG: Request metrics
    MW->>RAG: Semantic search
    RAG->>EMB: Generate embedding
    EMB-->>RAG: Query vector
    RAG->>IDX: Similarity search
    IDX-->>RAG: Top-K results
    RAG-->>MW: Ranked items
    MW->>API: Format response
    MW->>LOG: Response metrics
    API-->>U: Recommendations
```

What makes this useful is what it reveals: deployment boundaries, cost limits, cold starts, memory limits, and failure states.
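One concrete way the cold-start and memory constraints show up: heavy resources like the index get loaded lazily and cached per process, so a cold container pays the cost once on its first request instead of at every call. A stdlib-only sketch, where the body of `load_index` is a simulated stand-in for deserializing a real index file:

```python
import functools
import time

@functools.lru_cache(maxsize=1)
def load_index() -> dict[str, list[float]]:
    """Simulate deserializing a pre-built vector index from disk.

    On a serverless runtime this runs once per warm container; only
    cold starts pay the cost, and memory holds a single copy.
    """
    time.sleep(0.05)  # stand-in for reading and parsing the index file
    return {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0]}

def handle_request() -> int:
    index = load_index()  # cached after the first call in this process
    return len(index)
```

The first request in a fresh process is measurably slower than every request after it, which is exactly the cold-start behavior the constraint forces you to plan for.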


The AI components and the problems they quietly solve

Once constraints were in place, each of these components surfaced because it addressed a specific, persistent problem.

  • Embedding pipeline: Once inputs stopped being predictable, a single embedding model call turned free-form input into vectors.
  • Vector index: Handled cases where exact matches stopped being useful, enabling fast retrieval under tight latency and cost constraints.
  • RAG service: Sat between stored knowledge and live queries, keeping responses grounded while still adapting to what the user was asking.
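The retrieval core behind those components is small. Below is a sketch of the embed-then-search step, with a deterministic bag-of-words embedding and brute-force cosine search standing in for the real model and for FAISS; the vocabulary and document texts are made up for illustration:

```python
import math

# Tiny fixed vocabulary; a real embedding model replaces this entirely.
VOCAB = ["api", "docs", "guide", "install", "package", "reference", "the"]

def embed(text: str) -> list[float]:
    """Deterministic bag-of-words embedding, L2-normalized."""
    tokens = text.lower().split()
    vec = [float(tokens.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query_vec: list[float],
           index: dict[str, list[float]],
           top_k: int = 3) -> list[tuple[str, float]]:
    """Brute-force cosine similarity; FAISS plays this role at scale."""
    scored = [
        (doc_id, sum(q * v for q, v in zip(query_vec, vec)))
        for doc_id, vec in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Because both vectors are unit-length, the dot product is cosine similarity; swapping in a real model and FAISS changes the scale of this flow, not its shape.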

One system shape, many business problems

When you step back, the same end-to-end system shape supports a wide range of business problems that look different on the surface but resolve similarly under constraints.

  • Customer support routing: Free-form issue descriptions are interpreted semantically, matched against prior cases or knowledge, and routed to the right queue. The same flow supports observability through resolution time, escalations, and reopens.
  • Content discovery: After a user consumes something, the system retrieves semantically related items from a large archive and ranks them for relevance. Engagement and session depth become the visible outcomes.
  • Product search: Shopper queries are interpreted for intent, matched against a product corpus, and returned as ranked results. Conversion and search satisfaction provide the feedback loop.

Where reality shows up immediately

I used deterministic embeddings because accuracy wasn't the question yet—cost and flow were.

FAISS wasn't a tech preference; it was a CPU-only, free-tier constraint.

The /health 503 forced me to confront readiness instead of pretending availability.

These aren't abstractions. They're the real decisions constraints force early.
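The readiness point can be made concrete. Here is a framework-agnostic sketch of a /health handler that reports 503 until the index is in memory, so the platform stops routing traffic to a replica that can't serve; the `AppState` shape is an illustrative assumption, not the post's actual code:

```python
from http import HTTPStatus

class AppState:
    """Tracks whether the service's heavy dependencies are ready."""
    def __init__(self) -> None:
        self.index_loaded = False

def health(state: AppState) -> tuple[int, dict]:
    """Readiness check: 503 until the vector index is in memory.

    Returning 200 before the index loads would pretend availability;
    the 503 makes "not ready yet" an explicit, observable state.
    """
    if not state.index_loaded:
        return HTTPStatus.SERVICE_UNAVAILABLE, {"status": "loading"}
    return HTTPStatus.OK, {"status": "ready"}
```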


Why constraints are the point, not a limitation

Constraints shifted attention to what mattered early. They prevented premature optimization. I had to confront what "production" actually meant—before spending money or committing teams.

In this exploration, starting with constraints surfaced the tradeoffs earlier than starting with tools or frameworks.


A better first question

From an engineering decision-making perspective, starting with constraints made it easier to reason about tradeoffs before committing to tools, teams, or platforms. The better first question isn’t “what should this system look like?” but “what constraints must it survive?”