Compute Constraints
The physics of agent existence—how computational resources like tokens, latency, memory, and cost create carrying capacities that shape what agents can do.
In ecology, carrying capacity is the maximum population an environment can sustain given available resources—food, water, space, energy. Organisms must survive within these limits or perish.
For AI agents, compute constraints define their carrying capacity: the tokens, memory, latency, and cost budgets that bound what’s possible. These aren’t abstract limits—they’re the physics of agent existence.
The Resource Landscape
Agents consume multiple computational resources:
                    AGENT OPERATION

  INPUT           PROCESSING           OUTPUT
  TOKENS  ────►     COMPUTE    ────►   TOKENS
    │                  │                 │
    ▼                  ▼                 ▼
┌───────┐          ┌───────┐         ┌───────┐
│Context│          │Latency│         │ Cost  │
│Window │          │ Time  │         │   $   │
└───────┘          └───────┘         └───────┘

  Every operation consumes limited resources.
Primary Constraints
Token Limits
The fundamental unit of language model capacity.
CONTEXT WINDOW (e.g., 128K tokens)

████████████████████████████████░░░░░░░░░░░░░░░░░░░░
        ▲                               ▲
        │                               │
   Used tokens                     Available
   (system prompt,                 capacity
   history, context)

Components competing for space:

| Component | Typical size |
|---|---|
| System prompt | ~1-5K tokens |
| Conversation | Variable (grows over time) |
| Retrieved context | Variable (per query) |
| Tool results | Variable (per operation) |
| Working space | What remains for reasoning |
Implications:
- Long conversations degrade as history exceeds limits
- Large documents can’t fit in a single context window
- Complex reasoning competes with background context
- Trade-off between memory and immediate capability
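The competition for context space can be sketched as a simple budget calculation. This is a hypothetical illustration, not any real API; the function name, component names, and the 128K window are assumptions.

```python
# Hypothetical sketch: allocating a fixed context window across the
# competing components listed above. Numbers are illustrative.

CONTEXT_WINDOW = 128_000  # total token capacity (e.g., a 128K model)

def allocate_budget(system_prompt: int, history: int, retrieved: int,
                    tool_results: int, window: int = CONTEXT_WINDOW) -> int:
    """Return tokens left as working space for reasoning and output.

    Raises ValueError if fixed components already exceed the window.
    """
    used = system_prompt + history + retrieved + tool_results
    if used > window:
        raise ValueError(f"context overflow: {used} > {window} tokens")
    return window - used

# A long conversation leaves little room to reason:
working = allocate_budget(system_prompt=3_000, history=90_000,
                          retrieved=20_000, tool_results=10_000)
print(working)  # 5000 tokens left for actual reasoning
```

The point of the sketch: history and retrieved context are not free, because every token they occupy is a token unavailable for working space.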
Latency
Time from request to response.
| Latency Component | Source |
|---|---|
| Network | Data transmission time |
| Queue | Waiting for processing capacity |
| Inference | Model computation (scales with tokens) |
| Tool execution | External API calls |
| Post-processing | Output filtering, formatting |
Total acceptable latency: 10 seconds

├──────┼──────┼────────────────┼───────┼─────┤
0      1      2                6       8     10 sec
│      │      │                │       │
│      │      │                │       └─ Safety checks
│      │      │                └─ Tool calls
│      │      └─ Model inference
│      └─ Context loading
└─ Network round-trip

If any component exceeds its budget, UX suffers.
Implications:
- Real-time applications have strict requirements
- More reasoning = more latency
- Tool calls add unpredictable delays
- Streaming can mask perceived latency
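The budget breakdown above can be checked mechanically: sum the per-component allocations and verify they fit the total. The component names and timings below are illustrative assumptions, not measurements.

```python
# Hypothetical latency budget check, mirroring the breakdown above.
# All timings are assumed values for illustration.

BUDGET_S = 10.0  # total acceptable latency in seconds

components = {
    "network": 0.5,
    "context_loading": 0.5,
    "inference": 4.0,
    "tool_calls": 4.0,   # often the least predictable component
    "safety_checks": 0.5,
}

total = sum(components.values())
headroom = BUDGET_S - total
assert headroom >= 0, "some component must be trimmed"
print(f"total={total:.1f}s headroom={headroom:.1f}s")
```

In practice the tool-call entry is the one to watch: external APIs add delays outside the agent's control, so conservative budgets leave headroom there.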
Memory
What persists across interactions.
| Memory Type | Constraint |
|---|---|
| Context window | Hard limit per call |
| Vector store | Storage and retrieval costs |
| External DB | Query latency, storage costs |
| Session state | Management complexity |
Cost
The economic constraint underlying all others.
COST PER OPERATION

| Resource | Price |
|---|---|
| Input tokens | $X per 1M tokens |
| Output tokens | $Y per 1M tokens (often 3-4x input) |
| Tool calls | $Z per API call |
| Storage | $W per GB per month |
| Compute | $V per GPU-hour |

Total cost = sum of all operations

| Workload | Typical cost |
|---|---|
| Simple query | ~$0.001 |
| Complex task | ~$0.10 - $1.00 |
| Long workflow | ~$1.00 - $10.00 |
| Heavy usage | ~$100+ per day |
Implications:
- Cost scales with capability usage
- Longer reasoning costs more
- Retrieval augmentation adds cost
- High volume requires careful budgeting
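A per-request cost estimate follows directly from the price table. The prices below are placeholders standing in for the $X/$Y figures above, not real vendor rates; the function and its defaults are assumptions for illustration.

```python
# Hypothetical per-request cost estimate. Prices are placeholders,
# not real vendor rates.

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_m: float = 3.0,     # $ per 1M input tokens
                  price_out_per_m: float = 12.0,   # output often 3-4x input
                  tool_calls: int = 0,
                  price_per_tool_call: float = 0.001) -> float:
    token_cost = (input_tokens * price_in_per_m +
                  output_tokens * price_out_per_m) / 1_000_000
    return token_cost + tool_calls * price_per_tool_call

# A "simple query" under these assumed prices lands near $0.001:
print(round(estimate_cost(input_tokens=200, output_tokens=100), 5))
```

Note how output tokens dominate: under these assumed prices, 100 output tokens cost twice as much as 200 input tokens, which is why verbose responses are a budgeting concern.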
Constraint Interactions
Constraints interact in complex ways:
CONSTRAINT INTERACTIONS

More tokens ─────────► Higher cost
    │
    └────────────────► Higher latency

More reasoning ──────► More tokens ─────► Both of the above

Better quality ──────► More reasoning ──► All of the above

More context ────────► More tokens
    │                  Less working space
    │
    └────────────────► Better grounding

EVERYTHING IS TRADE-OFFS
The Quality-Speed-Cost Triangle
QUALITY
▲
/│\
/ │ \
/ │ \
/ │ \
/ │ \
/ │ \
/ │ \
/ │ \
/ │ \
▼─────────┴─────────▼
SPEED COST
Pick two. The third suffers.
High quality + Low cost = Slow
High quality + Fast = Expensive
Fast + Low cost = Low quality
Adaptation Strategies
How agents and their operators adapt to constraints:
Token Management
Compression techniques:
- Summarize long histories
- Extract key information from documents
- Use efficient prompt formats
- Prune irrelevant context
Allocation strategies:
- Reserve space for reasoning
- Dynamic context window management
- Tiered importance for context items
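The allocation strategies above can be combined into one pruning pass: rank context items by importance tier and keep them, in order, until the budget is exhausted. The tiers, sizes, and item names below are illustrative assumptions.

```python
# Sketch of tiered context pruning: keep high-importance items first,
# drop the rest once the budget is full. Tiers and sizes are assumed.

def prune_context(items, budget):
    """items: list of (tier, token_count, text); lower tier = keep first."""
    kept, used = [], 0
    for tier, tokens, text in sorted(items, key=lambda it: it[0]):
        if used + tokens <= budget:
            kept.append(text)
            used += tokens
    return kept, used

items = [
    (0, 1_000, "system prompt"),     # tier 0: always keep
    (1, 4_000, "current question"),
    (2, 6_000, "retrieved doc A"),
    (3, 8_000, "old history"),       # first to go under pressure
]
kept, used = prune_context(items, budget=12_000)
print(kept, used)  # old history is dropped; 11000 of 12000 tokens used
```

A real implementation would typically summarize the dropped tier rather than discard it outright, trading a few tokens for continuity.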
Latency Optimization
Parallel processing:
- Multiple tool calls simultaneously
- Speculative execution
- Background pre-computation
Streaming:
- Return partial results early
- Progressive enhancement
Caching:
- Store frequent responses
- Pre-compute common operations
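Two of the tactics above, parallel tool calls and caching, compose naturally. The sketch below is a toy: the tool names are made up and the network call is simulated with a sleep.

```python
# Sketch combining parallel tool calls with caching. The tools are
# hypothetical; time.sleep stands in for network latency.

import time
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=256)          # caching: repeated calls are free
def call_tool(name: str, query: str) -> str:
    time.sleep(0.1)              # simulated 100ms external API call
    return f"{name}({query})"

def run_parallel(calls):
    # Independent calls run concurrently instead of back-to-back.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: call_tool(*c), calls))

start = time.perf_counter()
results = run_parallel([("search", "q1"), ("weather", "nyc"), ("search", "q1")])
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # roughly one sleep's worth, not three
```

Run serially, three 100ms calls would take about 300ms; in parallel they take about 100ms, and any later repeat of a cached call returns immediately.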
Cost Optimization
Model selection:
- Use cheaper models for simple tasks
- Route to expensive models only when needed
- Fine-tuned small models for specific tasks
Request optimization:
- Batch operations where possible
- Avoid redundant calls
- Cache aggressively
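Model routing can be as simple as a heuristic gate in front of the expensive model. The model names and the complexity heuristic below are made-up assumptions; real routers often use a trained classifier or a confidence score from the cheap model itself.

```python
# Hypothetical cost-aware router: send simple requests to a cheap model,
# escalate only when a heuristic flags complexity. Model names are made up.

CHEAP, EXPENSIVE = "small-model", "large-model"

def route(prompt: str, needs_tools: bool = False) -> str:
    # Crude complexity heuristic: tool use or a long prompt triggers
    # escalation. Thresholds here are illustrative.
    if needs_tools or len(prompt.split()) > 200:
        return EXPENSIVE
    return CHEAP

print(route("What is 2+2?"))                         # small-model
print(route("Plan a deployment", needs_tools=True))  # large-model
```

Because most traffic in many deployments is simple, even a crude router like this can shift the bulk of requests onto the cheap tier.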
Carrying Capacity Metaphor
Like ecological carrying capacity, compute constraints determine sustainable agent behavior:
SUSTAINABLE OPERATION ZONE

Capability
  Level   │
          │         ┌──────────────┐
          │         │ Sustainable  │
          │        ╱│    Zone      │╲
   High   │       ╱ │              │ ╲
          │      ╱  └──────────────┘  ╲
          │     ╱    Exceeds budget    ╲
   Low    │    ╱    (unsustainable)     ╲
          │   ╱                          ╲
          └───────────────────────────────────
            Low                         High
                    Resource Budget

Agents must operate within the sustainable zone or
run out of resources (funds, tokens, time).
Boom-Bust Dynamics
Agents can exceed sustainable limits temporarily:
- Burst capacity for complex tasks
- Borrowing from future budget
- Degraded performance while recovering
But sustained over-consumption leads to failure.
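These boom-bust dynamics map cleanly onto a token-bucket model: bursts succeed up to the bucket's capacity, but sustained spending above the refill rate is rejected. The class below is a minimal sketch with assumed numbers, not a production rate limiter.

```python
# Sketch of boom-bust dynamics as a token bucket: bursts are allowed up
# to capacity, but sustained over-consumption fails until the budget
# recovers. All numbers are illustrative.

class TokenBucket:
    def __init__(self, capacity: float, refill_per_step: float):
        self.capacity = capacity
        self.refill = refill_per_step
        self.level = capacity

    def spend(self, amount: float) -> bool:
        # Refill first (capped at capacity), then try to spend.
        self.level = min(self.capacity, self.level + self.refill)
        if amount <= self.level:
            self.level -= amount
            return True
        return False  # over-consumption: request rejected

bucket = TokenBucket(capacity=100, refill_per_step=10)
print(bucket.spend(80))  # True  -- burst within capacity
print(bucket.spend(80))  # False -- budget not yet recovered
print(bucket.spend(30))  # True  -- refill restored enough headroom
```

The second call fails and the third succeeds for the same reason agents recover after a burst: the refill rate, not the capacity, sets the sustainable long-run spend.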
Constraint Evolution
Constraints change over time:
| Trend | Direction | Implication |
|---|---|---|
| Context windows | Growing | More complex tasks possible |
| Token costs | Falling | Higher volume feasible |
| Model latency | Improving | Better real-time capability |
| Capability per token | Increasing | More efficient reasoning |
CONSTRAINT TRENDS (log scale)

Context  │                        ▄▄▄▄
Window   │                 ▄▄▄▄▄▀
(tokens) │          ▄▄▄▄▄▀
         │    ▄▄▄▄▀
         │▄▄▄▀
         └───────────────────────────────
          2020   2021   2022   2023   2024

Cost per │▀▀▀▀▀▄▄
Token    │       ▀▀▀▀▄▄
($)      │             ▀▀▀▀▄▄
         │                   ▀▀▀▀▄▄
         │                         ▀▀▀
         └───────────────────────────────
          2020   2021   2022   2023   2024

What's expensive today is cheap tomorrow.
What's impossible today is routine tomorrow.
Constraint-Aware Design
Design agents with constraints as first-class considerations:
Graceful Degradation
When resources are exhausted:
- Reduce quality rather than fail completely
- Inform users of limitations
- Prioritize critical functionality
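Graceful degradation can be structured as a fallback chain: try the full-quality path, then progressively cheaper modes, and only as a last resort tell the user. The tier names, thresholds, and responses below are illustrative assumptions.

```python
# Sketch of graceful degradation: degrade quality as the budget shrinks
# instead of failing outright. Tiers and thresholds are assumed.

def answer(query: str, budget_tokens: int) -> tuple[str, str]:
    """Return (tier, response). Tiers degrade as the budget shrinks."""
    if budget_tokens >= 10_000:
        return ("full", f"deep reasoning over retrieved context: {query}")
    if budget_tokens >= 2_000:
        return ("reduced", f"short answer, no retrieval: {query}")
    # Last resort: inform the user rather than silently failing.
    return ("minimal", "Resources exhausted; please retry or narrow the question.")

print(answer("explain X", 50_000)[0])  # full
print(answer("explain X", 3_000)[0])   # reduced
print(answer("explain X", 100)[0])     # minimal
```

The key property is that every budget level still returns something: quality drops in steps, and the lowest tier satisfies "inform users of limitations" rather than erroring out.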
Resource Budgeting
Explicit allocation:
- Token budget per task
- Latency budget per operation
- Cost budget per session/user
Monitoring
Track consumption:
- Alert on approaching limits
- Analyze usage patterns
- Identify optimization opportunities
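Budgeting and monitoring combine naturally in a small tracker that records consumption and alerts before the limit is hit. The class, limits, and 80% alert threshold below are illustrative assumptions.

```python
# Sketch of a per-session budget tracker with an alert threshold.
# The limit and the 80% threshold are assumed values.

class BudgetMonitor:
    def __init__(self, token_limit: int, alert_fraction: float = 0.8):
        self.limit = token_limit
        self.alert_at = token_limit * alert_fraction
        self.used = 0

    def record(self, tokens: int) -> str:
        self.used += tokens
        if self.used > self.limit:
            return "exceeded"
        if self.used >= self.alert_at:
            return "warning"   # approaching the limit: time to compress
        return "ok"

m = BudgetMonitor(token_limit=10_000)
print(m.record(5_000))  # ok
print(m.record(3_500))  # warning
print(m.record(2_000))  # exceeded
```

The "warning" state is where the earlier adaptation strategies kick in: summarize history, prune context, or route to a cheaper model before the budget is exhausted.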
Scaling Strategies
Handle varying demand:
- Queue management under load
- Tiered service levels
- Automatic capacity adjustment
The Resource Politics
Compute constraints create political questions:
- Who gets compute? Allocation decisions have consequences
- Who pays? Cost distribution affects access
- Who decides limits? Governance of shared resources
- What’s prioritized? Quality vs. access trade-offs
These are not purely technical questions—they’re questions about values and priorities.
See Also
- Scaffolding — infrastructure that manages constraints
- Capability Tiers — how constraints bound capability
- Habitat Classification — environments with different constraint profiles
- API Ecosystems — external constraints from dependencies
Related Entries
- API Ecosystems — The digital biome that sustains agent life—the interconnected network of APIs, services, and tools that form the environment in which agents operate.
- Capability Tiers (Taxonomy) — A developmental taxonomy of agent capabilities—from basic language to autonomous self-improvement, mapping the stages of cognitive maturity.
- Habitat Classification (Taxonomy) — Where agents live—a taxonomy of deployment environments, from chat interfaces to embedded systems, and how habitat shapes agent behavior.
- Scaffolding (Anatomy) — The external structures—code, tools, memory systems—that transform a language model into an agent capable of action and persistence.