DISPATCH_001 · ENGINEERING · Filed Jan 15, 2026 · Operator: Priya Sharma · 14 min read

How we built our content pipeline on AWS (and why we regret some of it)

A tour through VectraSEO's serverless stack: SQS FIFO queues, Lambda workers, DynamoDB single-table, S3 + CloudFront. What works, what doesn't, what we'd do differently if we were starting today.

When I joined VectraSEO to lead engineering, the founding team had already made a core infrastructure decision: go fully serverless on AWS. Specifically: Lambda for compute, DynamoDB for data, SQS for queuing, S3 + CloudFront for storage and delivery. I mostly agreed with those choices and still mostly do. This post is a tour through the stack, including the parts where I was wrong and the parts where the original decision was wrong.

The system in five boxes

At the highest level, the VectraSEO backend is five boxes:

  1. API — FastAPI running on Lambda (via Mangum), behind API Gateway. Serves the REST API that the Vue frontend talks to.
  2. Queue — One SQS FIFO queue for pipeline jobs. Content generation and monitoring scans both go through it, discriminated by job type.
  3. Workers — Two Lambda functions. A pipeline worker that handles content generation and monitoring scans. A scheduler worker that fires hourly and enqueues jobs for active schedules.
  4. Data — A single DynamoDB table with five GSIs. All tenants, all projects, all posts, all monitors, all users live here.
  5. Storage — Two S3 buckets fronted by CloudFront. One for generated content (blog images, HTML artifacts). One for the customer-facing published blog under the vectraseo-blog subdomain.

That's the whole thing. Everything else is details.

Lambda for everything

The bet on Lambda was made early, for three reasons: scale-to-zero (important when you have few customers), no servers to maintain (important when you have a small team), and clean separation of concerns (each worker is its own deployable unit).

A year and a half in, I'd say Lambda was right for ~80% of our use cases and wrong for ~20%. Let me be specific.

Where Lambda works great: the scheduler. It fires hourly via EventBridge, wakes up, checks which schedules are due, enqueues work, goes back to sleep. This is the textbook Lambda use case. We pay pennies a month for it.
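The hourly tick reduces to "select the due schedules, enqueue a job for each." A minimal sketch of the selection step (field names are illustrative, not our actual schema, and the enqueue half is omitted):

```python
from datetime import datetime, timezone

def due_schedules(schedules, now):
    """Return active schedules whose next_run_at has passed.

    `schedules` is a list of dicts with `active` and `next_run_at`
    (ISO 8601, UTC) keys -- a stand-in shape for the real items.
    """
    return [
        s for s in schedules
        if s["active"] and datetime.fromisoformat(s["next_run_at"]) <= now
    ]

schedules = [
    {"id": "sched-1", "active": True,  "next_run_at": "2026-01-15T09:00:00+00:00"},
    {"id": "sched-2", "active": False, "next_run_at": "2026-01-15T09:00:00+00:00"},
    {"id": "sched-3", "active": True,  "next_run_at": "2026-01-15T11:00:00+00:00"},
]
now = datetime(2026, 1, 15, 10, 0, tzinfo=timezone.utc)
due = due_schedules(schedules, now)  # only sched-1 is both active and due
```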

Where Lambda is fine: the API. FastAPI + Mangum on Lambda works. Cold starts are noticeable (200–400ms on a cold hit) but most requests are warm. The scaling is automatic. The cost is low at our current traffic.

Where Lambda is awkward: the pipeline worker. This is the one that bothers me. A content generation job is 30–120 seconds of work, most of it waiting on Gemini API calls. Lambda's 15-minute timeout is fine. The memory is fine. But debugging a long-running Lambda with multiple external API calls is painful. Traces get scattered across CloudWatch. Retries are complicated by SQS visibility timeouts. When something goes wrong in production, the observability is worse than it would be on a long-running process.
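A chunk of that retry pain is getting the queue's visibility timeout right relative to the function timeout. A rule-of-thumb helper (the 6x multiplier follows AWS's published guidance for Lambda-SQS event source mappings; our actual numbers may differ):

```python
def queue_visibility_timeout(function_timeout_s: int, multiplier: int = 6) -> int:
    """Pick a queue visibility timeout for a Lambda-SQS event source.

    AWS guidance is to set the queue's visibility timeout to at least
    six times the function timeout, so in-flight messages aren't
    redelivered while a retry is still possible. SQS caps the value
    at 12 hours (43,200 seconds).
    """
    return min(function_timeout_s * multiplier, 43_200)

# A 180-second worker timeout implies a 1,080-second visibility timeout.
timeout = queue_visibility_timeout(180)
```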

If I were starting the pipeline worker today, I'd probably use ECS Fargate or a small EKS cluster running a consumer process. Not because Lambda is wrong — it's fine — but because a worker that runs continuously has much nicer debugging ergonomics.

We haven't migrated. The Lambda version works. The migration cost is real and the benefit is marginal. But if a new team were starting a similar product today, I'd steer them toward containers for the worker path.

SQS FIFO, specifically FIFO

We use SQS FIFO, not standard SQS. This was a deliberate choice and I still think it was right, but it has costs.

Why FIFO: content generation jobs need to process in order within a project. If a customer enqueues three posts and the second one fails, we don't want the third one to get ahead of it. FIFO guarantees within-group ordering, where the group is the project ID. This makes reasoning about state much easier.

The cost of FIFO: a lower throughput ceiling (300 sends per second per API action without batching, unless you opt into high-throughput mode), a modestly higher per-message price (negligible at our scale, but it matters at some scales), and deduplication via MessageDeduplicationId, which forces you to think about idempotency from message send to message receive.
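How the group and dedup IDs fit together can be sketched as follows (the boto3 `send_message` call itself is omitted, and the job schema is illustrative, not our production shape):

```python
import hashlib
import json

def fifo_send_kwargs(queue_url: str, job: dict) -> dict:
    """Build kwargs for an SQS FIFO send_message call.

    MessageGroupId is the project ID, so ordering is guaranteed
    per project. MessageDeduplicationId is a content hash, so an
    accidental double-send of the same job within SQS's 5-minute
    dedup window gets dropped.
    """
    body = json.dumps(job, sort_keys=True)  # canonical serialization
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": job["project_id"],
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# The same logical job always produces the same deduplication ID:
job = {"project_id": "proj-42", "type": "generate_post", "post_id": "post-7"}
a = fifo_send_kwargs("https://sqs.us-east-1.amazonaws.com/123/jobs.fifo", job)
b = fifo_send_kwargs("https://sqs.us-east-1.amazonaws.com/123/jobs.fifo", job)
```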

The design I actually like is: pipeline worker receives a job, writes a deterministic pre-execution marker to DynamoDB, does the work, writes a completion marker. If the worker is retried due to SQS redelivery, it sees the pre-execution marker and aborts the duplicate execution. This is belt-and-suspenders with FIFO's dedup, but SQS FIFO dedup is a 5-minute window and that's sometimes too short.
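The marker pattern is easiest to see with the storage swapped out. In this sketch an in-memory dict stands in for DynamoDB; the real version is a PutItem with `ConditionExpression="attribute_not_exists(pk)"`, but the control flow is the same:

```python
class MarkerStore:
    """In-memory stand-in for the DynamoDB marker items."""
    def __init__(self):
        self._items = {}

    def claim(self, job_id: str) -> bool:
        # Plays the role of a conditional put: succeeds only if
        # no marker exists for this job yet.
        if job_id in self._items:
            return False
        self._items[job_id] = "STARTED"
        return True

    def complete(self, job_id: str):
        self._items[job_id] = "DONE"

def handle(job_id: str, store: MarkerStore, work) -> str:
    if not store.claim(job_id):
        return "skipped-duplicate"  # SQS redelivered an in-flight job
    work()
    store.complete(job_id)
    return "done"

store = MarkerStore()
ran = []
first = handle("job-1", store, lambda: ran.append(1))
# A redelivery of the same job aborts instead of re-running the work:
second = handle("job-1", store, lambda: ran.append(1))
```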

DynamoDB single-table

The data model is a single DynamoDB table. Five global secondary indexes. Everything — users, projects, posts, monitors, issues, schedules, jobs, magic-link tokens — lives in one table, distinguished by the shape of the partition and sort keys.

This is the "single-table design" pattern that the AWS docs and Rick Houlihan's talks advocate. It is genuinely correct, and it is also genuinely painful to work with.

What's correct about it: DynamoDB is a key-value store. The single-table pattern lets us pre-compute our access patterns into partition/sort key designs, which turns every query into a single-partition GetItem or Query with predictable latency regardless of table size. We get a consistent-performance database that scales without us thinking about it. For our workload (mostly known access patterns, bounded cardinality per partition), this is the right tool.

What's painful: every new access pattern requires thought. Adding a new query might require a new GSI. Adding a new entity type requires figuring out how its keys fit into the existing scheme. Debugging is harder than with SQL because the shape of queries is implicit in the code rather than visible in a query log.
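To make the pattern concrete, here is the flavor of it as a key builder. These are NOT our real key schemas, just an illustration of how entity type and parent get encoded into pk/sk prefixes so related items share a partition and sort together:

```python
def keys(entity: str, **ids) -> dict:
    """Illustrative partition/sort key shapes for a single-table design."""
    if entity == "project":
        return {"pk": f"TENANT#{ids['tenant_id']}",
                "sk": f"PROJECT#{ids['project_id']}"}
    if entity == "post":
        return {"pk": f"PROJECT#{ids['project_id']}",
                "sk": f"POST#{ids['post_id']}"}
    raise ValueError(f"unknown entity type: {entity}")

# "All posts in a project" becomes a single-partition query:
#   Query(pk == "PROJECT#p1", sk begins_with "POST#")
k = keys("post", project_id="p1", post_id="a9")
```

The pain described above lives in this function: every new entity or access pattern means revisiting these prefixes, and the query shapes exist only here in the code, not in any schema you can inspect.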

I've reconsidered this choice maybe three times. Each time, I've concluded that the operational simplicity of DynamoDB (no provisioning, no maintenance, no vacuum, no replicas) outweighs the developer-experience cost. But it's close. If we had more complex reporting requirements — ad-hoc queries, analytical joins, aggregations — I'd want PostgreSQL.

Our compromise: DynamoDB for transactional operations, periodic snapshots to S3 + Athena for analytical queries. This works. It's not as clean as a single database, but it's pragmatic.

S3 + CloudFront for content

The least surprising part of our stack. Two buckets, two CloudFront distributions, straight S3 origin with CloudFront caching.

One thing we did right: we use CloudFront Functions (not Lambda@Edge) for lightweight URL rewriting. Much cheaper than Lambda@Edge. For the complexity of URL manipulation we need, CloudFront Functions is plenty.

One thing I'd reconsider: we cache all S3-served content very aggressively (1 year on hashed URLs). This is fine for images, but for published blog HTML, it means that when a customer updates a post, we have to invalidate the cached HTML. CloudFront invalidations are not free at scale, and our customers can edit posts after publishing. We budget for invalidations but they're a noticeable cost line item.

If I were starting over: shorter cache TTLs for HTML (30 minutes), aggressive caching for images. This would reduce invalidation spend substantially.
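The invalidation math is simple enough to sketch. The defaults below reflect CloudFront list pricing at the time of writing (first 1,000 invalidation paths per month free, roughly $0.005 per path after that); check current pricing before relying on the numbers:

```python
def monthly_invalidation_cost(paths_per_month: int,
                              free_paths: int = 1_000,
                              price_per_path: float = 0.005) -> float:
    """Back-of-envelope CloudFront invalidation spend per month."""
    return max(paths_per_month - free_paths, 0) * price_per_path

# Hypothetical volume: 20,000 post-edit invalidations a month is ~$95.
# A 30-minute HTML TTL makes most of those invalidations unnecessary,
# at the cost of edits taking up to 30 minutes to propagate.
cost = monthly_invalidation_cost(20_000)
```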

The parts I regret

Being specific about regrets:

Over-using Lambda for the worker path. Already mentioned. ECS Fargate or EKS would have been better for debugging.

DynamoDB Streams for change tracking. We initially wired up DynamoDB Streams to trigger downstream updates (e.g., when a post changes status, update the project's counters). Streams are fine but they're a lagging source of truth, and debugging missing events is painful. We moved most of this to synchronous updates in the same transaction as the primary write. Simpler.
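The synchronous version is just both writes in one TransactWriteItems call. A sketch of the request shape the worker would build (the boto3 client call is omitted, and the key names are illustrative, not our production schema):

```python
def post_status_transaction(table: str, project_id: str,
                            post_id: str, new_status: str) -> dict:
    """Shape of a DynamoDB TransactWriteItems request that flips a
    post's status and bumps the project's published counter in one
    atomic transaction -- the synchronous replacement for the
    Streams-based update."""
    return {
        "TransactItems": [
            {"Update": {
                "TableName": table,
                "Key": {"pk": {"S": f"PROJECT#{project_id}"},
                        "sk": {"S": f"POST#{post_id}"}},
                "UpdateExpression": "SET #s = :s",
                "ExpressionAttributeNames": {"#s": "status"},
                "ExpressionAttributeValues": {":s": {"S": new_status}},
            }},
            {"Update": {
                "TableName": table,
                "Key": {"pk": {"S": f"PROJECT#{project_id}"},
                        "sk": {"S": "META"}},
                "UpdateExpression": "ADD published_count :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }},
        ]
    }

req = post_status_transaction("app-table", "p1", "a9", "PUBLISHED")
```

Either both items update or neither does, so the counter can never drift from the posts the way it could when a Stream event went missing.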

Magic link tokens stored in DynamoDB. These are short-lived, single-use tokens. They have no business being in our main table. They should be in Redis with a TTL. We kept them in DynamoDB for consistency of infrastructure — one data store — but it's a misuse and the table churn it creates is noticeable.

Cold-start-sensitive API paths. Some of our API endpoints make a database call and an LLM call (for inline preview generation). Cold Lambda + cold Gemini client = 2-second first response. We should have prioritized keeping these paths warm with provisioned concurrency, or moved them to a warm worker, earlier than we did.

The parts I'd keep

To balance:

Single DynamoDB table. Despite the complexity, I'd do it again. Operational simplicity wins.

SQS FIFO. The within-project ordering guarantee has prevented many classes of bug. Worth the cost.

FastAPI + Mangum. Good choice. Python. Good type hints. Pydantic validation. Works on Lambda. Works on ECS if we migrate. Portable.

Vue + Pinia on the frontend. Not AWS, but related. Small learning curve. Clean state management. Good ecosystem. No regrets here.

CDK for infrastructure. Python CDK. Not everyone loves CDK but I like that I can write Python for infrastructure in the same repo as the application. One language, one mental model, one PR that moves both the code and the infra.

Cost reality

At our current scale (hundreds of active customers, thousands of posts generated per month), our full AWS bill is about $1,400/month. The biggest line items:

  • CloudFront: ~$400
  • Lambda invocations: ~$280
  • DynamoDB (capacity + storage): ~$250
  • S3 (storage + requests): ~$180
  • Other (SQS, CloudWatch, data transfer, Route53): ~$290

Per customer, that's about $6/month in infrastructure. Our revenue per customer is substantially higher than that, so the infrastructure margin is fine.

The Gemini API bill, by the way, is roughly 5× the AWS bill. Most of our marginal cost is AI inference, not infrastructure. This is true for most AI products right now and will be for a while.

What I'd tell another founder

If a founder asked me what to copy from our stack:

Copy the serverless defaults for your API and scheduler. They'll let you scale to zero and focus on product.

Copy DynamoDB if your access patterns are known and bounded. Do not copy it if you need ad-hoc analytical queries.

Copy SQS FIFO if you need ordering guarantees. Copy SQS Standard if you don't.

Do not copy Lambda for long-running workers. Use ECS Fargate or equivalent.

Do not over-optimize. We didn't need any of the performance tuning we'd eventually do. At startup scale, simple and correct beats fast.

Do use CDK or Terraform. Do not hand-write CloudFormation. Do not click-ops.

Do monitor from day one. CloudWatch metrics, structured logging, a simple health endpoint. You'll wish you had it later.

That's the tour. The stack has gotten us to product-market fit. It has enough runway to get us to Series A scale. Beyond that, we'll probably have to migrate pieces — the worker path first, maybe the storage layer if analytics gets serious. But we're a couple of years away from needing to worry about that.

Build the boring infrastructure. Let it run. Go spend your time on the product.

[ END_OF_DISPATCH ]
PS
Priya Sharma
Engineering — VectraSEO

Field reports filed by operators who actually run the system. If something in this dispatch is wrong, tell us — dispatch@vectraseo.com.