[Infrastructure]

Building a Real-Time Amazon Data Pipeline

From 48-Hour Lag to Sub-Hour Intelligence

February 3, 2026 · ~45 minutes from concept to production-ready plan · Evgeni (CEO/Architect) + Oracle (AI Business Partner)
AWS CDK · Kinesis Firehose · SQS · Lambda · Athena · Glue
42 Minutes to Plan · 3 Review Cycles · 25 Issues Caught · 5 Key Decisions

01

The Problem

BareGold operates as an Amazon FBA retailer and consulting firm managing PPC campaigns, inventory, and listings across US and Canadian marketplaces. Our existing data pipeline—a fleet of Lambda functions polling Amazon's SP-API and Ads API on cron schedules—had a critical flaw: 24-48 hour data lag.

This meant PPC bid optimization was always reacting to yesterday's data, inventory stockout alerts came hours after the fact, order monitoring required manual Seller Central checks, and campaign budget exhaustion wasn't detected until the next day's report.

We were paying $500-1000/month combined for third-party tools (Helium 10, Sellerboard, etc.) that had the same-day data we couldn't access through standard APIs. When we investigated how they got it, we discovered Amazon Marketing Stream—an official push-based data delivery system hiding in plain sight.

02

The Discovery

During a deep investigation into why no official Amazon API provided same-day PPC data, we found:

- Standard Ads API reports have a 24-48 hour lag and take 5+ minutes to generate.
- SP-API business reports use overnight processing, not real-time.
- The Seller Central dashboard has real-time data, but no API access.

However, Amazon Marketing Stream is an official push-based system delivering hourly campaign metrics directly to your AWS account via SQS or Kinesis Firehose.

This was the missing piece. Combined with SP-API's push notification system (ORDER_CHANGE, ITEM_INVENTORY_EVENT_CHANGE, etc.), we could build a complete real-time data pipeline.

03

The Development Process

[DEVELOPMENT_TIMELINE]

8:41 PM · Initial question about real-time SP-API data
8:46 PM · Decision to build a comprehensive implementation plan
8:50 PM · 20KB technical specification generated
8:56 PM · Initial plan generated
8:58 PM · Round 1 review: 12 critical issues found
9:04 PM · Fixes applied, v2 generated
9:10 PM · Round 2 review: 8 fixes needed
9:16 PM · Fixes applied, v3 generated
9:19 PM · Round 3 review: 5 final fixes
9:23 PM · Plan finalized—ready to implement

Total elapsed: 42 minutes from idea to production-ready architecture

04

The Review Cycles

[REVIEW_CYCLES]

25 Issues Caught Pre-Production · 3 Review Cycles

Round 1 (12 issues): Critical architecture and schema issues
- Silver layer SQL incompatibility
- Missing Glue table column definitions
- Wrong compression settings
- No attribution window handling
- Single Lambda bottleneck risk
- Missing marketplace separation

Round 2 (8 issues): Data handling and cost optimization
- Transformer Lambda JSON handling
- Small files problem (per-event writes)
- Daily aggregate SQL bugs
- Multi-tenant scalability design
- S3 lifecycle policies

Round 3 (5 issues): Schema alignment and normalization
- UNION column name mismatches
- Missing JSON SerDe definitions
- camelCase to snake_case normalization
- Firehose file extension handling
- S3 naming conventions

* Many of these issues were subtle (JSON SerDe config, camelCase vs snake_case, column name mismatches) and wouldn't have surfaced until the first query returned empty results.

05

The Architecture

[ARCHITECTURE_OVERVIEW]

The architecture uses Amazon's official push-based systems to receive data in near real-time, then transforms and stores it in S3 with proper partitioning for efficient querying.

01. Amazon Marketing Stream delivers hourly PPC metrics (clicks, spend, sales) via Kinesis Firehose
02. SP-API Notifications push real-time events (orders, inventory, Buy Box changes) via SQS
03. Transform Lambdas normalize data (camelCase → snake_case) and add metadata (marketplace, profile_id)
04. Data lands in S3 with Hive-style partitioning (year/month/day/hour)
05. Glue Catalog tables with partition projection enable fast Athena queries
06. Silver layer views blend real-time stream data (T+0/T+1) with batch data (T+2+)
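The transform step (03) can be sketched as a small normalization helper. This is a minimal illustration, not the actual pipeline code; the field names (`campaignId`, `costMetrics`) are hypothetical examples of what Marketing Stream payloads contain:

```python
import re

CAMEL = re.compile(r"(?<=[a-z0-9])([A-Z])")

def to_snake_case(name: str) -> str:
    """Convert a camelCase field name to snake_case (profileId -> profile_id)."""
    return CAMEL.sub(r"_\1", name).lower()

def normalize_keys(obj):
    """Recursively rename dict keys from camelCase to snake_case."""
    if isinstance(obj, dict):
        return {to_snake_case(k): normalize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(v) for v in obj]
    return obj

def enrich(record: dict, marketplace: str, profile_id: str) -> dict:
    """Normalize one stream record and attach routing metadata
    so downstream Athena queries can filter by marketplace/profile."""
    out = normalize_keys(record)
    out["marketplace"] = marketplace
    out["profile_id"] = profile_id
    return out
```

Doing the rename once at ingest means every Glue table and silver view downstream sees a single, consistent snake_case schema.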

Push-based delivery → Transform layer → Query layer
06

Key Architecture Decisions

[KEY_DECISIONS]

01. Firehose vs All-Lambda
Decision: Chose Firehose despite 30x higher cost (~$461/mo vs ~$15/mo)
Rationale: Production reliability with automatic retry and backpressure handling, multi-tenant readiness for consulting clients, and net savings by replacing $500-1000/mo in third-party SaaS subscriptions.

02. JSON-First, Parquet-Later
Decision: Write JSON with partition projection instead of Parquet from day one
Rationale: Avoids complex SerDe setup and schema rigidity. At our volume (<1GB/day), JSON queries return in under 2 seconds. Optimize when data demands it, not before.
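Partition projection is what makes the JSON-first approach fast: Athena computes partition locations from table properties instead of scanning the Glue catalog. A hedged sketch of what those properties might look like for the year/month/day/hour layout described above; the bucket name, prefix, and year range are placeholders:

```python
# Hypothetical Glue table parameters for Athena partition projection.
# Property names (projection.*, storage.location.template) are the
# standard Athena keys; values here are illustrative placeholders.
projection_parameters = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2024,2030",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "projection.hour.type": "integer",
    "projection.hour.range": "0,23",
    "projection.hour.digits": "2",
    # Template mapping partition values onto the Hive-style S3 prefix
    "storage.location.template": (
        "s3://example-bucket/stream/"
        "year=${year}/month=${month}/day=${day}/hour=${hour}/"
    ),
}
```

With projection enabled there are no `MSCK REPAIR TABLE` runs or partition registrations; new hourly prefixes are queryable as soon as Firehose writes them.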

03. Batch + Stream Coexistence
Decision: Real-time pipeline supplements existing batch polling rather than replacing it
Rationale: Stream provides same-day directional data for operational decisions. Batch remains source of truth for reporting and historical analysis. Silver views automatically blend both.

04. Append-Only Attribution
Decision: Append-only storage with deduplication at query time instead of complex upsert logic
Rationale: Marketing Stream conversion data updates retroactively (1d, 7d, 14d, 30d windows). Simpler to store all versions and deduplicate with ROW_NUMBER() at query time.
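The query-time deduplication can be illustrated in miniature. This Python sketch mirrors the `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ingested_at DESC)` pattern the silver views use; the field names (`hour_start`, `ingested_at`) are assumed for illustration:

```python
def latest_attribution(rows: list[dict]) -> list[dict]:
    """Given append-only rows where the same (campaign_id, hour_start)
    may appear multiple times as attribution windows close, keep only
    the most recently ingested version of each key."""
    best: dict[tuple, dict] = {}
    for row in rows:
        key = (row["campaign_id"], row["hour_start"])
        # A later ingestion supersedes earlier versions of the same hour
        if key not in best or row["ingested_at"] > best[key]["ingested_at"]:
            best[key] = row
    return list(best.values())
```

Because every retroactive update is just another appended row, the write path stays trivial; all the "upsert" complexity lives in this one read-side rule.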

05. Multi-Tenant from Day One
Decision: Built client_id partitioning into the architecture from the start
Rationale: Adding multi-tenancy after the fact requires migrating all data and rebuilding all views. Baking it in costs nothing extra and enables the consulting business model at ~$30-50/mo per client.
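In practice, multi-tenancy here is mostly an S3 key-layout decision. A hypothetical sketch of the prefix builder, with `client_id` as the leading partition; the dataset name and exact prefix order are assumptions, not the actual bucket layout:

```python
from datetime import datetime

def partition_prefix(client_id: str, dataset: str, ts: datetime) -> str:
    """Build a Hive-style S3 prefix with client_id first, so one
    Athena predicate (client_id = 'x') isolates a tenant's data
    and per-client lifecycle rules can target a single prefix."""
    return (
        f"{dataset}/client_id={client_id}/"
        f"year={ts.year:04d}/month={ts.month:02d}/"
        f"day={ts.day:02d}/hour={ts.hour:02d}/"
    )
```

Because the tenant key sits above the time partitions, onboarding a new client adds prefixes rather than tables, which is what keeps the marginal cost in the ~$30-50/mo range.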

07

The Numbers

[IMPACT_METRICS]

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| PPC Data Freshness | 24-48 hours | ~1 hour | 24-48x faster |
| Order Awareness | 12-24 hours | Real-time (seconds) | Instant |
| Inventory Snapshots | Daily (often stale) | Hourly | 24x more frequent |
| Buy Box Monitoring | None | Real-time | New capability |
| Financial Event Tracking | Daily batch | Real-time | Instant |
| Monthly Infrastructure Cost | $0 (but $500-1000 in SaaS) | ~$461 | Net savings |
| Data Granularity | Daily aggregates | Hourly, per entity | 24x more granular |
| Per-Client Marginal Cost | N/A | ~$30-50/mo | Scalable |

08

Lessons Learned

[LESSONS_LEARNED]

Lesson 01

AI Review Cycles Catch Architecture Bugs

25 issues were caught across three review cycles. Many were subtle (JSON SerDe config, camelCase vs snake_case, column name mismatches) and wouldn't have surfaced until the first query returned empty results. Traditional development would have discovered them in production.

Lesson 02

Amazon Has Real-Time Data—You Just Have to Know Where to Look

Marketing Stream has been available since 2022 but is barely documented compared to standard Ads API reports. The same-day data mystery was solved by an official Amazon product, not scraping or special partnerships.

Lesson 03

Start JSON, Optimize Later

Starting with Parquet on day one adds significant complexity (SerDe configuration, schema rigidity). At low volumes (<1GB/day), JSON with partition projection is fast enough. Optimize when the data demands it.

Lesson 04

Design for Multi-Tenancy from Day One

Adding client_id partitioning after the fact requires migrating all data and rebuilding all views. Baking it in from the start costs nothing extra and makes the consulting business model work.

Lesson 05

Never Trust Empty Defaults for Credentials

A previous CDK deployment wiped all Lambda credentials because the stack read from environment variables with empty defaults. This pipeline uses Secrets Manager exclusively—credentials survive deployments.
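A minimal sketch of the Secrets Manager pattern; the secret name is a placeholder, and the client is injected as a parameter (in a real Lambda it would be `boto3.client("secretsmanager")`) so the function can be exercised without AWS access:

```python
import json

def load_credentials(secret_id: str, client) -> dict:
    """Read API credentials from Secrets Manager rather than environment
    variables, so a CDK redeploy with empty env defaults cannot wipe them.
    `client` must expose get_secret_value(SecretId=...), the standard
    boto3 secretsmanager call."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

Because the secret lives outside the stack, credentials survive any number of deployments; the stack only needs IAM permission to read it.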

09

What's Next

[IMPLEMENTATION_ROADMAP]

1. Deploy CDK stack—SQS queues, Firehose streams, Lambdas, Glue tables
2. Subscribe to Marketing Stream—6 subscriptions (3 datasets × 2 profiles)
3. Subscribe to SP-API notifications—6 notification types
4. Verify data flow—confirm S3 data within 1-2 hours
5. Create silver views—execute SQL in Athena
6. Connect PPC bid optimizer—switch from batch to hourly data
7. Onboard first consulting client—prove multi-tenant architecture

[YOUR_PROJECT]

Want This Level of Rigor for Your Project?

Every project starts with architecture. Let's design your system with the same iterative review process that catches bugs before production.

Start Your Project