The Problem
BareGold operates as an Amazon FBA retailer and consulting firm managing PPC campaigns, inventory, and listings across US and Canadian marketplaces. Our existing data pipeline—a fleet of Lambda functions polling Amazon's SP-API and Ads API on cron schedules—had a critical flaw: 24-48 hour data lag.
This meant PPC bid optimization was always reacting to yesterday's data, inventory stockout alerts came hours after the fact, order monitoring required manual Seller Central checks, and campaign budget exhaustion wasn't detected until the next day's report.
We were paying $500-1000/month combined for third-party tools (Helium 10, Sellerboard, etc.) that had the same-day data we couldn't access through standard APIs. When we investigated how they got it, we discovered Amazon Marketing Stream—an official push-based data delivery system hiding in plain sight.
The Discovery
During a deep investigation into why no official Amazon API provided same-day PPC data, we found:
Standard Ads API reports have 24-48 hour lag and take 5+ minutes to generate.
SP-API business reports rely on overnight batch processing, not real-time feeds.
The Seller Central dashboard shows real-time data, but offers no API access.
Amazon Marketing Stream, however, is an official push-based system that delivers hourly campaign metrics directly to your AWS account via SQS or Kinesis Firehose.
This was the missing piece. Combined with SP-API's push notification system (ORDER_CHANGE, ITEM_INVENTORY_EVENT_CHANGE, etc.), we could build a complete real-time data pipeline.
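As a sketch of the SP-API side, the handler below parses ORDER_CHANGE notifications delivered to an SQS-triggered Lambda. The payload shape is illustrative (real notification bodies vary by `notificationType`; consult Amazon's SP-API notification schemas), and the function name is ours, not part of any SDK.

```python
import json


def handle_sqs_event(event):
    """Parse SP-API ORDER_CHANGE notifications from an SQS-triggered Lambda.

    The payload layout here is illustrative; actual notification bodies
    differ by notificationType and schema version.
    """
    orders = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        if body.get("notificationType") == "ORDER_CHANGE":
            change = body.get("payload", {}).get("orderChangeNotification", {})
            orders.append({
                "order_id": change.get("amazonOrderId"),
                "status": change.get("summary", {}).get("orderStatus"),
            })
    return orders
```

The same dispatch-on-`notificationType` pattern extends to ITEM_INVENTORY_EVENT_CHANGE and the other subscribed event types.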
The Development Process
[DEVELOPMENT_TIMELINE]
Initial question about real-time SP-API data
Decision to build comprehensive plan for implementation
20KB technical specification generated
Initial plan generated
Round 1 review: 12 critical issues found
Fixes applied, v2 generated
Round 2 review: 8 fixes needed
Fixes applied, v3 generated
Round 3 review: 5 final fixes
Plan finalized—ready to implement
Total elapsed: 42 minutes from idea to production-ready architecture
The Review Cycles
[REVIEW_CYCLES]
25 issues caught pre-production across 3 review cycles:
Round 1: critical architecture and schema issues
Round 2: data handling and cost optimization
Round 3: schema alignment and normalization
* Many issues were subtle ones (JSON SerDe config, camelCase vs snake_case, column name mismatches) that wouldn't have surfaced until the first query returned empty results.
The Architecture
[ARCHITECTURE_OVERVIEW]
The architecture uses Amazon's official push-based systems to receive data in near real-time, then transforms and stores it in S3 with proper partitioning for efficient querying.
Amazon Marketing Stream delivers hourly PPC metrics (clicks, spend, sales) via Kinesis Firehose
SP-API Notifications push real-time events (orders, inventory, Buy Box changes) via SQS
Transform Lambdas normalize data (camelCase → snake_case) and add metadata (marketplace, profile_id)
Data lands in S3 with Hive-style partitioning (year/month/day/hour)
Glue Catalog tables with partition projection enable fast Athena queries
Silver layer views blend real-time stream data (T+0/T+1) with batch data (T+2+)
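The transform step in the list above can be sketched as a small pure function. The field names and metadata keys are illustrative, not the exact schema of our Lambdas:

```python
import re


def to_snake(name):
    """Convert a camelCase key to snake_case, e.g. 'campaignId' -> 'campaign_id'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def normalize_record(record, marketplace, profile_id):
    """Normalize one stream record: rename keys to snake_case and
    attach routing metadata. Field names are illustrative."""
    out = {to_snake(k): v for k, v in record.items()}
    out["marketplace"] = marketplace
    out["profile_id"] = profile_id
    return out
```

Keeping the transform a pure function of (record, metadata) makes it trivial to unit-test outside the Lambda runtime.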
Key Architecture Decisions
[KEY_DECISIONS]
Firehose vs All-Lambda
Decision:
Chose Firehose despite 30x higher cost (~$461/mo vs ~$15/mo)
Rationale:
Production reliability with automatic retry and backpressure handling, multi-tenant readiness for consulting clients, and net savings by replacing $500-1000/mo in third-party SaaS subscriptions.
JSON-First, Parquet-Later
Decision:
Write JSON with partition projection instead of Parquet from day one
Rationale:
Avoids complex SerDe setup and schema rigidity. At our volume (<1GB/day), JSON queries return in under 2 seconds. Optimize when data demands it, not before.
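For concreteness, partition projection on a Glue table looks roughly like the property map below (bucket name and date range are placeholders). With these properties, Athena computes partitions at query time, so no crawler or `MSCK REPAIR TABLE` is needed as hourly folders arrive:

```python
# Illustrative Glue table properties enabling Athena partition projection
# over year/month/day/hour partitions. Bucket and year range are placeholders.
PROJECTION_PROPERTIES = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2024,2030",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "projection.hour.type": "integer",
    "projection.hour.range": "0,23",
    "projection.hour.digits": "2",
    "storage.location.template": (
        "s3://example-bucket/stream/"
        "year=${year}/month=${month}/day=${day}/hour=${hour}/"
    ),
}
```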
Batch + Stream Coexistence
Decision:
Real-time pipeline supplements existing batch polling rather than replacing it
Rationale:
Stream provides same-day directional data for operational decisions. Batch remains source of truth for reporting and historical analysis. Silver views automatically blend both.
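The blending rule the silver views implement can be stated in a few lines. The one-day cutoff is a simplification of our actual view logic:

```python
from datetime import date, timedelta


def blended_source(row_date, today):
    """Routing rule for silver views: recent dates (T+0/T+1) come from
    the real-time stream, older dates from the batch source of truth.
    The one-day cutoff is a simplified stand-in for the real view SQL."""
    return "stream" if (today - row_date) <= timedelta(days=1) else "batch"
```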
Append-Only Attribution
Decision:
Append-only storage with deduplication at query time instead of complex upsert logic
Rationale:
Marketing Stream conversion data updates retroactively (1d, 7d, 14d, 30d windows). Simpler to store all versions and deduplicate with ROW_NUMBER() at query time.
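The query-time dedup is the Python equivalent of `ROW_NUMBER() OVER (PARTITION BY campaign_id, event_date ORDER BY received_at DESC) = 1`. Column names here are illustrative:

```python
def dedupe_latest(rows):
    """Query-time dedup for append-only attribution data: keep only the
    most recently received version of each (campaign_id, event_date).
    Column names are illustrative."""
    latest = {}
    for row in rows:
        key = (row["campaign_id"], row["event_date"])
        if key not in latest or row["received_at"] > latest[key]["received_at"]:
            latest[key] = row
    return list(latest.values())
```

Because later attribution windows simply append new versions, a restated 30-day conversion figure wins automatically without any upsert machinery.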
Multi-Tenant from Day One
Decision:
Built client_id partitioning into the architecture from the start
Rationale:
Adding multi-tenancy after the fact requires migrating all data and rebuilding all views. Baking it in costs nothing extra and enables the consulting business model at ~$30-50/mo per client.
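In practice this means `client_id` leads every S3 prefix. A sketch of the key layout (partition order and dataset names are illustrative):

```python
from datetime import datetime


def s3_key(client_id, dataset, ts):
    """Build a Hive-style partition prefix with client_id as the leading
    partition. The layout and dataset names are illustrative."""
    return (
        f"client_id={client_id}/dataset={dataset}/"
        f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"
    )
```

Putting `client_id` first lets per-client Athena queries prune every other tenant's data before a single byte is scanned.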
The Numbers
[IMPACT_METRICS]
| Metric | Before | After | Improvement |
|---|---|---|---|
| PPC Data Freshness | 24-48 hours | ~1 hour | 24-48x faster |
| Order Awareness | 12-24 hours | Real-time (seconds) | Instant |
| Inventory Snapshots | Daily (often stale) | Hourly | 24x more frequent |
| Buy Box Monitoring | None | Real-time | New capability |
| Financial Event Tracking | Daily batch | Real-time | Instant |
| Monthly Infrastructure Cost | $0 (but $500-1000 in SaaS) | ~$461 | Net savings |
| Data Granularity | Daily aggregates | Hourly, per entity | 24x more granular |
| Per-Client Marginal Cost | N/A | ~$30-50/mo | Scalable |
Lessons Learned
[LESSONS_LEARNED]
Lesson 01
AI Review Cycles Catch Architecture Bugs
25 issues caught across 3 review cycles. Many were subtle ones (JSON SerDe config, camelCase vs snake_case, column name mismatches) that wouldn't have surfaced until the first query returned empty results. Traditional development would have discovered these in production.
Lesson 02
Amazon Has Real-Time Data—You Just Have to Know Where to Look
Marketing Stream has been available since 2022 but is barely documented compared to standard Ads API reports. The same-day data mystery was solved by an official Amazon product, not scraping or special partnerships.
Lesson 03
Start JSON, Optimize Later
The impulse to use Parquet from day one adds significant complexity (SerDe configuration, schema rigidity). At low volumes (<1GB/day), JSON with partition projection is fast enough. Optimize when the data demands it.
Lesson 04
Design for Multi-Tenancy from Day One
Adding client_id partitioning after the fact requires migrating all data and rebuilding all views. Baking it in from the start costs nothing extra and makes the consulting business model work.
Lesson 05
Never Trust Empty Defaults for Credentials
A previous CDK deployment wiped all Lambda credentials because the stack read from environment variables with empty defaults. This pipeline uses Secrets Manager exclusively—credentials survive deployments.
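The fix is to make Secrets Manager the only credential source, with no environment-variable fallback to blank out. A minimal sketch, where the secret name and key layout are placeholders and the client is injected (in production, pass `boto3.client("secretsmanager")`):

```python
import json


def load_credentials(secrets_client, secret_name):
    """Fetch API credentials at runtime from Secrets Manager.

    Pass boto3.client("secretsmanager") in production; injecting the
    client keeps this testable. Secret name and key layout are
    placeholders. There is deliberately no env-var fallback, so a
    redeploy with empty defaults cannot wipe credentials.
    """
    resp = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])
```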
What's Next
[IMPLEMENTATION_ROADMAP]
1. Deploy the CDK stack: SQS queues, Firehose streams, Lambdas, Glue tables
2. Subscribe to Marketing Stream: 6 subscriptions (3 datasets × 2 profiles)
3. Subscribe to SP-API notifications: 6 notification types
4. Verify data flow: confirm data lands in S3 within 1-2 hours
5. Create silver views: execute the SQL in Athena
6. Connect the PPC bid optimizer: switch from batch to hourly data
7. Onboard the first consulting client: prove out the multi-tenant architecture