How Aampe Handles High-Volume Data Ingestion Without Breaking the Bank

Nov 14, 2025
Saiyam Shah

Introduction

Agentic AI enables businesses to unlock massive amounts of value for their customers. In some cases, that means solving conventionally familiar problems so that agents can focus on the harder learning problems. For example, Aampe lets businesses provision and manage a learning agent for every one of their customers, and those agents need to process massive volumes of events in near real time while keeping costs under control.

To make that possible, we built an ELT (Extract, Load, Transform) pipeline that scales to billions of events and users without sacrificing performance or cost efficiency. In this post, we'll dive into our architecture, our cost optimization strategies, and how we've achieved this scale.

The Challenge: Scale Meets Cost Efficiency

When processing events at scale, four primary costs dominate:

  1. Ingress Costs: The cost of data entering cloud services (the transfer itself is often free, but the operations and duplicate copies around it add up)

  2. Egress Costs: The cost of data leaving customer systems (can be significant)

  3. Storage Costs: The cost of storing and querying that data over time

  4. Compute Costs: The cost of processing and transforming data

Our Architecture

Apache Beam: Distributed Processing at Scale

We leverage Apache Beam running on Google Cloud Dataflow to process events in a distributed, scalable manner. This allows us to:

  • Process multiple customers in parallel: Each customer's data pipeline runs independently

  • Handle varying data volumes: From small customers with thousands of events to enterprise customers with millions

  • Scale workers dynamically: Configure worker counts (10-100 workers) based on data volume and customer needs

  • Process multiple file formats: Parquet, NDJSON, Avro, CSV - all handled efficiently

# Example: Parallel processing with configurable workers
pipeline_args.extend([
    "--num_workers=50",
    "--autoscaling_algorithm=NONE",
    "--dataflow_service_options=enable_prime",
    "--experiments=enable_vertical_memory_autoscaling",
])
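
To put those arguments in context, here is a minimal sketch (not Aampe's actual pipeline) of how they might feed a Dataflow job that reads customer Parquet files and appends them to a BigQuery staging table. The bucket path and table name are hypothetical, and Dataflow options such as --project, --region, and --temp_location are omitted for brevity.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_args = [
    "--runner=DataflowRunner",
    "--num_workers=50",
    "--autoscaling_algorithm=NONE",
]

with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromParquet("gs://customer-bucket/events/*.parquet")
        | "WriteEvents" >> beam.io.WriteToBigQuery(
            "aampe-project:staging.events",
            # Table assumed to already exist with partitioning configured.
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )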


Multi-Source Ingestion

Our pipeline supports ingestion from multiple sources:

  • S3 buckets (with cross-account role assumption)

  • Google Cloud Storage

  • Snowflake (direct query-based extraction)

  • BigQuery (cross-project data pulls)

  • Streaming sources (via staging tables)

This flexibility allows us to work with customers regardless of their existing data infrastructure.
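
As an illustration of how this source flexibility can be expressed, here is a hedged sketch of a per-customer source descriptor and a dispatch function. The field names and stubbed reader bodies are hypothetical, not Aampe's actual schema.

from dataclasses import dataclass

# Hypothetical per-customer source descriptor; field names are illustrative.
@dataclass
class SourceConfig:
    source_type: str   # "s3", "gcs", "snowflake", or "bigquery"
    location: str      # bucket/prefix or fully qualified table name
    file_format: str = "parquet"

def read_source(config: SourceConfig) -> str:
    """Dispatch on the declared source type; bodies are stubs for illustration."""
    if config.source_type == "gcs":
        return f"read {config.file_format} files from gs://{config.location}"
    if config.source_type == "s3":
        return f"read {config.file_format} files from s3://{config.location} via an assumed role"
    if config.source_type in ("snowflake", "bigquery"):
        return f"extract rows from {config.location} with an incremental query"
    raise ValueError(f"unsupported source type: {config.source_type}")

print(read_source(SourceConfig("gcs", "customer-bucket/events/2025-11-14/")))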

Cost Optimization Strategies

1. Eliminating Data Duplication: Direct Access Pattern

Customer Service Account Pattern: When ingesting from customer GCS buckets, we use customer service accounts to read data directly without copying it to our buckets first:

from google.cloud import storage

# Use customer credentials to read from their GCS buckets
customer_credentials = ServiceAccountAuth.get_customer_credentials(
    secret_manager_project, customer_id
)
client = storage.Client(credentials=customer_credentials)
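
A possible continuation of that snippet, reading an object in place with the customer-scoped client (the bucket and object names are hypothetical):

# Read an object in place; nothing is copied into an Aampe-owned bucket first.
bucket = client.bucket("customer-events-bucket")
blob = bucket.blob("events/2025-11-14/part-000.parquet")
payload = blob.download_as_bytes()  # or stream it straight into the pipeline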

Benefits:

  • No data duplication: We read directly from customer buckets, avoiding the need to copy data to Aampe's buckets

  • Reduced storage costs: We don't maintain duplicate copies of customer data in our buckets

  • Faster processing: Eliminates the intermediate copy step

  • Note: While GCS ingress is free, the real savings come from avoiding duplicate storage

Cost Impact:

  • Traditional approach: copying 1TB from a customer bucket into an Aampe bucket would add ~$23/month in storage for the duplicate data

  • Direct access: $0 for duplicate storage, since the data is read in place

Additional Benefits:

  • No ingress operations: While GCS ingress is free, we avoid unnecessary data transfer operations

  • Allows us to pull only a subset of the data, as described in the next section


2. Minimizing Egress Costs: Incremental Data Pulling

Incremental Extraction: Instead of pulling full tables every day, we use configurable time windows to pull only recent data. This applies to Snowflake, S3, and other data sources where customers still pay egress costs:

# Only pull data from the last N days (Snowflake example)
query = f"""
    SELECT * FROM {table_name}
    WHERE DATE({incremental_by_column}) >= DATEADD(day, -{data_change_window}, DATE('{run_date}'))
    AND DATE({incremental_by_column}) <= DATE('{run_date}')
""

Key Features:

  • Configurable windows: the data_change_window parameter (typically 0-7 days)

  • Partition-aware queries: Only query relevant partitions

  • Full load on first run: The initial load pulls everything; subsequent runs are incremental (see the sketch after this list)

  • Applies to all sources: Snowflake, S3, GCS, and other external data sources
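
To make the full-load-versus-incremental behavior concrete, here is a small sketch of how the extraction query might be chosen. The function name and the is_first_run flag are illustrative; the other parameters mirror the Snowflake example above.

def build_extract_query(table_name: str, incremental_by_column: str,
                        run_date: str, data_change_window: int,
                        is_first_run: bool) -> str:
    if is_first_run:
        # Initial load: pull the full table once.
        return f"SELECT * FROM {table_name}"
    # Subsequent runs: pull only the configured window of recent days.
    return f"""
        SELECT * FROM {table_name}
        WHERE DATE({incremental_by_column}) >= DATEADD(day, -{data_change_window}, DATE('{run_date}'))
          AND DATE({incremental_by_column}) <= DATE('{run_date}')
    """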

Important Note:

  • For Snowflake, S3, and other sources: Customers still pay egress costs, but we minimize them through incremental pulls

  • For BigQuery sources: Egress is free (see Billing Project Pattern section below) because BigQuery-to-BigQuery transfers within the same region have no egress charges

Cost Impact for Snowflake/S3 Sources:

  • Full table pull: a 1TB table costs ~$120 in egress per pull (at $0.12/GB)

  • Incremental pull (7-day window): ~50GB per pull = ~$6 in egress

Example: A customer with a 500GB events table in Snowflake:

  • Daily full pull: 500GB × 30 days = 15TB/month = ~$1,800/month in egress

  • Daily incremental (1 day): ~17GB/day × 30 = 510GB/month = ~$61/month

For BigQuery Sources: Egress costs are $0 because BigQuery-to-BigQuery transfers within the same region are free, and we use the billing project pattern (see next section).


3. Billing Project Pattern: Absorbing Query Costs

Aampe as Billing Project: When pulling data from customer BigQuery projects, we use customer credentials but set Aampe's project as the billing project:

# Use customer credentials to access their data
json_credentials = self._get_json_cred_from_secret_manager()
# But bill queries to Aampe's project
return bigquery.Client(
    credentials=json_credentials,
    project=settings.BQ_PROJECT_ID,  # Aampe's project
)

How It Works:

  1. We authenticate using customer's service account credentials

  2. We query customer's BigQuery tables (with their permission)

  3. But the query costs are billed to Aampe's project, not the customer's
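
Continuing the snippet above, here is a hedged sketch of what running such a query looks like; the project, dataset, and table names are hypothetical.

# The query reads the customer's tables, but the job (and its on-demand cost)
# is created under Aampe's billing project.
query = """
    SELECT user_id, event_name, event_timestamp
    FROM `customer-project.analytics.events`
    WHERE DATE(event_timestamp) = CURRENT_DATE()
"""
job = client.query(query)   # job is created in settings.BQ_PROJECT_ID (Aampe)
rows = job.result()         # data access is still governed by customer IAM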

Benefits for Customers:

  • Zero query costs: Customers don't pay for queries we run on their data

  • Zero egress costs: BigQuery-to-BigQuery transfers within the same region are free (unlike Snowflake/S3 where egress costs apply)

  • Transparent pricing: Customers only see storage costs, not processing costs

Key Distinction:

  • BigQuery sources: $0 egress costs (free BQ-to-BQ transfers) + $0 query costs (billed to Aampe)

  • Snowflake/S3 sources: Customers pay egress costs, but we minimize them through incremental pulls (95%+ savings)

Cost Impact: For a customer with daily queries scanning 100GB:

  • Customer pays: $0 (queries are billed to Aampe)

  • Traditional approach: at BigQuery's on-demand rate of ~$6.25/TB scanned, the customer would pay ~$0.60/day, or roughly $19/month, in query costs

  • Customer savings: roughly $19/month in query costs, growing proportionally with scan volume

Note: This pattern works because:

  • We have proper IAM permissions via customer service accounts

  • The billing project is just an accounting mechanism - data access is still controlled by customer IAM


4. Strategic BigQuery Partitioning

Daily Time Partitioning: Every table in our system uses daily partitioning, which provides several benefits:

  • Query Cost Reduction: Queries can scan only relevant partitions instead of entire tables

  • Efficient Data Management: Easy to drop old partitions or backfill specific date ranges

  • Better Performance: BigQuery can parallelize queries across partitions

additional_bq_parameters={
    "timePartitioning": {"type": "DAY", "expirationMs": EXPIRATION_MS},
}
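
For illustration, here is one way (not necessarily Aampe's exact code) these parameters can be passed to Beam's BigQuery sink so that newly created staging tables come out day-partitioned with the expiration described below. The project, dataset, schema, and sample row are placeholders.

import apache_beam as beam

EXPIRATION_MS = 180 * 24 * 60 * 60 * 1000  # 180 days, matching the lifecycle policy below

with beam.Pipeline() as p:
    # Tiny stand-in for the real event stream read from customer sources.
    events = p | beam.Create(
        [{"user_id": "u1", "event_name": "app_open", "timestamp": "2025-11-14T00:00:00"}]
    )
    events | beam.io.WriteToBigQuery(
        "aampe-project:staging.events",
        schema="user_id:STRING,event_name:STRING,timestamp:TIMESTAMP",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        additional_bq_parameters={
            "timePartitioning": {"type": "DAY", "expirationMs": EXPIRATION_MS},
        },
    )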

Impact: For a table with 180 days of data, a query targeting a single day scans only 0.55% of the data, resulting in ~99.45% cost reduction for date-filtered queries.


5. Intelligent Clustering

We strategically cluster tables on frequently queried fields:

  • Event tables: Clustered on event_name, event_type, or user_id

  • Attribute tables: Clustered on user_id for fast user lookups

clustering_fields=["event_name", "user_id", "timestamp"]

Impact: Clustering can reduce query costs by 50-90% for queries that filter on clustered columns, as BigQuery can skip entire blocks of data.
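
For illustration, clustering and daily partitioning can be declared together when a table is created with the google-cloud-bigquery client. The project, dataset, and schema below are placeholders rather than Aampe's actual tables.

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_name", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
]
table = bigquery.Table("aampe-project.staging.events", schema=schema)
# Daily partitions plus clustering on the most frequently filtered columns.
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
table.clustering_fields = ["event_name", "user_id"]
client.create_table(table, exists_ok=True)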


6. Automated Data Lifecycle Management

180-Day Expiration Policy: All staging tables automatically expire after 180 days:

EXPIRATION_MS = 180 * 24 * 60 * 60 * 1000  # 180 days

Benefits:

  • Automatic cleanup: No manual intervention needed

  • Cost control: Prevents unbounded storage growth

  • Compliance: Helps meet data retention requirements

Cost Impact: For a customer generating 1TB/month, this policy caps staging storage at roughly 6TB; without it, storage costs would keep climbing by roughly $20/month for every additional month of retained data (at ~$0.02/GB), turning a steady state into unbounded growth.

Scaling Patterns

Per-Customer Isolation

Each customer has:

  • Dedicated service accounts: Isolated IAM and permissions

  • Separate staging buckets: Per-customer GCS buckets for Dataflow staging

  • Independent pipelines: Customer-specific DAGs that can scale independently
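
A hypothetical sketch of what a per-customer deployment record along these lines might look like (field names and values are illustrative, not Aampe's actual configuration):

from dataclasses import dataclass

@dataclass(frozen=True)
class CustomerDeployment:
    customer_id: str
    service_account: str   # dedicated identity with isolated IAM
    staging_bucket: str    # per-customer GCS bucket for Dataflow staging
    dag_id: str            # customer-specific pipeline/DAG
    num_workers: int = 10  # tuned to the customer's data volume (10-100)

acme = CustomerDeployment(
    customer_id="acme",
    service_account="ingest-acme@aampe-project.iam.gserviceaccount.com",
    staging_bucket="gs://aampe-staging-acme",
    dag_id="ingest_acme_daily",
    num_workers=50,
)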

This architecture allows us to:

  • Scale individual customers without affecting others

  • Apply customer-specific optimizations

  • Handle failures in isolation


Parallel Processing Architecture

Our ingestion pipeline processes multiple data types, such as event and user-attribute tables, in parallel.
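
A minimal Beam sketch of this fan-out, with placeholder sources and transforms standing in for the real per-type branches:

import apache_beam as beam

with beam.Pipeline() as p:
    events = p | "ReadEvents" >> beam.Create([{"kind": "event"}])
    attributes = p | "ReadAttributes" >> beam.Create([{"kind": "attribute"}])

    # The branches below run in parallel within the same job.
    events | "ProcessEvents" >> beam.Map(lambda row: {**row, "processed": True})
    attributes | "ProcessAttributes" >> beam.Map(lambda row: {**row, "processed": True})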

Benefits:

  • Reduced total processing time

  • Better resource utilization

  • Independent scaling per data type


Scaling Linearly

Our architecture scales linearly with data volume:

  • 10x more events → ~10x processing time (not 100x)

  • 10x more customers → ~10x total cost (not exponential)

  • Parallel processing ensures we can handle spikes without over-provisioning


Conclusion

By combining strategic BigQuery optimizations (partitioning, clustering, expiration), efficient batch processing with Apache Beam, and a scalable multi-customer architecture, Aampe’s agents can handle massive scale while keeping costs predictable and manageable, both for Aampe and for our customers.