MarTech Data Pipelines: Architecture Patterns That Actually Scale

Every MarTech stack eventually develops the same problem: the same user exists in six different tools, and none of them agree on who that user is, what they paid, or when they last logged in. GA4 shows one conversion count. HubSpot shows another. The data warehouse has a third number. The marketing team stops trusting the data, starts maintaining spreadsheets, and the engineering team gets blamed for something that was actually an architecture problem from the start.

The good news is this is a solved problem, architecturally. The bad news is the solutions require upfront investment that feels disproportionate to the immediate problem, which is why most stacks end up in the mess before the fix.

The Three Architectural Patterns

Pattern 1: Point-to-Point Integrations

This is the default. Tool A sends data to Tool B via a native integration or webhook. Tool B sends data to Tool C. Each tool is connected to each other tool that needs its data.

It works at small scale. Five tools, eight integrations — that is manageable. Each integration is configured independently, usually by a marketer or a developer in a one-day session, and it does what it says on the box.

The failure mode is well-documented but always underestimated: the number of integrations grows quadratically with the number of tools. Fifteen tools can produce 105 integration paths. Each integration has its own schema, its own user identifier, its own sync frequency, its own failure mode. When one tool changes its API or data model, you are debugging multiple integrations simultaneously.
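The 105 figure is the worst case in which every tool is connected to every other tool, which is n(n-1)/2 pairs for n tools. A quick sketch of that arithmetic (the function name is illustrative only):

```python
def max_integration_paths(tool_count: int) -> int:
    """Upper bound on pairwise integrations: every tool linked to every other."""
    if tool_count < 0:
        raise ValueError("tool_count must be non-negative")
    return tool_count * (tool_count - 1) // 2


# Doubling the tool count roughly quadruples the worst-case path count.
for n in (5, 15, 30):
    print(n, max_integration_paths(n))
```

Real stacks rarely connect every pair, so treat this as a ceiling rather than a forecast.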

The data quality failure is more subtle. Each point-to-point integration makes assumptions about identity. Tool A thinks the user is identified by email. Tool B uses a cookie ID. The native integration between them maps one to the other using a best-effort lookup that might succeed for, say, 80% of users. The remaining 20% fail silently, and over months those failures accumulate into a material attribution error.

Use point-to-point integrations when you have fewer than eight tools with limited data overlap, when your team lacks the engineering capacity to build and maintain a centralized data layer, and when the business is early enough that investment in infrastructure is premature.

Pattern 2: Event Bus / Message Queue

An event bus (Kafka, AWS Kinesis, Google Pub/Sub, or a managed solution like Segment as a routing layer) sits between your data producers and your data consumers. Events are published to the bus once and consumed by any number of downstream systems.

The architecture looks like this: your application publishes a purchase_completed event with a canonical data structure. HubSpot consumes it and updates the contact. GA4 consumes it and records the conversion. Your data warehouse consumes it and appends it to the transactions table. The email platform consumes it and triggers the confirmation sequence.
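The publish-once, consume-many flow described above can be sketched with a toy in-memory bus. This is only an illustration of the shape: `InMemoryBus`, `Event`, and the lambda consumers are hypothetical stand-ins, not the API of Kafka, Kinesis, Pub/Sub, or Segment.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    name: str
    user_id: str
    properties: dict = field(default_factory=dict)


class InMemoryBus:
    """Toy stand-in for a real event bus: one publish, many consumers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[event_name].append(handler)

    def publish(self, event: Event) -> None:
        # The producer does not know or care who is listening.
        for handler in self._subscribers[event.name]:
            handler(event)


crm_updates, warehouse_rows = [], []
bus = InMemoryBus()
bus.subscribe("purchase_completed", lambda e: crm_updates.append(e.user_id))
bus.subscribe(
    "purchase_completed",
    lambda e: warehouse_rows.append((e.user_id, e.properties["amount"])),
)
bus.publish(Event("purchase_completed", "u_123", {"amount": 49.0, "currency": "USD"}))
```

Adding a third consumer is one more `subscribe` call; the publishing code does not change.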

The advantages:

Each system subscribes to the events it cares about independently
Adding a new downstream consumer requires no changes to the producer
Failed deliveries can be replayed from the event log
The bus provides a durable audit trail of everything that happened
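Replay is the least intuitive of these advantages, so here is a minimal sketch of how it works, assuming an append-only log where each consumer tracks its own offset. `EventLog` and `flaky_crm` are hypothetical names; real systems add partitions, retention, and persistent offset storage.

```python
class EventLog:
    """Append-only log; each consumer keeps its own committed offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}

    def append(self, event: dict) -> int:
        self._events.append(event)
        return len(self._events) - 1

    def deliver(self, consumer: str, handler) -> int:
        """Deliver events past the consumer's offset; stop at the first failure."""
        offset = self._offsets.get(consumer, 0)
        delivered = 0
        for event in self._events[offset:]:
            try:
                handler(event)
            except Exception:
                break  # offset is not advanced, so this event is retried next run
            offset += 1
            delivered += 1
        self._offsets[consumer] = offset
        return delivered


log = EventLog()
for i in range(3):
    log.append({"name": "purchase_completed", "order_id": i})

received = []
outage = {"active": True}


def flaky_crm(event):
    # Simulates a downstream tool that is unavailable for one event.
    if outage["active"] and event["order_id"] == 1:
        raise ConnectionError("CRM unavailable")
    received.append(event["order_id"])


first_run = log.deliver("crm", flaky_crm)   # delivers order 0, fails on order 1
outage["active"] = False
second_run = log.deliver("crm", flaky_crm)  # resumes from order 1
```

Because the log itself is never mutated by consumers, the same events remain available to any other consumer, which is what makes the audit trail durable.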

The operational cost: you are running and maintaining a message queue infrastructure. Kafka clusters require operations expertise. Managed solutions like AWS Kinesis reduce that burden but add vendor coupling. And the event schema governance problem — what fields does each event contain, who controls the schema, how do breaking changes get deployed — requires engineering discipline to solve.
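One lightweight way to start on schema governance is a versioned registry that producers validate against before publishing. The sketch below assumes a hand-rolled registry keyed by event name and version; the field list for purchase_completed is invented for illustration, and a production setup would more likely use a dedicated schema registry.

```python
# Hypothetical registry: (event name, schema version) -> required fields and types.
SCHEMAS = {
    ("purchase_completed", 1): {
        "user_id": str,
        "order_id": str,
        "amount": float,
        "currency": str,
    },
}


def validate_event(event: dict) -> list:
    """Return a list of schema violations; an empty list means the event is valid."""
    key = (event.get("name"), event.get("schema_version"))
    schema = SCHEMAS.get(key)
    if schema is None:
        return [f"unknown event/version: {key}"]
    errors = []
    payload = event.get("properties", {})
    for field_name, expected_type in schema.items():
        if field_name not in payload:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected_type):
            errors.append(f"{field_name}: expected {expected_type.__name__}")
    return errors


good_event = {
    "name": "purchase_completed",
    "schema_version": 1,
    "properties": {"user_id": "u_1", "order_id": "o_1", "amount": 49.0, "currency": "USD"},
}
bad_event = {
    "name": "purchase_completed",
    "schema_version": 1,
    "properties": {"user_id": "u_1", "order_id": "o_1", "currency": "USD"},
}
```

Keying on the version means a breaking change ships as a new entry rather than a silent edit to an existing one.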

Use an event bus architecture when you have high event volume (millions of events per day), when you need real-time delivery to multiple consumers, or when your current point-to-point integrations are producing visible data quality problems.

Pattern 3: Warehouse-First with Reverse ETL

The warehouse-first pattern treats your data warehouse as the central source of truth. All data flows in (via ETL from source systems) and the warehouse is the entity that knows your users. Reverse ETL tools (Census, Hightouch, Polytomic) sync data from the warehouse back out to the MarTech tools that need it.

The data flow: your product database, CRM, and event stream all load data into the warehouse via ETL jobs. A dbt layer transforms raw data into clean, modeled tables — a users table, an orders table, a user_activity table. Reverse ETL syncs from those modeled tables into HubSpot, Salesforce, Klaviyo, and wherever else it belongs.
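To make the modeling step concrete, here is a small sketch that uses SQLite as a stand-in for the warehouse. The raw table names, columns, and sample rows are assumptions made for the example; the SELECT is the kind of statement a dbt model for a users table might contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (order_id TEXT, user_id TEXT, amount REAL, ordered_at TEXT);
CREATE TABLE raw_crm_contacts (user_id TEXT, email TEXT);

INSERT INTO raw_orders VALUES
  ('o1', 'u1', 40.0, '2024-01-05'),
  ('o2', 'u1', 60.0, '2024-02-10'),
  ('o3', 'u2', 25.0, '2024-02-11');
INSERT INTO raw_crm_contacts VALUES
  ('u1', 'ada@example.com'),
  ('u2', 'bob@example.com');

-- Modeled table: one row per user, built from the raw sources.
CREATE TABLE users AS
SELECT c.user_id,
       c.email,
       COUNT(o.order_id)          AS order_count,
       COALESCE(SUM(o.amount), 0) AS lifetime_value,
       MAX(o.ordered_at)          AS last_order_at
FROM raw_crm_contacts c
LEFT JOIN raw_orders o ON o.user_id = c.user_id
GROUP BY c.user_id, c.email;
""")

modeled_users = conn.execute(
    "SELECT user_id, email, order_count, lifetime_value, last_order_at "
    "FROM users ORDER BY user_id"
).fetchall()
```

The modeled table is what both BI dashboards and reverse ETL syncs read from, which is how the two teams end up with the same numbers.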

The advantages:

SQL is the query language — no proprietary data transformation tools to learn
Your BI team and your marketing ops team work from the same data model
Identity resolution happens in SQL in the warehouse, not in 12 different tool-specific ways
The warehouse provides point-in-time queryability — you can reconstruct what any user’s data looked like on any date

The tradeoff is latency. Reverse ETL syncs are typically batch jobs running on intervals, hourly or daily. Real-time triggering is possible but adds complexity. If your activation use cases require sub-minute latency (triggered emails seconds after a user action), a warehouse-first design needs additional real-time components to deliver it.

Use warehouse-first when you have a data team already running a warehouse, when your activation latency requirements allow for batch sync frequencies, or when data quality and consistent reporting are higher priorities than real-time activation.

When a CDP Is the Right Answer

A Customer Data Platform promises to unify all your customer data in one place, resolve identity across channels, and power real-time segmentation for activation. Tools in this category include Segment, Lytics, mParticle, Treasure Data, and Tealium.

A CDP is the right answer when:

You have high traffic and real-time activation requirements that warehouse-first cannot satisfy with acceptable latency
You have a significant anonymous-to-identified resolution problem (e-commerce sites with large anonymous browsing periods before purchase)
You have a dedicated marketing team with budget for the tooling and the operational capacity to use it
Your identity resolution problem involves multiple channels (web, mobile, offline) that are difficult to stitch together in SQL

A CDP is probably overkill when:

You have fewer than 100,000 monthly active users
Your activation use cases are primarily batch (weekly email sends, monthly ad audience refreshes)
You already have a working data warehouse and an engineering team comfortable with SQL
The CDP’s cost exceeds the value of the use cases it unlocks

The CDP vendors will tell you that you need a CDP at $50K ARR. That is usually not true. The decision framework is use case driven: list the three most important things you need customer data to power, and evaluate whether a warehouse-plus-reverse-ETL approach can accomplish them with acceptable latency. If yes, start there and add a CDP later if requirements change.

Identity Resolution in Depth

Identity resolution is the process of associating multiple identifiers with the same real-world person. It is the hardest problem in MarTech infrastructure, and getting it wrong quietly corrupts every metric that depends on user-level analysis.

The identifiers that need to be stitched together:

Cookie IDs — set by your analytics platform or CDP on first visit. Third-party cookies are effectively dead; you are working with first-party cookies. These expire, get cleared, and do not travel across devices.

Device IDs — assigned by mobile operating systems (IDFA on iOS, GAID on Android). IDFA is now opt-in by default in iOS 14.5+, which has decimated its usefulness for cross-app tracking. Useful for in-app session continuity.

Email addresses — the most durable identifier for known users. Available after authentication or form submission. Not available for anonymous sessions. Users can have multiple email addresses.

First-party user IDs — your system’s internal identifier for an authenticated user. The most reliable identifier. Stable, controlled by you, not subject to browser or OS policy changes.

Phone numbers — available when users provide them. Useful for SMS marketing and call attribution. Needs normalization (E.164 format) before use as a lookup key.
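As an illustration of the normalization step for phone numbers, the following is a deliberately naive sketch. It assumes a default country code of 1 for numbers entered without a plus sign and uses a rough minimum length as a plausibility check; a dedicated phone-number library handles national formats far more thoroughly.

```python
import re
from typing import Optional


def to_e164(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Best-effort E.164 normalization; returns None when the input is implausible."""
    cleaned = raw.strip()
    has_plus = cleaned.startswith("+")
    digits = re.sub(r"\D", "", cleaned)
    if not has_plus:
        if len(digits) == 10:
            # Assumed to be a national number missing its country code.
            digits = default_country_code + digits
        elif not (len(digits) == 11 and digits.startswith(default_country_code)):
            return None
    # E.164 allows at most 15 digits; the lower bound of 8 is a heuristic.
    if not (8 <= len(digits) <= 15) or digits.startswith("0"):
        return None
    return "+" + digits


print(to_e164("(415) 555-0132"))
print(to_e164("+44 20 7946 0958"))
```

Normalizing before storage means two tools that captured the same number in different formats still join on the same key.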

The practical identity graph for most products has three states:

Anonymous — cookie ID only. The user is a visitor, not yet a known person.
Transitioning — cookie ID + email, linked at the moment of form submission or login initiation, before account confirmation.
Identified — cookie ID + email + user ID, fully linked after authentication.

The engineering requirement at each transition: explicitly link the new identifier to the existing anonymous ID. This is the identify() call in Segment (or alias() for destinations that require an explicit merge), the identity stitching logic in your warehouse, or the merge event in your CDP. Without an explicit link, pre-authentication behavior is silently orphaned from the user's profile once they authenticate.
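The explicit-link requirement maps naturally onto a union-find structure, where each link merges two identifiers into one person. The sketch below is one possible way to model it, not a description of how any particular CDP implements stitching; `IdentityGraph` and the sample identifiers are hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers such as ('cookie', 'c_789') or ('email', 'a@x.com')."""

    def __init__(self):
        self._parent = {}

    def _find(self, node):
        self._parent.setdefault(node, node)
        while self._parent[node] != node:
            # Path halving keeps lookups fast as chains grow.
            self._parent[node] = self._parent[self._parent[node]]
            node = self._parent[node]
        return node

    def link(self, a, b) -> None:
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_person(self, a, b) -> bool:
        return self._find(a) == self._find(b)


graph = IdentityGraph()
cookie = ("cookie", "c_789")
email = ("email", "ada@example.com")
user_id = ("user_id", "u_42")

graph.link(cookie, email)    # form submission: anonymous -> transitioning
graph.link(email, user_id)   # authentication: transitioning -> identified
```

After both links, events recorded against the cookie resolve to the same person as events recorded against the user ID, which is exactly what is lost when a link is skipped.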

The probabilistic identity problem: on mobile and in cookieless environments, deterministic linking (matching on a shared identifier both sessions have) is sometimes not possible. Probabilistic matching — inferring that two sessions are the same person based on IP address, device fingerprint, and behavioral signals — has accuracy limits. Be explicit about which matches in your identity graph are deterministic vs probabilistic, and do not treat probabilistic matches as ground truth in revenue attribution.
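One way to keep the two kinds of match separate is to store the match type on every link and filter by use case. This is a sketch under assumptions: the `IdentityEdge` shape, the function names, and the 0.9 confidence threshold are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityEdge:
    left: str
    right: str
    match_type: str    # "deterministic" or "probabilistic"
    confidence: float  # 1.0 for deterministic links


def edges_for_attribution(edges):
    """Keep only links safe to treat as ground truth in revenue reporting."""
    return [e for e in edges if e.match_type == "deterministic"]


def edges_for_audiences(edges, min_confidence=0.9):
    """Assumption: looser use cases may accept high-confidence probabilistic links."""
    return [
        e for e in edges
        if e.match_type == "deterministic" or e.confidence >= min_confidence
    ]


edges = [
    IdentityEdge("cookie:c_1", "user:u_1", "deterministic", 1.0),
    IdentityEdge("cookie:c_2", "user:u_1", "probabilistic", 0.95),
    IdentityEdge("cookie:c_3", "user:u_1", "probabilistic", 0.60),
]
```

Storing the type on the edge, rather than deciding at query time, keeps the distinction visible to everyone who reads the identity graph later.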

Privacy Compliance: GDPR and CCPA

GDPR and CCPA impose specific requirements on data pipelines that many implementations handle incorrectly.

GDPR requires lawful basis for processing. For marketing use cases, this is usually consent. Consent must be specific (not blanket “by using this site you agree to everything”), withdrawable, and auditable. Your data pipeline must respect consent state — if a user has consented to analytics but not to marketing personalization, their behavioral data should flow to your analytics warehouse but not to your ad platform for retargeting.

Data minimization means you should not collect data that you do not have a purpose for. Every field in your event schema and every enrichment attribute should have a documented purpose. Auditors look for data that is being collected without a stated use case.

Right to erasure means you need to be able to delete a user’s personal data from every system it lives in. In a point-to-point architecture with 15 tools, erasure means 15 separate deletion jobs with no central record of which ones ran. In a warehouse-first architecture, a deletion in the warehouse followed by a reverse ETL sync or direct API calls can propagate the deletion to downstream tools from one place, which is much more tractable.

CCPA’s data sale provisions affect MarTech pipelines that share data with ad platforms for targeting — California residents have the right to opt out. If your pipeline shares customer data with Google Ads, Meta, or any other advertising platform, you need an opt-out mechanism and you need your pipeline to honor it.

The engineering requirement is not just filtering data on the way out — it is tagging data with consent attributes at collection time so downstream systems can filter correctly without requiring re-identification.
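A minimal sketch of consent tagging at collection time, with routing that reads only the tag. The destination names, the purpose labels, and the `do_not_sell` flag are assumptions chosen for the example, not a standard schema.

```python
# Hypothetical mapping of destination -> the consent purpose it requires.
DESTINATION_PURPOSE = {
    "analytics_warehouse": "analytics",
    "ad_platform": "marketing",
}


def collect(event: dict, consent: dict) -> dict:
    """Stamp the user's consent state onto the event when it is collected."""
    return {**event, "consent": dict(consent)}


def allowed_destinations(event: dict) -> list:
    """Decide routing from the event alone, with no user lookup needed."""
    consent = event.get("consent", {})
    # An opt-out of sale or sharing blocks ad platforms regardless of other flags.
    blocked = {"ad_platform"} if consent.get("do_not_sell") else set()
    return [
        destination
        for destination, purpose in DESTINATION_PURPOSE.items()
        if consent.get(purpose, False) and destination not in blocked
    ]


analytics_only = collect(
    {"name": "page_viewed", "user_id": "u_1"},
    {"analytics": True, "marketing": False},
)
opted_out = collect(
    {"name": "page_viewed", "user_id": "u_2"},
    {"analytics": True, "marketing": True, "do_not_sell": True},
)
```

An event with no consent tag routes nowhere, so a collection bug fails closed instead of leaking data to an ad platform.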

Choosing an Architecture: A Decision Framework

The right architecture depends on three variables: your event volume, your activation latency requirements, and your team’s engineering capacity.

Under 1M events/month, no real-time activation requirements, small engineering team: Point-to-point integrations with a CDP like Segment as the routing layer. Segment’s free tier covers low-volume use cases and its destinations ecosystem handles most common integration needs without custom code. Focus on getting your event taxonomy right and your identity resolution correct at authentication time.

1M-50M events/month, batch activation acceptable, engineering team with SQL skills: Warehouse-first. Set up a data warehouse (BigQuery is the easiest starting point), implement an event stream to the warehouse via Segment or a native SDK, build a dbt model layer, and use a reverse ETL tool for activation. This gives you the cleanest data model and the most analytical flexibility.

50M+ events/month, real-time activation required, dedicated data team: Event bus architecture with a specialized CDP for identity resolution. This is where Kafka or Kinesis, a CDP like mParticle or Tealium, and a warehouse all coexist and each plays a specific role.
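The three tiers above can be written down as a small lookup, which is mostly useful for spotting combinations the framework does not cover. The function name and the team labels ("small", "sql", "dedicated_data") are hypothetical; the thresholds are the ones stated in the tiers.

```python
from typing import Optional


def recommend_architecture(
    events_per_month: int, needs_real_time: bool, team: str
) -> Optional[str]:
    """Map the three variables to a tier; None means no tier matches cleanly.

    team is one of "small", "sql", or "dedicated_data" (labels assumed here).
    """
    if events_per_month < 1_000_000 and not needs_real_time:
        return "point-to-point with a routing layer"
    if (
        1_000_000 <= events_per_month < 50_000_000
        and not needs_real_time
        and team in ("sql", "dedicated_data")
    ):
        return "warehouse-first with reverse ETL"
    if events_per_month >= 50_000_000 and needs_real_time and team == "dedicated_data":
        return "event bus with a CDP and a warehouse"
    return None


print(recommend_architecture(200_000, False, "small"))
print(recommend_architecture(10_000_000, False, "sql"))
print(recommend_architecture(80_000_000, True, "dedicated_data"))
```

A `None` result, for example low volume with a hard real-time requirement, is a signal to weigh the tradeoffs by hand rather than force a tier.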

The mistake to avoid at every scale: building for a scale you do not have yet. Point-to-point integrations that work fine at 500K users do not survive to 5M — but neither does a team that spent six months building a Kafka-based event bus when they had 50,000 users. Build for the current requirements with a clear migration path, not for the architecture you imagine needing in three years.

FAQ

How do we handle right-to-erasure requests across the pipeline?

Build an erasure pipeline that treats the warehouse as the primary deletion target and uses reverse ETL or direct API calls to propagate the deletion downstream. In practice: maintain a deleted_users table in the warehouse that logs each erasure request with a timestamp. Your dbt models should exclude records from deleted users. A scheduled job reads the deleted_users table and calls the deletion APIs for every downstream tool: HubSpot, Segment, your email platform. Document the propagation latency (some tools process deletion requests asynchronously) and communicate it in your privacy policy.
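A minimal sketch of that pipeline, using SQLite as a stand-in for the warehouse. The `propagated_at` column and the `deleters` callables are assumptions for the example; in a real job each callable would wrap a vendor's deletion API and record per-tool status and retries.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT);
CREATE TABLE deleted_users (
    user_id TEXT PRIMARY KEY, requested_at TEXT, propagated_at TEXT
);
INSERT INTO users VALUES ('u1', 'ada@example.com'), ('u2', 'bob@example.com');
""")


def request_erasure(user_id: str) -> None:
    """Log the request and remove the record from the warehouse table."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits both statements together
        conn.execute(
            "INSERT OR IGNORE INTO deleted_users VALUES (?, ?, NULL)", (user_id, now)
        )
        conn.execute("DELETE FROM users WHERE user_id = ?", (user_id,))


def propagate_erasures(deleters: dict) -> int:
    """Scheduled job: call every downstream deleter for each pending request."""
    pending = conn.execute(
        "SELECT user_id FROM deleted_users WHERE propagated_at IS NULL"
    ).fetchall()
    for (user_id,) in pending:
        for delete_fn in deleters.values():
            delete_fn(user_id)
        with conn:
            conn.execute(
                "UPDATE deleted_users SET propagated_at = ? WHERE user_id = ?",
                (datetime.now(timezone.utc).isoformat(), user_id),
            )
    return len(pending)


calls = []
deleters = {
    "crm": lambda uid: calls.append(("crm", uid)),
    "email_platform": lambda uid: calls.append(("email_platform", uid)),
}
request_erasure("u1")
propagated = propagate_erasures(deleters)
```

The gap between `requested_at` and `propagated_at` is the propagation latency that the privacy policy should describe.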

What is reverse ETL and when do we need it?

Reverse ETL is the pattern of moving data from your data warehouse into operational tools like CRMs, marketing automation platforms, and ad platforms. Tools like Census, Hightouch, and Polytomic provide the infrastructure — you define a SQL model in your warehouse, map warehouse columns to destination tool fields, and the tool handles syncing on a schedule. You need reverse ETL when your warehouse has become the authoritative source of truth for customer data, and you need that data to drive activation (email sends, ad audiences, sales notifications) in downstream tools without maintaining custom sync pipelines for each tool.
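The column-mapping and sync steps can be sketched as follows. The destination field names in `FIELD_MAP` are made up, and the change-detection approach (send only records whose mapped values differ from the last run) is an assumption about how such a sync might be planned, not a description of any specific vendor's behavior.

```python
# Hypothetical mapping: warehouse column -> destination tool field.
FIELD_MAP = {
    "email": "Email",
    "lifetime_value": "LTV__c",
    "order_count": "Orders__c",
}


def to_destination_record(row: dict) -> dict:
    return {dest_field: row[column] for column, dest_field in FIELD_MAP.items()}


def plan_sync(warehouse_rows: list, last_synced: dict) -> list:
    """Return only the records whose mapped values changed since the previous run."""
    changes = []
    for row in warehouse_rows:
        record = to_destination_record(row)
        if last_synced.get(row["user_id"]) != record:
            changes.append((row["user_id"], record))
    return changes


warehouse_rows = [
    {"user_id": "u1", "email": "ada@example.com", "lifetime_value": 100.0, "order_count": 2},
    {"user_id": "u2", "email": "bob@example.com", "lifetime_value": 25.0, "order_count": 1},
]
# State remembered from the previous run: u1 is stale, u2 is unchanged.
last_synced = {
    "u1": {"Email": "ada@example.com", "LTV__c": 40.0, "Orders__c": 1},
    "u2": {"Email": "bob@example.com", "LTV__c": 25.0, "Orders__c": 1},
}
to_send = plan_sync(warehouse_rows, last_synced)
```

Sending only changed records keeps each scheduled run small and helps stay within destination API rate limits.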

Is a CDP still necessary if we have a data warehouse?

For many companies, no. A data warehouse with a well-designed schema and a reverse ETL tool can handle the use cases that CDPs were originally built for — unified customer profiles, audience segmentation, cross-channel analytics. Where CDPs remain differentiated: real-time data collection and routing (a warehouse is batch-oriented), mobile SDK identity resolution (CDPs have better mobile identity stitching), and pre-built destination connections for marketers who want self-service activation without SQL. The question is whether your activation use cases require real-time processing and marketer self-service, or whether batch sync frequency and SQL-based segmentation are acceptable.

How do we quantify the cost of getting identity resolution wrong?

Two places where bad identity resolution creates measurable damage: attribution and re-engagement. In attribution, users who appear as multiple people in your analytics have their conversion credited to multiple sources, inflating the apparent contribution of whatever channel captured each session. In re-engagement, a churned user who returns under a new anonymous ID does not get recognized, so they do not get onboarding flows, do not see their previous account data, and may create a duplicate account. Quantify it by sampling: take 1,000 conversions and manually verify what percentage have consistent identity linkage from first anonymous touch to authenticated conversion. The gap is your identity resolution error rate.
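The sampling calculation itself is simple; the manual verification is the expensive part. A sketch, assuming each sampled conversion has been hand-labeled with a hypothetical `linkage_consistent` flag:

```python
def identity_error_rate(sampled_conversions: list) -> float:
    """Share of sampled conversions whose anonymous-to-authenticated chain is broken."""
    if not sampled_conversions:
        raise ValueError("sample is empty")
    broken = sum(1 for c in sampled_conversions if not c["linkage_consistent"])
    return broken / len(sampled_conversions)


# Illustrative sample only: 870 verified chains and 130 broken ones.
sample = [{"linkage_consistent": True}] * 870 + [{"linkage_consistent": False}] * 130
rate = identity_error_rate(sample)
```

Multiplying the rate by monthly attributed revenue gives a rough estimate of how much revenue is credited on the basis of a broken identity chain.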

What is the operational cost of running Kafka vs using a managed event bus?

Self-managed Kafka on EC2 or bare metal requires active cluster management — partitioning decisions, replication configuration, broker health monitoring, consumer lag alerting, and the occasional topic compaction issue that requires deep Kafka knowledge to debug. Budget for a dedicated engineering resource or significant operational time. Managed Kafka (Confluent Cloud, AWS MSK) significantly reduces the operational burden at the cost of higher per-unit pricing and vendor dependency. For teams without Kafka expertise, AWS Kinesis or Google Pub/Sub are simpler alternatives with less operational overhead, at the cost of some flexibility. The decision comes down to whether your team has Kafka expertise or is willing to develop it — if neither, start with a managed alternative.