Attribution Modeling for Developers: Building and Evaluating Marketing Attribution Systems

Marketing attribution is the question of which touchpoints deserve credit for a conversion. It sounds like a business intelligence question, but it is fundamentally an engineering problem. The right attribution model for a business means nothing if the underlying data is incomplete, the touch sequence is broken, or the model’s query is incorrect.

This guide covers attribution from the engineering side: the data model required, how to implement the common models in SQL, what each model measures (and what it misses), and the data quality problems that make attribution outputs unreliable regardless of model sophistication.

What Attribution Is Actually Measuring

Before implementing any attribution model, understand what attribution can and cannot tell you.

Attribution models distribute credit for conversions across marketing touchpoints based on rules or statistical patterns. What they measure is the observable correlation between touching a marketing channel and converting. What they cannot measure is causality — whether the marketing touch actually caused the conversion.

This is not a trivial distinction. A user who was going to convert anyway, who also happened to see a branded search ad before converting, will have that branded search ad take attribution credit in a last-touch model. The ad did not cause the conversion — it was incidental. But from an attribution perspective, it looks the same as an ad that genuinely influenced a marginal conversion.

Attribution models provide a consistent accounting framework for comparing marketing programs, not a ground truth for causal impact. Using attribution outputs as if they were causal measurement leads to systematically wrong budget allocation decisions.

The Data Infrastructure Attribution Requires

Any attribution model is only as accurate as the touchpoint data it works from. Before implementing a model, audit your touch data quality:

Touch completeness. Are you capturing all the marketing touches that occur before conversion? Missing email clicks, missing organic social touches, or missing offline interactions create a systematically incomplete picture that biases all models toward the channels you can observe.

Identity resolution. Attribution requires linking touches to the same user across sessions, devices, and time. A user who clicks an ad on mobile, researches on desktop, and converts via email is three separate anonymous sessions without proper cross-device identity resolution. As covered in the MarTech data pipeline architecture guide, identity resolution is the hardest problem in MarTech data infrastructure.

Timestamp accuracy. The sequence of touches matters for sequential attribution models. Timestamps from different systems (your CRM, ad platforms, analytics) may differ due to time zone offsets, server time vs. event time discrepancies, and batch sync delays. Normalize all touch timestamps to UTC before building an attribution model.

Conversion definition clarity. What counts as a conversion? Multiple conversion events (lead form, trial start, first purchase, upsell) require separate attribution models — combining them produces attribution results that mean nothing.

The Touch Data Model

A workable attribution data model in SQL looks like this:

-- Touch events table
CREATE TABLE marketing_touches (
    touch_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         TEXT,
    anonymous_id    TEXT,
    session_id      TEXT,
    channel         TEXT NOT NULL,    -- 'email', 'paid_search', 'organic', 'direct'
    source          TEXT,             -- 'google', 'facebook', 'mailchimp'
    campaign        TEXT,
    medium          TEXT,
    content         TEXT,
    touch_type      TEXT NOT NULL,    -- 'impression', 'click', 'page_view'
    touched_at      TIMESTAMPTZ NOT NULL,
    page_url        TEXT,
    utm_source      TEXT,
    utm_medium      TEXT,
    utm_campaign    TEXT
);

-- Conversions table
CREATE TABLE conversions (
    conversion_id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         TEXT NOT NULL,
    conversion_type TEXT NOT NULL,    -- 'trial_start', 'purchase', 'upsell'
    revenue         DECIMAL(10, 2),
    converted_at    TIMESTAMPTZ NOT NULL
);

-- Index for attribution queries (performance critical)
CREATE INDEX idx_touches_user_time ON marketing_touches (user_id, touched_at);
CREATE INDEX idx_conversions_user_time ON conversions (user_id, converted_at);

Implementing Attribution Models in SQL

Last-Touch Attribution

Last-touch gives 100% of conversion credit to the final touch before conversion:

WITH conversion_last_touch AS (
    SELECT
        c.conversion_id,
        c.user_id,
        c.revenue,
        c.converted_at,
        t.channel,
        t.source,
        t.campaign,
        ROW_NUMBER() OVER (
            PARTITION BY c.conversion_id
            ORDER BY t.touched_at DESC
        ) AS touch_rank
    FROM conversions c
    INNER JOIN marketing_touches t
        ON c.user_id = t.user_id
        AND t.touched_at <= c.converted_at
        AND t.touched_at >= c.converted_at - INTERVAL '30 days'  -- Attribution window
    WHERE c.conversion_type = 'purchase'
)
SELECT
    channel,
    source,
    campaign,
    COUNT(*) AS conversions,
    SUM(revenue) AS attributed_revenue,
    ROUND(AVG(revenue), 2) AS avg_revenue_per_conversion
FROM conversion_last_touch
WHERE touch_rank = 1
GROUP BY channel, source, campaign
ORDER BY attributed_revenue DESC;

Linear Attribution

Linear distributes credit equally across all touches in the attribution window:

WITH touch_counts AS (
    SELECT
        c.conversion_id,
        c.user_id,
        c.revenue,
        t.channel,
        t.source,
        t.campaign,
        COUNT(*) OVER (PARTITION BY c.conversion_id) AS total_touches
    FROM conversions c
    INNER JOIN marketing_touches t
        ON c.user_id = t.user_id
        AND t.touched_at <= c.converted_at
        AND t.touched_at >= c.converted_at - INTERVAL '30 days'
    WHERE c.conversion_type = 'purchase'
)
SELECT
    channel,
    source,
    campaign,
    COUNT(DISTINCT conversion_id) AS conversion_touchpoints,
    ROUND(SUM(revenue / total_touches), 2) AS attributed_revenue
FROM touch_counts
GROUP BY channel, source, campaign
ORDER BY attributed_revenue DESC;

Time-Decay Attribution

Time-decay gives more credit to touches that occur closer to the conversion, with a configurable half-life:

WITH touch_decay AS (
    SELECT
        c.conversion_id,
        c.user_id,
        c.revenue,
        t.channel,
        t.source,
        t.campaign,
        t.touched_at,
        c.converted_at,
        -- Decay weight: higher for touches closer to conversion
        -- Half-life of 7 days: weight = 2^(-days_before_conversion/7)
        POWER(2.0, -EXTRACT(EPOCH FROM (c.converted_at - t.touched_at)) / (7 * 86400)) AS decay_weight
    FROM conversions c
    INNER JOIN marketing_touches t
        ON c.user_id = t.user_id
        AND t.touched_at <= c.converted_at
        AND t.touched_at >= c.converted_at - INTERVAL '30 days'
    WHERE c.conversion_type = 'purchase'
),
normalized_decay AS (
    SELECT
        conversion_id,
        revenue,
        channel,
        source,
        campaign,
        decay_weight / SUM(decay_weight) OVER (PARTITION BY conversion_id) AS normalized_weight
    FROM touch_decay
)
SELECT
    channel,
    source,
    campaign,
    ROUND(SUM(revenue * normalized_weight), 2) AS attributed_revenue
FROM normalized_decay
GROUP BY channel, source, campaign
ORDER BY attributed_revenue DESC;

Data-Driven Attribution: The Statistical Approach

The rule-based models above (last-touch, linear, time-decay) apply arbitrary credit rules. Data-driven attribution attempts to estimate the actual incremental contribution of each channel using statistical methods.

The Shapley Value approach (from cooperative game theory) distributes credit based on each channel’s marginal contribution to the conversion probability:

from itertools import combinations
from collections import defaultdict

def calculate_shapley_values(conversion_paths):
    """
    Calculate Shapley values for marketing channels.
    
    Args:
        conversion_paths: List of (channels_tuple, conversions, total_paths) tuples
    """
    # Calculate conversion rate for each channel combination
    rates = {}
    for channels, conversions, total in conversion_paths:
        rates[frozenset(channels)] = conversions / total if total > 0 else 0
    
    # Get all unique channels
    all_channels = set()
    for channels, _, _ in conversion_paths:
        all_channels.update(channels)
    
    shapley_values = defaultdict(float)
    n = len(all_channels)
    
    for channel in all_channels:
        other_channels = all_channels - {channel}
        
        for size in range(len(other_channels) + 1):
            for subset in combinations(other_channels, size):
                subset_set = frozenset(subset)
                subset_with_channel = frozenset(subset) | {channel}
                
                # Marginal contribution of adding channel to this subset
                v_without = rates.get(subset_set, 0)
                v_with = rates.get(subset_with_channel, 0)
                marginal = v_with - v_without
                
                # Weight by number of subsets of this size
                weight = 1.0 / (n * len(list(combinations(other_channels, size + 1))) + 1) if n > 0 else 1
                shapley_values[channel] += weight * marginal
    
    # Normalize to sum to 1
    total = sum(shapley_values.values())
    if total > 0:
        return {ch: v / total for ch, v in shapley_values.items()}
    return dict(shapley_values)

Data-driven attribution is more accurate than rule-based models but requires significant conversion volume — statistical significance for Shapley values requires thousands of conversions per channel combination. For companies with fewer conversions, linear or time-decay models may produce more stable (though still arbitrary) results.

Attribution Windows: A Critical Decision

The attribution window — how far back to look for touches before a conversion — determines what gets credit. Common windows:

7 days — standard for e-commerce with short purchase cycles
30 days — standard for SaaS trials and B2C subscriptions
90 days — appropriate for B2B software with longer sales cycles
Custom by conversion type — trial starts might use 14-day windows; enterprise contracts 180-day windows

Longer windows attribute credit to channels that may have influenced awareness weeks or months ago. Shorter windows attribute credit primarily to performance channels that appear at the bottom of the funnel. The right window matches your actual sales cycle length, which you can estimate from the distribution of days between first touch and conversion.

Frequently Asked Questions

Why do our attribution numbers differ from platform-reported conversions?

Ad platforms (Google Ads, Meta) report conversions using their own attribution models, applied to the touchpoints visible within their platform. They do not see touches from other channels, and they apply their own cross-device models. Your internal attribution model applies a single consistent model across all channels. The numbers will never match exactly — use your internal model for cross-channel comparison and the platform models for platform-specific optimization.

How do we handle view-through attribution (impression attribution) in our model?

Impression data from ad platforms is separate from click data and must be imported separately. For view-through attribution, add impression events to the marketing_touches table with touch_type = 'impression' and include or exclude them based on your model’s design. View-through attribution significantly inflates display advertising’s apparent contribution — use it carefully and measure the lift it provides to display spend against control groups.

What is the minimum conversion volume needed for reliable attribution?

Rule-based models (last-touch, linear, time-decay) do not have a statistical minimum — they apply rules regardless of volume. Data-driven models require at minimum 1,000 conversions per channel combination being modeled for statistically significant Shapley values. For most companies without enterprise conversion volume, the choice of rule-based model matters less than the quality of the underlying touch data.

Should we attribute based on clicks only, or include page views and impressions?

It depends on your measurement objective. Click-only attribution measures channels that drove navigational traffic. Including page views adds sessions from organic search, direct, and email that are typically under-represented in ad platform reporting. Including impressions adds display and social impression exposure. Build your model based on what marketing behaviors you want to credit, not on what data is easiest to collect.

How do we account for organic search and direct traffic in attribution?

Organic search and direct traffic should appear in your marketing_touches table with channel = 'organic_search' or channel = 'direct'. Populate these from your server-side analytics or UTM parameter parsing on landing page URLs. Sessions without a referring URL or UTM parameters are typically classified as direct. This requires server-side session tracking with proper referrer capture — not just ad platform data.