BigQuery for Marketing Analytics: Setting Up a Production-Grade Data Warehouse

BigQuery has become the default analytics data warehouse for companies implementing a serious MarTech data infrastructure. Its serverless model, native GA4 integration, tight ecosystem integration with Looker and other BI tools, and consumption-based pricing make it the easiest starting point for teams moving from spreadsheet-based marketing analytics to SQL-based analysis.

The setup is not complicated. The decisions that matter — schema design, partitioning strategy, access control, and cost management — are where most implementations go wrong. This guide covers the setup that produces a production-grade marketing analytics warehouse rather than an expensive cloud spreadsheet.

Why BigQuery for Marketing Analytics

The case for BigQuery over other warehouse options (Snowflake, Redshift, Databricks) for marketing analytics teams:

Native GA4 export. GA4 has a first-party BigQuery export that streams daily event data directly into your BigQuery project at no additional cost beyond storage. This is the single most useful native integration in the marketing data stack — it eliminates the ETL pipeline you would otherwise need to build for web analytics data.

Scale and pricing model. BigQuery charges for query processing (per TB scanned) rather than per-cluster-hour, which means idle time is free. For marketing analytics workloads with periodic heavy queries and long idle periods, this produces significantly lower costs than always-on cluster architectures.

Google ecosystem integration. Looker, Data Studio/Looker Studio, and Google Ads all have native BigQuery connectors. For teams already in the Google marketing ecosystem, the integration overhead is lower.

SQL familiarity. BigQuery uses standard SQL with significant extensions. Most marketing data analysts comfortable with SQL can query BigQuery without significant relearning.

Project Structure and Dataset Organization

Before ingesting any data, establish a logical project and dataset structure. Retrofitting structure onto an existing warehouse is expensive.

Recommended structure:

project: company-analytics
├── dataset: raw_ga4            # GA4 export data (managed by Google)
├── dataset: raw_events         # Server-side event data from your collection pipeline
├── dataset: raw_crm            # CRM data imports (HubSpot, Salesforce)
├── dataset: raw_ad_platforms   # Advertising platform data (Google Ads, Meta)
├── dataset: staging            # dbt staging models (intermediate transformations)
├── dataset: marts              # dbt mart models (business-ready tables)
│   ├── mart_web_analytics
│   ├── mart_customer_lifecycle
│   └── mart_attribution
└── dataset: sandbox            # Analyst workspace, temp tables

Separating raw data from transformed data is critical. Raw data should never be modified — it is the source of truth. Transformations occur in staging and mart layers, applied by a transformation tool (dbt is standard).

Setting Up the GA4 BigQuery Export

GA4’s BigQuery export is configured in the GA4 admin interface: Admin → BigQuery Links → Link → select your BigQuery project and streaming or daily export frequency.

The export creates daily tables in raw_ga4 named events_YYYYMMDD with the schema:

-- Key columns in the GA4 export schema
-- event_date: string in YYYYMMDD format
-- event_timestamp: microseconds since epoch
-- event_name: string
-- event_params: ARRAY<STRUCT<key STRING, value STRUCT<...>>>
-- user_id: string (populated from analytics.setUserId())
-- user_pseudo_id: string (anonymous GA4 client ID)
-- user_properties: ARRAY<STRUCT<...>>
-- geo, device, traffic_source, collected_traffic_source

The GA4 export uses nested/repeated fields (ARRAY<STRUCT>), which requires knowledge of BigQuery’s UNNEST function for most queries:

-- Get all custom events with their parameters for a date range
SELECT
    event_date,
    event_timestamp,
    event_name,
    user_pseudo_id,
    user_id,
    -- Extract specific event parameters
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_url,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS page_title,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS engagement_time_ms,
    -- Traffic source
    traffic_source.source AS session_source,
    traffic_source.medium AS session_medium,
    traffic_source.name AS session_campaign
FROM `project.raw_ga4.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20251231'
    AND event_name NOT IN ('session_start', 'first_visit')  -- Filter pseudo-events
ORDER BY event_timestamp;

The _TABLE_SUFFIX wildcard pattern and the date filter on it are critical for cost control — without the date filter, BigQuery scans all tables in the pattern, which for large GA4 datasets can be expensive.

Schema Design for Marketing Event Data

Beyond the GA4 export, you will ingest events from your own server-side collection, CRM systems, and ad platforms. A consistent schema design enables cross-source analysis:

-- Unified events table for server-side events
CREATE TABLE `company-analytics.raw_events.events` (
    event_id        STRING NOT NULL,
    user_id         STRING,
    anonymous_id    STRING,
    session_id      STRING,
    event_name      STRING NOT NULL,
    event_timestamp TIMESTAMP NOT NULL,
    event_date      DATE NOT NULL,     -- Partition column
    channel         STRING,
    source          STRING,
    campaign        STRING,
    properties      JSON,              -- Flexible storage for event-specific properties
    -- Standard context
    page_url        STRING,
    referrer        STRING,
    user_agent      STRING,
    ip_address      STRING,
    ingested_at     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY event_date
CLUSTER BY event_name, user_id;

Partitioning by date is mandatory for cost management. Every query that filters on event_date scans only the relevant partitions. Without partitioning, a query for last month’s data scans the entire table — which can be terabytes for high-volume sites.

Clustering by event_name and user_id improves performance for the two most common query patterns: “all events of type X” and “all events for user Y”.

Cost Optimization Strategies

BigQuery costs are dominated by query bytes processed. The common cost antipatterns:

No date partitioning. Without partitioning, every query scans the full table. A 2TB events table costs $10 per query (at $5/TB). With daily partitioning and a filter on event_date, the same query scans only the relevant day’s data — typically 1–10GB.

SELECT * queries. BigQuery charges for all columns scanned, even those not used in the result. Select only the columns you need. In columnar storage, selecting 5 columns from a 50-column table scans approximately 10% of the data.

Non-partitioned views joining large tables. Views that join multiple large tables without partition filters can trigger full scans of all joined tables. Always include partition filters in WHERE clauses for partitioned tables, and verify the query execution plan shows partition pruning.

Cost control configuration:

-- Set a maximum bytes billed per query to prevent runaway costs
-- This is set at the project or dataset level in BigQuery settings

-- You can also preview bytes scanned before running with dry runs in client libraries:

from google.cloud import bigquery

client = bigquery.Client()

# Dry run to estimate bytes processed before actually running
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
    SELECT event_name, COUNT(*) as event_count
    FROM `company-analytics.raw_events.events`
    WHERE event_date >= '2025-01-01'
    GROUP BY event_name
"""

job = client.query(query, job_config=job_config)
print(f"Estimated bytes processed: {job.total_bytes_processed / 1e9:.2f} GB")
print(f"Estimated cost: ${job.total_bytes_processed / 1e12 * 5:.4f}")

dbt Transformation Patterns for Marketing Data

dbt (data build tool) is the standard transformation layer for BigQuery marketing analytics. It provides SQL-based transformations with version control, testing, and documentation.

A basic dbt project structure for marketing analytics:

models/
├── staging/
│   ├── stg_ga4_events.sql
│   ├── stg_server_events.sql
│   ├── stg_hubspot_contacts.sql
│   └── stg_google_ads.sql
├── intermediate/
│   ├── int_unified_events.sql
│   └── int_user_sessions.sql
└── marts/
    ├── mart_web_sessions.sql
    ├── mart_conversion_paths.sql
    └── mart_customer_mrr.sql

A staging model that normalizes the GA4 nested schema into a flat table:

-- models/staging/stg_ga4_events.sql
{{
    config(
        materialized='incremental',
        partition_by={
            "field": "event_date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=["event_name", "user_pseudo_id"],
        incremental_strategy='insert_overwrite'
    )
}}

WITH source AS (
    SELECT
        PARSE_DATE('%Y%m%d', event_date) AS event_date,
        TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
        event_name,
        user_pseudo_id,
        user_id,
        -- Unnest key event parameters
        (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_url,
        (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS page_title,
        (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_id') AS ga4_session_id,
        -- Traffic source
        traffic_source.source AS traffic_source,
        traffic_source.medium AS traffic_medium,
        traffic_source.name AS traffic_campaign,
        -- Device and geo
        device.category AS device_category,
        device.operating_system AS os,
        geo.country AS country,
        geo.city AS city
    FROM `{{ source('raw_ga4', 'events_*') }}`
    WHERE
        {% if is_incremental() %}
        _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
        {% else %}
        _TABLE_SUFFIX >= '20240101'
        {% endif %}
)

SELECT * FROM source

Frequently Asked Questions

How do we handle the delay in GA4’s BigQuery export (data is typically 24-48 hours behind)?

GA4’s standard export completes by 09:00 UTC the following day. For yesterday’s data with the streaming export option (available in GA4 360), events appear within minutes. For daily exports, build your dbt models to run after the export completes — typically a 10:00 UTC daily schedule covers the export window reliably. For dashboards that require today’s data in near-real-time, supplement GA4’s export with your server-side event pipeline, which can be configured for hourly or more frequent loads.

What is the recommended BigQuery edition for a startup marketing analytics stack?

The BigQuery sandbox (free, with 10 GB storage and 1 TB queries per month free) is sufficient for initial development. The on-demand pricing tier (pay per query, no commitment) is right for early production use. BigQuery Enterprise editions with reserved capacity make sense at query volumes above ~$2,000/month, where reserved capacity costs less than on-demand pricing.

How do we join GA4 data with our CRM data for a unified customer view?

The join key is typically the GA4 user_id (set via analytics.setUserId() when users authenticate) matched to the user ID in your CRM. This works only for authenticated sessions — pre-authentication GA4 events have no user_id, only user_pseudo_id. For a complete join that covers anonymous sessions, you need cross-device identity resolution linking user_pseudo_id to user_id in your identity graph. Build an int_user_identity model that maps all known IDs for each user.

Should we use materialized views or scheduled queries for dashboard performance?

For frequently-queried aggregations (daily session counts, weekly conversion rates), materialized views are the better choice — BigQuery automatically refreshes them when source data changes and query costs are charged at refresh time, not query time. Scheduled queries are better for complex transformations that require procedural logic or for loading data into non-partitioned tables. For most marketing analytics dashboards, materializing the top 3–5 most expensive queries as materialized views produces a 10–50× performance improvement.

How do we set up row-level access controls so marketing analysts can’t see PII in BigQuery?

BigQuery supports column-level security through policy tags and row-level security through row access policies. For PII (email addresses, phone numbers, names), assign a policy tag to those columns and configure data masking — analysts see the masked version (e.g., truncated email) without needing to see the full value. For row-level restrictions (e.g., analysts can only see their own region’s data), configure row access policies with filters based on the analyst’s identity.