Advanced SQL Analysis in BigQuery
Table of Contents
1. Project Overview
In this project, I utilize Google BigQuery to perform a comprehensive analysis of e-commerce data. The goal is to move beyond simple data retrieval and demonstrate advanced data manipulation techniques, including Common Table Expressions (CTEs), Window Functions, and data cleaning strategies.
The business problem addressed here is calculating Customer Lifetime Value (CLV) and segmenting users based on their purchasing behavior (Recency, Frequency, and Monetary value).
2. Data Cleaning & Preparation
Real-world data is rarely clean. Before joining tables, I use CTEs to handle NULL values, standardize string formats, and cast data types correctly.
WITH CleanCustomers AS (
SELECT
customer_id,
COALESCE(email, 'unknown_email') AS email,
UPPER(TRIM(country)) AS country,
DATE(created_at) AS signup_date
FROM `project.ecommerce.customers`
WHERE customer_id IS NOT NULL
)
Why this matters: Using COALESCE prevents downstream calculation errors, and UPPER(TRIM()) ensures that 'USA ', 'usa', and 'USA' are treated as the same country during aggregation.
3. Joins & RFM Calculation
Next, I join the cleaned customer data with the orders table. I perform an aggregate analysis to calculate the three pillars of RFM:
- Recency: Days since last order.
- Frequency: Total count of orders.
- Monetary: Total sum of spend.
, CustomerStats AS (
SELECT
c.customer_id,
c.country,
COUNT(o.order_id) AS frequency,
SUM(o.total_amount) AS monetary,
MAX(o.order_date) AS last_order_date
FROM CleanCustomers c
INNER JOIN `project.ecommerce.orders` o
ON c.customer_id = o.customer_id
GROUP BY 1, 2
)
4. Advanced Analytics: Window Functions
To provide actionable insights, I rank customers within their specific countries based on their total spend. This allows the marketing team to identify the top 5% of customers per region without running separate queries for each country.
SELECT
customer_id,
country,
monetary,
DATE_DIFF(CURRENT_DATE(), last_order_date, DAY) AS recency_days,
RANK() OVER (
PARTITION BY country
ORDER BY monetary DESC
) AS country_rank
FROM CustomerStats
WHERE monetary > 0
ORDER BY country, country_rank ASC;
Key Insight: By utilizing PARTITION BY country, the ranking resets for every country. This is far more efficient than using WHERE clauses for every single region. This query produces a clean dataset ready for the marketing team to target "High Value" users who haven't purchased in the last 30 days.