Advanced SQL Analysis in BigQuery

1. Project Overview

In this project, I utilize Google BigQuery to perform a comprehensive analysis of e-commerce data. The goal is to move beyond simple data retrieval and demonstrate advanced data manipulation techniques, including Common Table Expressions (CTEs), Window Functions, and data cleaning strategies.

The business problem addressed here is calculating Customer Lifetime Value (CLV) and segmenting users based on their purchasing behavior (Recency, Frequency, and Monetary value).

2. Data Cleaning & Preparation

Real-world data is rarely clean. Before joining tables, I use CTEs to handle NULL values, standardize string formats, and cast data types correctly.

-- Step 1: Clean Customer Data using a CTE
WITH CleanCustomers AS (
  SELECT
    customer_id,
    COALESCE(email, 'unknown_email') AS email,
    UPPER(TRIM(country)) AS country,
    DATE(created_at) AS signup_date
  FROM `project.ecommerce.customers`
  WHERE customer_id IS NOT NULL
)

Why this matters: Using COALESCE prevents downstream calculation errors, and UPPER(TRIM()) ensures that 'USA ', 'usa', and 'USA' are treated as the same country during aggregation.

3. Joins & RFM Calculation

Next, I join the cleaned customer data with the orders table. I perform an aggregate analysis to calculate the three pillars of RFM:

  • Recency: Days since last order.
  • Frequency: Total count of orders.
  • Monetary: Total sum of spend.
-- Step 2: Aggregate Order Data
, CustomerStats AS (
  SELECT
    c.customer_id,
    c.country,
    COUNT(o.order_id) AS frequency,
    SUM(o.total_amount) AS monetary,
    MAX(o.order_date) AS last_order_date
  FROM CleanCustomers c
  INNER JOIN `project.ecommerce.orders` o
    ON c.customer_id = o.customer_id
  GROUP BY 1, 2
)

4. Advanced Analytics: Window Functions

To provide actionable insights, I rank customers within their specific countries based on their total spend. This allows the marketing team to identify the top 5% of customers per region without running separate queries for each country.

-- Step 3: Rank Customers per Country using Window Functions
SELECT
  customer_id,
  country,
  monetary,
  DATE_DIFF(CURRENT_DATE(), last_order_date, DAY) AS recency_days,
  RANK() OVER (
    PARTITION BY country
    ORDER BY monetary DESC
  ) AS country_rank
FROM CustomerStats
WHERE monetary > 0
ORDER BY country, country_rank ASC;

Key Insight: By utilizing PARTITION BY country, the ranking resets for every country. This is far more efficient than using WHERE clauses for every single region. This query produces a clean dataset ready for the marketing team to target "High Value" users who haven't purchased in the last 30 days.

← Return to All Projects