Data Lakes vs. Data Warehouses: How They Differ


In the world of data management, you'll often hear the terms "data lake" and "data warehouse" used interchangeably. While both store data, they serve very different purposes and are built on different principles. Understanding the distinction is crucial to making the right data strategy decisions, so let's explore the key differences and the use cases for each.

What is a Data Lake?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Think of it as a large body of water in its natural state. You can store data in its raw format, without having to first structure the data. This makes data lakes ideal for storing massive amounts of data from various sources, such as social media, IoT devices, and log files.

Use Cases:

  • Big Data Analytics: Data lakes are perfect for running complex queries and analysis on very large datasets.

  • Machine Learning: Data scientists can use the vast amount of raw data in a data lake to train machine learning models.

  • Real-time Data Processing: Data lakes can ingest and process real-time data streams.
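
Storing data "in its raw format" can be sketched with nothing but the standard library. In this toy example, a local directory stands in for object storage such as S3, and events are appended verbatim to date-partitioned JSON Lines files; the `ingest` helper and the `lake/<source>/dt=...` layout are illustrative, not a real lake API:

```python
import json
from datetime import date
from pathlib import Path

def ingest(events, lake_root="lake", source="iot"):
    """Append raw events, as-is, to a date-partitioned JSON Lines file."""
    partition = Path(lake_root) / source / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "events.jsonl"
    with target.open("a") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")  # no schema check on write
    return target

# Heterogeneous payloads coexist in the same file without conflict.
path = ingest([
    {"device": "thermo-1", "temp_c": 21.5},
    {"device": "cam-7", "frame_id": 1042, "motion": True},  # different shape
])
print(f"{sum(1 for _ in path.open())} raw event(s) in {path}")
```

Notice that nothing validates the events on the way in; deciding what they mean is deferred entirely to read time.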

What is a Data Warehouse?

A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Unlike a data lake, a data warehouse primarily stores structured, filtered data that has been processed for a specific purpose. It's like a well-organized library where information is neatly categorized and easy to find.

Use Cases:

  • Business Intelligence and Reporting: Data warehouses are the backbone of BI tools, providing the data for reports, dashboards, and analytics that help businesses make informed decisions.

  • Performance Analysis: Companies can use data warehouses to track key performance indicators (KPIs) and analyze business trends over time.

  • Data Mining: Analysts can use data warehouses to discover patterns and relationships in their data.
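
In practice, these BI use cases mostly translate into aggregate SQL over the warehouse's structured tables. A minimal sketch, using an in-memory SQLite database as a stand-in warehouse and made-up sales figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Q1", 120.0), ("North", "Q2", 150.0),
     ("South", "Q1", 90.0), ("South", "Q2", 110.0)],
)

# A typical KPI query: revenue trend per region across quarters.
kpis = conn.execute(
    "SELECT region, quarter, SUM(revenue) FROM sales "
    "GROUP BY region, quarter ORDER BY region, quarter"
).fetchall()
for region, quarter, revenue in kpis:
    print(f"{region} {quarter}: {revenue}")
conn.close()
```

Because the table is already structured, the dashboard-style query is a one-liner.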

The Key Differences

| Feature        | Data Lake                                       | Data Warehouse                                |
| -------------- | ----------------------------------------------- | --------------------------------------------- |
| Data Structure | Raw, unstructured, and structured               | Processed, structured                         |
| Schema         | Schema-on-read                                  | Schema-on-write                               |
| Flexibility    | Highly flexible                                 | Less flexible                                 |
| Users          | Data scientists, data engineers                 | Business analysts, data analysts              |
| Use Cases      | Big data, machine learning, real-time analytics | Business intelligence, reporting, data mining |

What is Schema-on-Read?

Schema-on-read is a data handling strategy where a structure (or schema) is applied to data only when it is queried or read. This is the approach used by data lakes.

In this model, data is loaded into the system in its raw, native format. There are no requirements for it to fit a predefined structure before being stored. This allows for immense flexibility, as you can store all types of data—structured, semi-structured, and unstructured—together.

When it's time to perform an analysis, the user or the analytics tool defines the schema they need at that moment. The system then parses and formats the relevant data from the lake according to that schema for the query.

This contrasts with the traditional schema-on-write model used by data warehouses. In that approach, a strict schema must be defined before any data is loaded. The data is cleaned, transformed, and structured to fit this schema upon entry. While this ensures data is clean and optimized for specific, known queries, it is also rigid and time-consuming. Any changes to the analysis requirements often necessitate a redesign of the schema itself.
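
That rigidity has a concrete cost. In this sketch (SQLite standing in for the warehouse, with a hypothetical new heart_rate requirement), data for the new field cannot land until the schema itself is altered:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, patient_name TEXT NOT NULL)")

# A new analysis requirement ("track heart rate") means changing the schema first...
conn.execute("ALTER TABLE patients ADD COLUMN heart_rate INTEGER")
# ...and only then can the new field be written.
conn.execute("INSERT INTO patients (patient_name, heart_rate) VALUES ('John Doe', 72)")

row = conn.execute("SELECT patient_name, heart_rate FROM patients").fetchone()
print(row)
conn.close()
```

In a schema-on-read system, the same new field would simply appear in the raw records and be picked up by whichever queries care about it.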

In short, schema-on-read prioritizes flexibility and speed of data ingestion, making it ideal for exploratory analytics, data science, and machine learning on large, diverse datasets.

Python Code Examples: Schema-on-Write vs. Schema-on-Read

Here are two Python scripts that demonstrate the difference between the schema-on-write and schema-on-read approaches using a patient health data example.


1. Schema-on-Write: The Data Warehouse Approach

This approach defines a strict structure first and then fits the data into it. It prioritizes consistency and query performance.

# schema_on_write_example.py
# Author: Roberto

import sqlite3
import json

# --- 1. Define Schema First (Schema-on-Write) ---
# In a real data warehouse, this would be a table definition.
# Here, we create a SQLite table to enforce a strict schema.

def create_patient_table(cursor):
    """Defines and creates the structured patient table."""
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS patients (
        id INTEGER PRIMARY KEY,
        patient_name TEXT NOT NULL,
        age INTEGER,
        blood_pressure_systolic INTEGER,
        blood_pressure_diastolic INTEGER
    )
    """)
    print("Schema-on-Write: `patients` table schema created.")

# --- 2. ETL Process: Extract, Transform, Load ---
# Data must be cleaned and transformed to fit the schema before writing.

def etl_and_load_data(cursor, raw_data):
    """Transforms raw data and loads it into the structured table."""
    print("\nStarting ETL process...")
    for record in raw_data:
        # Transform: Extract and clean data to fit the schema
        try:
            patient_name = record.get("name")
            age = record.get("personal_info", {}).get("age")
            bp = record.get("vitals", {}).get("blood_pressure")
            systolic, diastolic = (bp.split("/") if bp else (None, None))

            # Load: Insert the clean, structured data
            if patient_name and age and systolic and diastolic:
                cursor.execute("""
                INSERT INTO patients (patient_name, age, blood_pressure_systolic, blood_pressure_diastolic)
                VALUES (?, ?, ?, ?)
                """, (patient_name, age, int(systolic), int(diastolic)))
                print(f"- Loaded record for {patient_name}")
            else:
                print(f"- Skipped record due to missing data: {record}")
        except (AttributeError, ValueError, TypeError) as e:
            print(f"- Skipped record due to error: {e} in {record}")

# --- 3. Query Structured Data ---
# Queries are fast because the data is already organized.

def analyze_blood_pressure(cursor):
    """Analyzes the clean data in the warehouse."""
    print("\nQuerying the data warehouse...")
    cursor.execute("SELECT patient_name, blood_pressure_systolic FROM patients WHERE blood_pressure_systolic > 130")
    high_bp_patients = cursor.fetchall()
    print("Patients with high systolic blood pressure (>130):")
    for patient in high_bp_patients:
        print(f"- {patient[0]} (Systolic: {patient[1]})")

if __name__ == "__main__":
    # Raw data with inconsistent formats
    raw_health_records = [
        {"id": 1, "name": "John Doe", "personal_info": {"age": 45}, "vitals": {"blood_pressure": "140/90"}},
        {"id": 2, "name": "Jane Smith", "personal_info": {"age": 35}, "vitals": {"blood_pressure": "120/80"}},
        {"id": 3, "name": "Peter Jones", "vitals": {"blood_pressure": "135/85"}}, # Missing age
        {"id": 4, "name": "Mary Brown", "personal_info": {"age": 55}, "vitals": {}} # Missing vitals
    ]

    # Setup in-memory SQLite database
    conn = sqlite3.connect(":memory:")
    cursor = conn.cursor()

    # Run the schema-on-write workflow
    create_patient_table(cursor)
    etl_and_load_data(cursor, raw_health_records)
    analyze_blood_pressure(cursor)

    conn.close()

Output:

Schema-on-Write: `patients` table schema created.

Starting ETL process...
- Loaded record for John Doe
- Loaded record for Jane Smith
- Skipped record due to missing data: {'id': 3, 'name': 'Peter Jones', 'vitals': {'blood_pressure': '135/85'}}
- Skipped record due to missing data: {'id': 4, 'name': 'Mary Brown', 'personal_info': {'age': 55}, 'vitals': {}}

Querying the data warehouse...
Patients with high systolic blood pressure (>130):
- John Doe (Systolic: 140)

The Code Explained

  1. Define Schema First (create_patient_table):

    • We start by defining a rigid schema for our patients table using SQL. This is our "schema-on-write."

    • The table has fixed columns: id, patient_name, age, blood_pressure_systolic, and blood_pressure_diastolic.

    • Any data loaded into this table must conform to these columns and their data types (e.g., age must be an integer).

  2. ETL Process (etl_and_load_data):

    • Before data can be stored, it goes through an Extract, Transform, Load (ETL) process.

    • The code iterates through the raw, messy records.

    • Transform: It tries to extract the required fields (name, age, blood_pressure). It even has to split the blood_pressure string into two separate integer columns (systolic and diastolic).

    • Load: Only if the data can be successfully transformed to fit the schema is it inserted into the database table. Records with missing or malformed data are skipped.

  3. Query Structured Data (analyze_blood_pressure):

    • Querying is straightforward and fast because the data is already clean, structured, and indexed in the table.

    • The query SELECT ... WHERE blood_pressure_systolic > 130 is efficient because it operates on a predictable, optimized structure.

The Schema

The schema is the rigid structure defined by the CREATE TABLE statement. It is enforced before any data is written to the database.

CREATE TABLE patients (
    id INTEGER PRIMARY KEY,
    patient_name TEXT NOT NULL,
    age INTEGER,
    blood_pressure_systolic INTEGER,
    blood_pressure_diastolic INTEGER
)
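
Because the schema is enforced on write, a row that violates it is rejected before it is ever stored. A minimal sketch of that enforcement, reusing the table definition above in an in-memory SQLite database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE patients (
    id INTEGER PRIMARY KEY,
    patient_name TEXT NOT NULL,
    age INTEGER,
    blood_pressure_systolic INTEGER,
    blood_pressure_diastolic INTEGER
)
""")

rejected = None
try:
    # No patient_name: the write is refused, not silently stored.
    conn.execute("INSERT INTO patients (age) VALUES (45)")
except sqlite3.IntegrityError as e:
    rejected = str(e)
    print(f"Rejected on write: {rejected}")  # e.g. "NOT NULL constraint failed: patients.patient_name"

conn.close()
```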

2. Schema-on-Read: The Data Lake Approach

This approach stores raw data first and applies structure only when it's time to analyze it. It prioritizes flexibility and speed of data ingestion.

# schema_on_read_example.py
# Author: Roberto

import json

# --- 1. Load Raw Data into Data Lake ---
# Data is stored in its raw, native format (here, as JSON strings).
# No schema is enforced on write.

def load_raw_data_to_lake(raw_data):
    """Simulates loading raw JSON data into a data lake."""
    print("Schema-on-Read: Loading raw data into the lake without transformation.")
    # In a real data lake, this would be saving files to a system like S3 or HDFS.
    # Here, we just keep it in a list.
    data_lake = [json.dumps(record) for record in raw_data]
    print(f"- {len(data_lake)} records loaded.")
    return data_lake

# --- 2. Query Data by Applying Schema-on-Read ---
# The schema is defined in the code that reads and processes the data.

def query_for_high_blood_pressure(data_lake):
    """Applies a schema at read time to find patients with high blood pressure."""
    print("\nQuerying the data lake with Schema-on-Read...")
    high_bp_patients = []

    # The "schema" is defined here, in the logic of the query.
    for raw_record_str in data_lake:
        # Parse the raw data
        record = json.loads(raw_record_str)

        # Apply schema: try to extract the fields we need.
        # This is flexible and can handle missing data gracefully.
        try:
            patient_name = record.get("name")
            bp_str = record.get("vitals", {}).get("blood_pressure")

            if patient_name and bp_str:
                systolic = int(bp_str.split("/")[0])
                if systolic > 130:
                    # The structure of the result is also defined at read time.
                    high_bp_patients.append({
                        "name": patient_name,
                        "systolic_bp": systolic
                    })
            else:
                # Record doesn't match this query's read-time schema; skip it.
                print(f"- Could not process record, skipping: {record}")
        except (AttributeError, ValueError, TypeError, IndexError):
            # Gracefully skip records whose fields have an unexpected format.
            print(f"- Could not process record, skipping: {record}")
            continue

    print("\nPatients with high systolic blood pressure (>130):")
    for patient in high_bp_patients:
        print(f"- {patient['name']} (Systolic: {patient['systolic_bp']})")

if __name__ == "__main__":
    # Raw data with inconsistent formats and new, unexpected fields
    raw_health_records = [
        {"id": 1, "name": "John Doe", "personal_info": {"age": 45}, "vitals": {"blood_pressure": "140/90"}},
        {"id": 2, "name": "Jane Smith", "personal_info": {"age": 35}, "vitals": {"blood_pressure": "120/80"}},
        {"id": 3, "name": "Peter Jones", "vitals": {"blood_pressure": "135/85"}}, # Missing personal_info
        {"id": 4, "name": "Mary Brown", "personal_info": {"age": 55}, "vitals": {}}, # Missing blood_pressure
        {"id": 5, "name": "Dr. Eve", "wearable_data": {"steps": 10000, "heart_rate": 72}} # Completely different structure
    ]

    # Run the schema-on-read workflow
    data_lake_storage = load_raw_data_to_lake(raw_health_records)
    query_for_high_blood_pressure(data_lake_storage)

The Code Explained

  1. Load Raw Data (load_raw_data_to_lake):

    • The raw data (in this case, a list of JSON objects) is loaded directly into our simulated "data lake" (data_lake_storage).

    • There is no upfront transformation or validation. The data is stored as-is, including records with missing fields or completely different structures (like the wearable_data record).

  2. Query by Applying Schema-on-Read (query_for_high_blood_pressure):

    • This is where the "schema-on-read" happens. The schema is not in a database table definition but is implicitly defined within the query logic itself.

    • The code iterates through the raw JSON strings in the lake.

    • Apply Schema: For each record, it tries to parse it and extract the name and blood_pressure fields. It uses .get() and try-except blocks to handle cases where fields might be missing or have the wrong format. This makes the query resilient to messy, inconsistent data.

    • The logic defines what it wants to see at that moment: a name and a blood_pressure value. It doesn't care about other fields like age or wearable_data for this specific query.

Output:

Schema-on-Read: Loading raw data into the lake without transformation.
- 5 records loaded.

Querying the data lake with Schema-on-Read...
- Could not process record, skipping: {'id': 4, 'name': 'Mary Brown', 'personal_info': {'age': 55}, 'vitals': {}}
- Could not process record, skipping: {'id': 5, 'name': 'Dr. Eve', 'wearable_data': {'steps': 10000, 'heart_rate': 72}}

Patients with high systolic blood pressure (>130):
- John Doe (Systolic: 140)
- Peter Jones (Systolic: 135)

The Schema Being Read

In this case, the "schema" is not a formal structure but the pattern of data access defined by the Python code. The schema being "read" is:

  • "I expect a JSON object."

  • "I expect a key named name with a string value."

  • "I expect a key named vitals which is a dictionary."

  • "Inside vitals, I expect a key named blood_pressure with a string value in the format 'systolic/diastolic'."

The code applies this "schema" to each record as it reads it. If a record doesn't fit this pattern, it's simply skipped for this particular query, but it remains in the data lake, available for other queries that might use a different schema (e.g., a query looking for wearable_data).
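
To make that last point concrete, here is a sketch of a second query over the same kind of records, applying a completely different read-time schema (the wearable_data shape from the example above). The lake is re-declared inline so the snippet runs on its own:

```python
import json

# Two records from the simulated lake: one matches the blood-pressure
# schema, one matches the wearable schema. Both live side by side.
data_lake = [
    json.dumps({"id": 1, "name": "John Doe", "vitals": {"blood_pressure": "140/90"}}),
    json.dumps({"id": 5, "name": "Dr. Eve", "wearable_data": {"steps": 10000, "heart_rate": 72}}),
]

# A second, unrelated "schema": expect a wearable_data dict with a steps count.
step_counts = []
for raw in data_lake:
    record = json.loads(raw)
    wearable = record.get("wearable_data")
    if wearable and "steps" in wearable:
        step_counts.append((record.get("name"), wearable["steps"]))

for name, steps in step_counts:
    print(f"- {name}: {steps} steps")  # prints "- Dr. Eve: 10000 steps"
```

The blood-pressure query ignored Dr. Eve's record; this one ignores John Doe's. Neither record had to change to serve both questions.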

Conclusion

Choosing between a data lake and a data warehouse depends on your specific needs. If you need to store vast amounts of raw data for deep analysis and machine learning, a data lake is the way to go. If your primary goal is to perform business intelligence and reporting on structured data, a data warehouse is the better choice. In many modern data architectures, you'll find both working together, with the data lake acting as a source for the data warehouse.
