Are you looking for a quick way to get started with Delta Lake, MinIO, and Apache Spark? Whether you're building a proof of concept or setting up a development environment, this guide will help you get a fully functional data lakehouse running in minutes. We'll walk through a Docker-based setup that combines Delta Lake's ACID guarantees, MinIO's S3-compatible storage, and Spark's powerful processing capabilities – all orchestrated with Apache Hive Metastore for seamless table management.
Architecture Overview
Our setup consists of several key components:
Apache Spark with Delta Lake for data processing and storage
MinIO as S3-compatible object storage
Apache Hive Metastore for managing table metadata
PostgreSQL as the backend database for Hive Metastore
Prerequisites
Docker and Docker Compose
Basic understanding of Apache Spark and SQL
Familiarity with Python programming
Quick Start
About the Spark Delta Image
Our custom Spark Delta image is built on top of the official Spark 3.5.0 image and includes everything you need for a local Delta Lake development environment. Here's what's inside:
Base Image and Core Components:
Based on spark:3.5.0-scala2.12-java11-python3-ubuntu
Python 3 with PySpark support
Java 11 runtime
Scala 2.12
Key Features:
Pre-installed Delta Lake dependencies
Configured S3A connector for MinIO
Hadoop AWS libraries for S3 compatibility
Built-in Hive configuration
Optimized for local development
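If you want to confirm what the image ships with, you can run a short check inside the spark container once it is up. The snippet below is a minimal sketch (the script name and path are hypothetical; drop it into ./scripts so it is mounted into the container) that prints the bundled Python and PySpark versions and verifies that the delta-spark package is importable:

# check_image.py (hypothetical): quick sanity check of the image contents.
# Run with: docker compose exec spark python3 /opt/spark/work-dir/scripts/check_image.py
import sys

import pyspark

print("Python:", sys.version.split()[0])   # expect a Python 3.x version
print("PySpark:", pyspark.__version__)     # expect 3.5.0, matching the base image

try:
    import delta  # provided by the pre-installed Delta Lake dependencies
    print("delta-spark importable:", True)
except ImportError:
    print("delta-spark importable:", False)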
Implementation
1. Docker Compose Configuration
First, let's look at our docker-compose.yml configuration that sets up all required services:
docker-compose.yml
version: '3.8'

services:
  spark:
    image: howdyhow9/spark-delta-local:v1.0.1
    networks:
      - osds
    volumes:
      - ./data:/opt/spark/work-dir/data
      - ./scripts:/opt/spark/work-dir/scripts
      - ./source_files:/opt/spark/work-dir/source_files
    env_file:
      - .env.spark
    environment:
      - AWS_ACCESS_KEY_ID=minioadmin
      - AWS_SECRET_ACCESS_KEY=minioadmin
      - SPARK_EXTRA_CONF_DIR=/opt/spark/conf
    ports:
      - "4040:4040"  # Spark application UI
      - "7077:7077"  # Spark master
      - "8080:8080"  # Spark master web UI
    command: bash -c "/opt/spark/sbin/start-master.sh && tail -f /dev/null"
    depends_on:
      - hive-metastore
      - minio

  postgres:
    image: postgres:16
    networks:
      - osds
    environment:
      - POSTGRES_HOST_AUTH_METHOD=md5
      - POSTGRES_DB=hive_metastore
      - POSTGRES_USER=hive
      - POSTGRES_PASSWORD=hivepass123
      - PGDATA=/var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432"
    volumes:
      - ./postgres-data:/var/lib/postgresql/data

  hive-metastore:
    image: apache/hive:4.0.0-alpha-2
    networks:
      - osds
    environment:
      - SERVICE_NAME=metastore
      - DB_DRIVER=postgres
      - SERVICE_OPTS=-Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/hive_metastore -Djavax.jdo.option.ConnectionUserName=hive -Djavax.jdo.option.ConnectionPassword=hivepass123
    ports:
      - "9083:9083"
    volumes:
      - ./data/delta/osdp/spark-warehouse:/opt/spark/work-dir/data/delta/osdp/spark-warehouse
    depends_on:
      - postgres

  minio:
    image: minio/minio:latest
    networks:
      - osds
    ports:
      - "9000:9000"  # API
      - "9001:9001"  # Console
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    volumes:
      - minio-data:/data
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 30s
      timeout: 20s
      retries: 3

# Top-level declarations required for the shared network and the named MinIO volume
networks:
  osds:

volumes:
  minio-data:
2. Spark Configuration
Spark Environment Configuration
Before starting the environment, create a .env.spark file to configure Spark's resource allocation and performance settings:
.env.spark
SPARK_MASTER=spark://spark:7077
SPARK_DRIVER_MEMORY=2g
SPARK_EXECUTOR_MEMORY=2g
SPARK_CORES_MAX=2
SPARK_LOCAL_DIRS=/tmp
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_DIR=/tmp
SPARK_DAEMON_MEMORY=1g
SPARK_CONF_DIR=/opt/spark/conf
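These values are passed into the spark container via the env_file entry in docker-compose.yml. If you want to confirm the container actually sees them, a tiny sketch (hypothetical script, run inside the spark container) can print the relevant environment variables:

# env_check.py (hypothetical): prints the Spark settings visible inside the container.
# Run with: docker compose exec spark python3 /opt/spark/work-dir/scripts/env_check.py
import os

KEYS = (
    "SPARK_MASTER",
    "SPARK_DRIVER_MEMORY",
    "SPARK_EXECUTOR_MEMORY",
    "SPARK_CORES_MAX",
    "SPARK_WORKER_CORES",
    "SPARK_WORKER_MEMORY",
)

for key in KEYS:
    print(f"{key}={os.environ.get(key, '<not set>')}")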
The following Python code sets up our Spark session with Delta Lake integration and necessary configurations for MinIO and Hive Metastore:
scripts/spark_config_delta.py
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip


def create_spark_session():
    builder = (
        SparkSession.builder
        .appName("Spark Delta with MinIO and Hive")
        # Delta Lake SQL extension and catalog
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Warehouse location on MinIO
        .config("spark.sql.warehouse.dir", "s3a://delta-lake/warehouse/")
        .config("hive.metastore.warehouse.dir", "s3a://delta-lake/warehouse/")
        # S3A connector pointed at MinIO
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        # Hive Metastore backed by PostgreSQL
        .config("javax.jdo.option.ConnectionURL", "jdbc:postgresql://postgres:5432/hive_metastore")
        .config("javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
        .config("javax.jdo.option.ConnectionUserName", "hive")
        .config("javax.jdo.option.ConnectionPassword", "hivepass123")
        .config("spark.sql.catalogImplementation", "hive")
        .config("datanucleus.schema.autoCreateTables", "true")
        .config("hive.metastore.schema.verification", "false")
        .master("local[2]")
        .enableHiveSupport()
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()
Start all services using Docker Compose:
docker compose up -d
3. Data Ingestion
Upload the sample CSV files referenced below (menu_items.csv, order_details.csv, and data_dictionary.csv) to a MinIO bucket named source-data. You can do this through the MinIO console at http://localhost:9001, or script it as sketched below.
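If you prefer to script this step from your host machine, the sketch below uses boto3 against MinIO's S3 API. It assumes boto3 is installed on the host and that the CSVs sit in the local source_files/ directory; it also creates the delta-lake bucket referenced by spark.sql.warehouse.dir, since that bucket must exist before Spark can write tables to it:

# upload_source_files.py (hypothetical, run on the host): creates the buckets and uploads the CSVs.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO API port from docker-compose.yml
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

existing = {b["Name"] for b in s3.list_buckets().get("Buckets", [])}

# source-data holds the raw CSVs; delta-lake is the Spark/Hive warehouse bucket.
for bucket in ("source-data", "delta-lake"):
    if bucket not in existing:
        s3.create_bucket(Bucket=bucket)

for name in ("menu_items.csv", "order_details.csv", "data_dictionary.csv"):
    s3.upload_file(f"source_files/{name}", "source-data", name)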
Here's a sample implementation for ingesting CSV files into Delta tables:
scripts/basic_spark_delta.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
from delta import *
from spark_config_delta import create_spark_session

spark = create_spark_session()
spark.sql("show databases").show()


def IngestDeltaCSVHeader(iDBSchema, iTable, iFilePath):
    # Read the CSV with a header row, create the schema if needed,
    # and write the data out as a managed Delta table.
    menu_csv = spark.read.option("header", True).csv(iFilePath)
    menu_csv.show()
    spark.sql("create schema if not exists " + iDBSchema)
    menu_csv.write.format("delta").mode("overwrite").saveAsTable(iDBSchema + "." + iTable)


IngestDeltaCSVHeader("restaurant", "menu", "s3a://source-data/menu_items.csv")
IngestDeltaCSVHeader("restaurant", "orders", "s3a://source-data/order_details.csv")
IngestDeltaCSVHeader("restaurant", "db_dictionary", "s3a://source-data/data_dictionary.csv")
Run the script inside the Spark container deployed by Docker Compose:
docker compose exec spark spark-submit /opt/spark/work-dir/scripts/basic_spark_delta.py
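Once the job completes, the three tables are registered in the Hive Metastore and stored as Delta tables in MinIO. As a quick verification, you can reuse the shared session helper to inspect the Delta transaction log and row counts; the script below is a sketch with a hypothetical file name, submitted the same way as the ingestion job:

# verify_ingestion.py (hypothetical): confirms the ingested Delta tables are queryable.
from spark_config_delta import create_spark_session

spark = create_spark_session()

# The tables created by basic_spark_delta.py
spark.sql("SHOW TABLES IN restaurant").show()

# DESCRIBE HISTORY reads the Delta transaction log; version 0 is the initial overwrite.
spark.sql("DESCRIBE HISTORY restaurant.menu").show(truncate=False)

# A simple count confirms the data itself is readable from MinIO.
spark.sql("SELECT COUNT(*) AS row_count FROM restaurant.orders").show()

spark.stop()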
4. Spark SQL Session
Finally, the following script opens a simple interactive SQL prompt so you can query the Delta tables registered in the metastore:
scripts/spark_sql_session_delta.py
from spark_config_delta import create_spark_session

spark = create_spark_session()

print("Spark SQL session started!")
print("\nAvailable databases:")
spark.sql("SHOW DATABASES").show()

# Simple read-eval-print loop: execute each SQL statement and show the result.
while True:
    query = input("\nEnter SQL query (or 'exit' to quit): ")
    if query.lower() == 'exit':
        break
    try:
        spark.sql(query).show(truncate=False)
    except Exception as e:
        print(f"Error executing query: {str(e)}")

spark.stop()
Start the interactive session with:
docker compose exec spark python3 /opt/spark/work-dir/scripts/spark_sql_session_delta.py
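At the prompt you can run any Spark SQL the session supports, including Delta-specific features such as time travel. The sketch below (a hypothetical standalone script; the same query can also be typed straight into the interactive prompt) reads the first committed version of one of the ingested tables:

# time_travel_example.py (hypothetical): query an earlier version of a Delta table.
from spark_config_delta import create_spark_session

spark = create_spark_session()

# VERSION AS OF 0 reads the table as it was at its first commit,
# which here is the initial overwrite performed by basic_spark_delta.py.
spark.sql("SELECT * FROM restaurant.menu VERSION AS OF 0").show(truncate=False)

spark.stop()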
Conclusion
This setup provides a robust foundation for building modern data applications. The combination of Spark, Delta Lake, MinIO, and Hive Metastore offers a powerful, scalable, and maintainable data platform that can handle various data processing needs while maintaining data consistency and providing SQL capabilities.
Remember to adjust configurations based on your specific needs and scale requirements. This setup can be extended with additional services like Airflow for orchestration or Superset for visualization as your needs grow.