Are you looking for a quick way to get started with Delta Lake, MinIO, and Apache Spark? Whether you're building a proof of concept or setting up a development environment, this guide will help you get a fully functional data lakehouse running in minutes. We'll walk through a Docker-based setup that combines Delta Lake's ACID guarantees, MinIO's S3-compatible storage, and Spark's powerful processing capabilities – all orchestrated with Apache Hive Metastore for seamless table management.
Architecture Overview
Our setup consists of several key components:
Apache Spark with Delta Lake for data processing and storage
MinIO as S3-compatible object storage
Apache Hive Metastore for managing table metadata
PostgreSQL as the backend database for Hive Metastore
Prerequisites
Docker and Docker Compose
Basic understanding of Apache Spark and SQL
Familiarity with Python programming
Quick Start
About the Spark Delta Image
Our custom Spark Delta image is built on top of the official Spark 3.5.0 image and includes everything you need for a local Delta Lake development environment. Here's what's inside:
Base Image and Core Components:
Based on spark:3.5.0-scala2.12-java11-python3-ubuntu
Python 3 with PySpark support
Java 11 runtime
Scala 2.12
Key Features:
Pre-installed Delta Lake dependencies
Configured S3A connector for MinIO
Hadoop AWS libraries for S3 compatibility
Built-in Hive configuration
Optimized for local development
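If you want to confirm what the image ships with, you can run a short check inside the spark container once it is up. The snippet below is a minimal sketch (the script name and path are hypothetical; drop it into ./scripts so it is mounted into the container) that prints the bundled Python and PySpark versions and verifies that the delta-spark package is importable:

# check_image.py (hypothetical): quick sanity check of the image contents.
# Run with: docker compose exec spark python3 /opt/spark/work-dir/scripts/check_image.py
import sys

import pyspark

print("Python:", sys.version.split()[0])   # expect a Python 3.x version
print("PySpark:", pyspark.__version__)     # expect 3.5.0, matching the base image

try:
    import delta  # provided by the pre-installed Delta Lake dependencies
    print("delta-spark importable:", True)
except ImportError:
    print("delta-spark importable:", False)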
Implementation
1. Docker Compose Configuration
First, let's look at our docker-compose.yml configuration that sets up all required services:
docker-compose.yml
version: '3.8'

services:
  spark:
    image: howdyhow9/spark-delta-local:v1.0.1
    networks:
      - osds
    volumes:
      - ./data:/opt/spark/work-dir/data
      - ./scripts:/opt/spark/work-dir/scripts
      - ./source_files:/opt/spark/work-dir/source_files
    env_file:
      - .env.spark
    environment:
      - AWS_ACCESS_KEY_ID=minioadmin
      - AWS_SECRET_ACCESS_KEY=minioadmin
      - SPARK_EXTRA_CONF_DIR=/opt/spark/conf
    ports:
      - "4040:4040"  # Spark application UI
      - "7077:7077"  # Spark master
      - "8080:8080"  # Spark master web UI
    command: bash -c "/opt/spark/sbin/start-master.sh && tail -f /dev/null"
    depends_on:
      - hive-metastore
      - minio

  postgres:
    image: postgres:16
    networks:
      - osds
    environment:
      - POSTGRES_HOST_AUTH_METHOD=md5
      - POSTGRES_DB=hive_metastore
      - POSTGRES_USER=hive
      - POSTGRES_PASSWORD=hivepass123
      - PGDATA=/var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432"
    volumes:
      - ./postgres-data:/var/lib/postgresql/data

  hive-metastore:
    image: apache/hive:4.0.0-alpha-2
    networks:
      - osds
    environment:
      - SERVICE_NAME=metastore
      - DB_DRIVER=postgres
      - SERVICE_OPTS=-Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/hive_metastore -Djavax.jdo.option.ConnectionUserName=hive -Djavax.jdo.option.ConnectionPassword=hivepass123
    ports:
      - "9083:9083"
    volumes:
      - ./data/delta/osdp/spark-warehouse:/opt/spark/work-dir/data/delta/osdp/spark-warehouse
    depends_on:
      - postgres

  minio:
    image: minio/minio:latest
    networks:
      - osds
    ports:
      - "9000:9000"  # API
      - "9001:9001"  # Console
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    volumes:
      - minio-data:/data
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 30s
      timeout: 20s
      retries: 3

# Top-level declarations required for the shared network and the named MinIO volume
networks:
  osds:

volumes:
  minio-data:
2. Spark Configuration
Spark Environment Configuration
Before starting the environment, create a .env.spark file to configure Spark's resource allocation and performance settings:
.env.spark
SPARK_MASTER=spark://spark:7077
SPARK_DRIVER_MEMORY=2g
SPARK_EXECUTOR_MEMORY=2g
SPARK_CORES_MAX=2
SPARK_LOCAL_DIRS=/tmp
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_DIR=/tmp
SPARK_DAEMON_MEMORY=1g
SPARK_CONF_DIR=/opt/spark/conf
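These values are passed into the spark container via the env_file entry in docker-compose.yml. If you want to confirm the container actually sees them, a tiny sketch (hypothetical script, run inside the spark container) can print the relevant environment variables:

# env_check.py (hypothetical): prints the Spark settings visible inside the container.
# Run with: docker compose exec spark python3 /opt/spark/work-dir/scripts/env_check.py
import os

KEYS = (
    "SPARK_MASTER",
    "SPARK_DRIVER_MEMORY",
    "SPARK_EXECUTOR_MEMORY",
    "SPARK_CORES_MAX",
    "SPARK_WORKER_CORES",
    "SPARK_WORKER_MEMORY",
)

for key in KEYS:
    print(f"{key}={os.environ.get(key, '<not set>')}")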
The following Python code sets up our Spark session with Delta Lake integration and necessary configurations for MinIO and Hive Metastore:
scripts/spark_config_delta.py
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip


def create_spark_session():
    builder = (
        SparkSession.builder
        .appName("Spark Delta with MinIO and Hive")
        # Delta Lake SQL extension and catalog
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Warehouse location on MinIO
        .config("spark.sql.warehouse.dir", "s3a://delta-lake/warehouse/")
        .config("hive.metastore.warehouse.dir", "s3a://delta-lake/warehouse/")
        # S3A connector pointed at MinIO
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        # Hive Metastore backed by PostgreSQL
        .config("javax.jdo.option.ConnectionURL", "jdbc:postgresql://postgres:5432/hive_metastore")
        .config("javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
        .config("javax.jdo.option.ConnectionUserName", "hive")
        .config("javax.jdo.option.ConnectionPassword", "hivepass123")
        .config("spark.sql.catalogImplementation", "hive")
        .config("datanucleus.schema.autoCreateTables", "true")
        .config("hive.metastore.schema.verification", "false")
        .master("local[2]")
        .enableHiveSupport()
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()
Start all services using Docker Compose:
docker compose up -d
3. Data Ingestion
Upload the sample CSV files referenced below (menu_items.csv, order_details.csv, and data_dictionary.csv) to a MinIO bucket named source-data. You can do this through the MinIO console at http://localhost:9001, or script it as sketched below.
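If you prefer to script this step from your host machine, the sketch below uses boto3 against MinIO's S3 API. It assumes boto3 is installed on the host and that the CSVs sit in the local source_files/ directory; it also creates the delta-lake bucket referenced by spark.sql.warehouse.dir, since that bucket must exist before Spark can write tables to it:

# upload_source_files.py (hypothetical, run on the host): creates the buckets and uploads the CSVs.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO API port from docker-compose.yml
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

existing = {b["Name"] for b in s3.list_buckets().get("Buckets", [])}

# source-data holds the raw CSVs; delta-lake is the Spark/Hive warehouse bucket.
for bucket in ("source-data", "delta-lake"):
    if bucket not in existing:
        s3.create_bucket(Bucket=bucket)

for name in ("menu_items.csv", "order_details.csv", "data_dictionary.csv"):
    s3.upload_file(f"source_files/{name}", "source-data", name)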
Here's a sample implementation for ingesting CSV files into Delta tables:
scripts/basic_spark_delta.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
from delta import *
from spark_config_delta import create_spark_session

spark = create_spark_session()
spark.sql("show databases").show()


def IngestDeltaCSVHeader(iDBSchema, iTable, iFilePath):
    # Read the CSV with a header row, create the schema if needed,
    # and write the data out as a managed Delta table.
    menu_csv = spark.read.option("header", True).csv(iFilePath)
    menu_csv.show()
    spark.sql("create schema if not exists " + iDBSchema)
    menu_csv.write.format("delta").mode("overwrite").saveAsTable(iDBSchema + "." + iTable)


IngestDeltaCSVHeader("restaurant", "menu", "s3a://source-data/menu_items.csv")
IngestDeltaCSVHeader("restaurant", "orders", "s3a://source-data/order_details.csv")
IngestDeltaCSVHeader("restaurant", "db_dictionary", "s3a://source-data/data_dictionary.csv")
Run the script inside the Spark container deployed by Docker Compose:
docker compose exec spark spark-submit /opt/spark/work-dir/scripts/basic_spark_delta.py
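Once the job completes, the three tables are registered in the Hive Metastore and stored as Delta tables in MinIO. As a quick verification, you can reuse the shared session helper to inspect the Delta transaction log and row counts; the script below is a sketch with a hypothetical file name, submitted the same way as the ingestion job:

# verify_ingestion.py (hypothetical): confirms the ingested Delta tables are queryable.
from spark_config_delta import create_spark_session

spark = create_spark_session()

# The tables created by basic_spark_delta.py
spark.sql("SHOW TABLES IN restaurant").show()

# DESCRIBE HISTORY reads the Delta transaction log; version 0 is the initial overwrite.
spark.sql("DESCRIBE HISTORY restaurant.menu").show(truncate=False)

# A simple count confirms the data itself is readable from MinIO.
spark.sql("SELECT COUNT(*) AS row_count FROM restaurant.orders").show()

spark.stop()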
4. Spark SQL Session
Finally, the following script opens a simple interactive SQL prompt so you can query the Delta tables registered in the metastore:
scripts/spark_sql_session_delta.py
from spark_config_delta import create_spark_session

spark = create_spark_session()

print("Spark SQL session started!")
print("\nAvailable databases:")
spark.sql("SHOW DATABASES").show()

# Simple read-eval-print loop: execute each SQL statement and show the result.
while True:
    query = input("\nEnter SQL query (or 'exit' to quit): ")
    if query.lower() == 'exit':
        break
    try:
        spark.sql(query).show(truncate=False)
    except Exception as e:
        print(f"Error executing query: {str(e)}")

spark.stop()
Start the interactive session with:
docker compose exec spark python3 /opt/spark/work-dir/scripts/spark_sql_session_delta.py
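At the prompt you can run any Spark SQL the session supports, including Delta-specific features such as time travel. The sketch below (a hypothetical standalone script; the same query can also be typed straight into the interactive prompt) reads the first committed version of one of the ingested tables:

# time_travel_example.py (hypothetical): query an earlier version of a Delta table.
from spark_config_delta import create_spark_session

spark = create_spark_session()

# VERSION AS OF 0 reads the table as it was at its first commit,
# which here is the initial overwrite performed by basic_spark_delta.py.
spark.sql("SELECT * FROM restaurant.menu VERSION AS OF 0").show(truncate=False)

spark.stop()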
Conclusion
This setup provides a robust foundation for building modern data applications. The combination of Spark, Delta Lake, MinIO, and Hive Metastore offers a powerful, scalable, and maintainable data platform that can handle various data processing needs while maintaining data consistency and providing SQL capabilities.
Remember to adjust configurations based on your specific needs and scale requirements. This setup can be extended with additional services like Airflow for orchestration or Superset for visualization as your needs grow.