This project demonstrates Apache Spark concepts from basic to advanced, integrated with Spring Boot.
## Project Structure

```
src/main/java/com/sparklearning/
├── basic/          # Basic Spark concepts
├── intermediate/   # Intermediate operations
├── advanced/       # Advanced features
├── config/         # Spring configuration
└── controller/     # REST endpoints
```
## Prerequisites

- Java 17+
- Maven 3.6+
- Apache Spark 3.5.0
## Getting Started

```bash
mvn clean install
mvn spring-boot:run
```

Then try an endpoint, e.g. `http://localhost:8080/api/spark/basic/transformations`.
## Learning Modules

### 1. RDD Operations (`BasicRDDOperations.java`)
- Creating RDDs
- Transformations (map, filter, flatMap)
- Actions (collect, count, reduce)
- Word Count example
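The word-count example follows the classic flatMap → map → reduce shape. As a rough, cluster-free analogy (plain Java Streams, not the Spark API — `WordCountSketch` is an illustrative name, not a class in this project), the same logic looks like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // like rdd.flatMap: split each line into words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // like mapToPair(w -> (w, 1)).reduceByKey(Integer::sum)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(List.of("to be or not to be"));
        System.out.println(counts.get("to")); // prints 2
    }
}
```

The Spark version distributes the same three steps across partitions; only the `reduceByKey` stage needs a shuffle.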
### 2. DataFrame Operations (`BasicDataFrameOperations.java`)
- Creating DataFrames
- Select, filter, groupBy
- SQL queries
### 3. Data Sources (`DataSourceOperations.java`)
- Reading CSV, JSON, Parquet
- JDBC connections
- Writing data with partitioning
### 4. Advanced Transformations (`IntermediateTransformations.java`)
- Window functions
- Joins (inner, outer, left, right)
- Aggregations
- User Defined Functions (UDFs)
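A window function computes a value per row over a "window" of related rows, e.g. `rank() over (partition by dept order by salary desc)`. A cluster-free sketch of that partition-sort-rank mechanic in plain Java (the `Emp` record and class name are made up for illustration):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WindowSketch {
    record Emp(String dept, String name, int salary) {}

    // Rank each employee within their department by descending salary
    // (row_number-style: 1, 2, 3 ... per partition).
    public static Map<String, Integer> rankByDept(List<Emp> emps) {
        Map<String, Integer> ranks = new HashMap<>();
        // "partition by dept"
        Map<String, List<Emp>> partitions =
                emps.stream().collect(Collectors.groupingBy(Emp::dept));
        for (List<Emp> part : partitions.values()) {
            // "order by salary desc" inside each partition
            part.sort(Comparator.comparingInt(Emp::salary).reversed());
            for (int i = 0; i < part.size(); i++) {
                ranks.put(part.get(i).name(), i + 1);
            }
        }
        return ranks;
    }
}
```

Unlike a groupBy aggregation, every input row keeps its own output row; the window only determines which neighbors it is compared against.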
### 5. Performance Optimization (`AdvancedOptimization.java`)
- Caching and persistence
- Broadcast joins
- Partitioning strategies
- Adaptive Query Execution
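A broadcast join avoids shuffling the large table: the small table is copied whole to every executor, and each partition probes it locally. Conceptually it is a hash-map lookup, which this plain-Java sketch mimics (not Spark's API; row shapes are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {
    // smallLookup plays the role of the broadcast table: in a real cluster it
    // is shipped to every executor once, then probed with no shuffle.
    public static List<String> broadcastJoin(List<int[]> largeRows, // [orderId, productId]
                                             Map<Integer, String> smallLookup) {
        List<String> joined = new ArrayList<>();
        for (int[] row : largeRows) {
            String product = smallLookup.get(row[1]); // local hash lookup
            if (product != null) {                    // inner-join semantics
                joined.add("order " + row[0] + ": " + product);
            }
        }
        return joined;
    }
}
```

This is why the tip "use broadcast for small tables" matters: the alternative, a shuffle join, moves both sides across the network.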
### 6. Machine Learning (`MachineLearningPipeline.java`)
- Feature engineering
- Classification models
- Cross-validation
- Hyperparameter tuning
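The mechanics behind cross-validation are simple: split n rows into k folds, train on k-1 of them, validate on the held-out fold, and rotate. Spark's `CrossValidator` handles this for you; the fold assignment itself can be sketched in a few lines of plain Java (class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class KFoldSketch {
    // Assign each of n row indices to one of k folds, round-robin.
    // Every row lands in exactly one validation fold.
    public static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i);
        return folds;
    }
}
```

With hyperparameter tuning layered on top, each candidate parameter combination is evaluated across all k folds and the best average score wins.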
### 7. Structured Streaming (`AdvancedStreaming.java`)
- Real-time data processing
- Windowing operations
- Stateful streaming
- Stream-stream joins
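Windowing assigns each event to a time bucket based on its timestamp; for a tumbling (non-overlapping) window the bucket is just integer division. A local sketch of that bucketing logic (not the Structured Streaming API, which also handles late data and watermarks):

```java
import java.util.HashMap;
import java.util.Map;

public class TumblingWindowSketch {
    // Count events per tumbling window of windowMillis.
    // Each event timestamp maps to exactly one window start.
    public static Map<Long, Integer> countPerWindow(long[] eventTimes, long windowMillis) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long t : eventTimes) {
            long windowStart = (t / windowMillis) * windowMillis; // bucket start
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }
}
```

Sliding windows generalize this by letting one event fall into several overlapping buckets.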
## REST Endpoints

### Basic operations

```
GET  /api/spark/basic/transformations
GET  /api/spark/basic/actions
POST /api/spark/basic/wordcount
GET  /api/spark/basic/dataframe
```
### Exercise solutions

```
GET  /api/solutions/basic/exercise1
POST /api/solutions/basic/exercise2
GET  /api/solutions/basic/exercise3
GET  /api/solutions/intermediate/exercise4
GET  /api/solutions/intermediate/exercise5
GET  /api/solutions/intermediate/exercise6
GET  /api/solutions/advanced/exercise7
GET  /api/solutions/advanced/exercise9
GET  /api/solutions/projects/log-analytics
GET  /api/solutions/projects/recommendations
GET  /api/solutions/projects/etl-pipeline
```
## Key Concepts

**The three APIs:**
- RDD: low-level API, full control
- DataFrame: high-level API, optimized by the Catalyst query planner
- Dataset: type-safe, combines the best of both

**Transformations vs. actions:**
- Transformations: lazy, only describe the computation (map, filter, join)
- Actions: trigger execution (collect, count, save)
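Lazy evaluation means a transformation only records *what* to compute; nothing runs until an action demands a result. Java Streams behave analogously, which makes for an easy local demonstration (the counter and class name are illustrative, not part of this project):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyEvalSketch {
    // Returns how many times the map function actually executed.
    public static int mapCallCount(boolean runAction) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = List.of(1, 2, 3).stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; }); // "transformation"
        if (runAction) {
            pipeline.toList(); // terminal operation plays the role of an "action"
        }
        return calls.get();
    }
}
```

With `runAction == false` the map function never runs at all: the pipeline exists only as a description. Spark exploits the same property to optimize the whole plan before executing it.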
**Performance tips:**
- Use DataFrames over RDDs
- Cache frequently used data
- Avoid shuffles when possible
- Use broadcast for small tables
- Partition data appropriately
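Partitioning determines which partition holds which keys; Spark's default hash partitioning is essentially a modulo of the key's hash. A minimal sketch of that placement rule (illustrative code, not Spark's `HashPartitioner` class itself):

```java
public class HashPartitionerSketch {
    // Same key always lands in the same partition, so two datasets
    // partitioned the same way can be joined without a shuffle.
    public static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod; // keep the index non-negative
    }
}
```

The non-negative adjustment matters because `hashCode()` can be negative in Java, and a negative modulo would index outside the partition range.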
## Next Steps

- Run the basic examples
- Experiment with your own data
- Try the streaming examples with Kafka
- Build ML pipelines
- Optimize for production
## Resources

- Apache Spark Documentation
- Spark SQL Guide
- MLlib Guide

## File Guide

**Basic level (start here):**
- `BasicRDDOperations.java` - RDD fundamentals, transformations, actions, word count
- `BasicDataFrameOperations.java` - DataFrames, SQL queries, basic operations

**Intermediate level:**
- `DataSourceOperations.java` - Reading/writing CSV, JSON, Parquet, JDBC
- `IntermediateTransformations.java` - Joins, window functions, aggregations, UDFs

**Advanced level:**
- `AdvancedOptimization.java` - Caching, broadcast joins, partitioning, AQE
- `MachineLearningPipeline.java` - MLlib, feature engineering, model training
- `AdvancedStreaming.java` - Structured streaming, windowing, stateful operations

**Spring Boot integration:**
- `SparkConfig.java` - Spark configuration beans
- `SparkController.java` - REST endpoints to test Spark operations
- `application.yml` - Application configuration

**Learning resources:**
- `README.md` - Quick start guide and project overview
- `LEARNING_GUIDE.md` - Comprehensive concepts, best practices, interview questions
- `EXERCISES.md` - Hands-on exercises from basic to advanced

## Learning Path

1. Start with RDD basics (transformations, actions)
2. Move to DataFrames (more optimized)
3. Learn data sources and I/O operations
4. Master joins and aggregations
5. Explore performance optimization
6. Try machine learning pipelines
7. Build streaming applications

Each file is heavily commented with explanations. Work through them sequentially, and use the exercises to practice!
## Sample Data

The `data/` directory contains sample CSV, JSON, and text files for all exercises:

- `employees.csv` - Employee records with salary and department
- `products.csv` - Product catalog
- `customers.json` - Customer information
- `orders.json` - Order transactions
- `sales.json` - Sales data with dates
- `departments.json` - Department details
- `people.json` - People data for age categorization
- `ratings.csv` - User ratings for recommendations
- `transactions.csv` - Transaction data for fraud detection
- `messy_data.csv` - Intentionally messy data for cleaning
- `ml_dataset.csv` - Machine learning training data
- `sample_text.txt` - Text for word count exercises
- `access.log` - Apache web server logs
- `small_lookup.csv` - Small lookup table for broadcast joins
- `large_dataset_sample.csv` - Sample large dataset structure
## Generating Test Data

You can generate larger datasets for performance testing:

```bash
# Generate 1 million rows
curl -X POST "http://localhost:8080/api/data/generate/large?numRows=1000000"

# Generate time series data
curl -X POST "http://localhost:8080/api/data/generate/timeseries?numRows=10000"

# Generate transactions
curl -X POST "http://localhost:8080/api/data/generate/transactions?numRows=50000"

# Generate ML training data
curl -X POST "http://localhost:8080/api/data/generate/ml?numRows=10000"

# Generate web logs
curl -X POST "http://localhost:8080/api/data/generate/logs?numRows=100000"
```

Generated files are saved in the `data/generated/` directory.