08 November 2020

#Apache_Spark

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Spark | Overview, Features, Spark vs Hadoop, Components, Use cases |
| 2 | Spark Architecture | Driver, Executors, Cluster Manager, DAG, SparkContext |
| 3 | RDD Basics | Resilient Distributed Dataset, Creation, Transformations, Actions, Persistence |
| 4 | RDD Advanced | Partitioning, Caching, Checkpointing, Narrow vs Wide transformations, Lineage |
| 5 | DataFrames Basics | Creation, Schema, Columns, show(), printSchema() |
| 6 | DataFrames Operations | Select, Filter, GroupBy, Aggregate, Join |
| 7 | DataFrames Advanced | UDF, Window Functions, Pivot, Explode, Sorting |
| 8 | Spark SQL | SQLContext, SparkSession, Running SQL Queries, Caching tables, tempView |
| 9 | Datasets Basics | Typed API, Creation, Transformations, Actions, Interoperability with DataFrames |
| 10 | Data Loading | CSV, JSON, Parquet, Avro, ORC |
| 11 | Data Writing | saveAsTable, write.parquet, write.json, Overwrite, Append |
| 12 | Spark Streaming Basics | DStream, StreamingContext, Window operations, Input sources, Output operations |
| 13 | Spark Structured Streaming | SparkSession, readStream, writeStream, Triggers, Watermarking |
| 14 | Streaming Advanced | Stateful processing, Aggregations, Joins, Checkpointing, Fault tolerance |
| 15 | Spark MLlib Basics | Machine Learning Pipeline, Transformers, Estimators, Features, Label encoding |
| 16 | MLlib Algorithms | Linear Regression, Logistic Regression, Decision Trees, Random Forest, Clustering |
| 17 | MLlib Feature Engineering | StandardScaler, MinMaxScaler, OneHotEncoder, VectorAssembler, PCA |
| 18 | MLlib Model Evaluation | Train/Test Split, CrossValidation, Metrics (Accuracy, RMSE), ParamGrid, Evaluator |
| 19 | Spark GraphX Basics | Graphs, Vertices, Edges, Pregel, Graph operators |
| 20 | GraphX Advanced | PageRank, Connected Components, Triangle Count, Shortest Path, Graph Algorithms |
| 21 | Spark Performance Tuning | Partitioning, Caching, Serialization, Shuffle, Broadcast variables |
| 22 | Spark Deployment | Standalone Mode, YARN, Mesos, Kubernetes, Cluster configuration |
| 23 | Spark with Python (PySpark) | RDD, DataFrame API, UDF, MLlib, Integration with Pandas |
| 24 | Spark with Scala | RDD API, DataFrame API, Typed Datasets, MLlib, Spark SQL |
| 25 | Spark with Java | RDD API, DataFrame API, Datasets, MLlib, JavaRDD |
| 26 | Spark with R (SparkR) | DataFrames, SparkR SQL, Machine Learning, Integration with RStudio, Visualization |
| 27 | Debugging & Logging | Logs, Spark UI, DAG visualization, Event logs, Executors monitoring |
| 28 | Security in Spark | Authentication, Authorization, SSL, Encryption, Kerberos |
| 29 | Spark Ecosystem | Hive, HBase, Kafka, Cassandra, Hadoop integration |
| 30 | End-to-End Project | Data ingestion, Cleaning, Transformations, Machine Learning, Deployment |

Interview Questions

Basic

  1. What is Apache Spark?
  2. Differences between Spark and Hadoop MapReduce.
  3. What are the main features of Spark?
  4. Explain Spark architecture.
  5. What is an RDD in Spark?
  6. How to create RDDs?
  7. Explain transformations and actions on RDDs.
  8. What is lazy evaluation?
  9. What is a SparkContext?
  10. Explain the driver and executors in Spark.
  11. What is a DAG in Spark?
  12. Difference between narrow and wide transformations.
  13. What is persistence and caching?
  14. How to check the Spark version?
  15. What is a partition?
  16. How to create a DataFrame in Spark?
  17. Difference between RDD and DataFrame.
  18. What is SparkSession?
  19. How to show the schema of a DataFrame?
  20. How to select columns in a DataFrame?
  21. How to filter rows in a DataFrame?
  22. How to perform groupBy operations?
  23. What are Spark actions?
  24. How to handle missing data?
  25. How to sort data in a Spark DataFrame?
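
Questions 7, 8 and 23 above concern transformations, actions, and lazy evaluation. As a rough illustration only, the following is a plain-Python sketch, not Spark's API: `LazyDataset` and its methods are hypothetical names that mimic how an RDD records transformations and runs them only when an action such as `collect()` is called.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not part of Spark)."""

    def __init__(self, source: Iterable[Any], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps = list(steps or [])  # recorded lineage of transformations

    # Transformations: record the step and return a new dataset. Nothing runs yet.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("filter", pred)])

    # Action: replay the recorded steps over the source and return the results.
    def collect(self) -> List[Any]:
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(len(calls))    # 0: the transformations were only recorded
print(ds.collect())  # [6, 8]
print(len(calls))    # 4: the work happened during the action
```

In real Spark the recorded steps form the lineage that the scheduler turns into a DAG of stages; this sketch keeps only the "record now, run on action" behaviour.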

Intermediate

  1. What is a Dataset in Spark?
  2. Difference between DataFrame and Dataset.
  3. Explain Spark SQL.
  4. How to run SQL queries in Spark?
  5. What is a tempView?
  6. How to join DataFrames?
  7. How to perform aggregations?
  8. What is a window function in Spark?
  9. Explain user-defined functions (UDF).
  10. How to register a UDF?
  11. How to read data from CSV?
  12. How to read data from JSON?
  13. How to read data from Parquet?
  14. How to write a DataFrame to a file?
  15. How to cache and persist a DataFrame?
  16. How to monitor Spark jobs?
  17. How to debug Spark applications?
  18. Explain Spark UI.
  19. What is Spark Streaming?
  20. Difference between DStream and Structured Streaming.
  21. How to create a StreamingContext?
  22. Explain window operations in streaming.
  23. How to checkpoint in streaming?
  24. How to handle late data?
  25. Explain triggers in Structured Streaming.
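
Questions 22 and 24 above touch on window operations and late data. The idea behind event-time windows with a watermark can be sketched in plain Python. This is a simplified model, not Spark code: `windowed_counts` and its parameters are hypothetical, and it advances the watermark on every event, whereas Structured Streaming advances it per micro-batch and expires state per window.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time, key)


def windowed_counts(
    events: Iterable[Event],
    window_size: int,
    allowed_lateness: int,
) -> Tuple[Dict[Tuple[int, str], int], List[Event]]:
    """Count events per (tumbling window start, key); drop events behind the watermark."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    dropped: List[Event] = []
    max_event_time = None
    for event_time, key in events:  # events are given in arrival order
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # Watermark: latest event time seen minus the lateness we tolerate.
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, key))  # too late, ignored
            continue
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts), dropped


arrivals = [(1, "a"), (12, "a"), (4, "a"), (25, "a"), (3, "a")]
counts, dropped = windowed_counts(arrivals, window_size=10, allowed_lateness=10)
print(counts)   # {(0, 'a'): 2, (10, 'a'): 1, (20, 'a'): 1}
print(dropped)  # [(3, 'a')]
```

The event at time 4 arrives out of order but is still counted because it is within the allowed lateness; the event at time 3 arrives after the watermark has moved to 15 and is dropped.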

Advanced

  1. Explain Spark MLlib.
  2. Difference between Transformers and Estimators.
  3. Explain Pipelines in Spark ML.
  4. How to split data into training and test sets?
  5. How to evaluate models?
  6. Explain cross-validation in Spark ML.
  7. Explain Linear Regression in Spark ML.
  8. Explain Logistic Regression in Spark ML.
  9. Explain Decision Trees in Spark ML.
  10. Explain Random Forests in Spark ML.
  11. Explain Gradient-Boosted Trees.
  12. How to handle categorical features?
  13. How to normalize features?
  14. Explain feature engineering in Spark ML.
  15. How to use VectorAssembler?
  16. How to use StringIndexer?
  17. Explain OneHotEncoder.
  18. Explain MLlib clustering algorithms.
  19. Explain KMeans in Spark ML.
  20. Explain PCA in Spark ML.
  21. Explain GraphX in Spark.
  22. How to create a graph in GraphX?
  23. Explain Pregel API.
  24. Explain Graph operations: PageRank, Connected Components.
  25. How to debug Spark ML pipelines?
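
Question 2 above asks about the difference between Transformers and Estimators. A minimal plain-Python sketch of that contract follows, assuming a min-max scaler as the example. `MinMaxEstimator` and `MinMaxModel` are hypothetical classes, not Spark ML's, and they work on Python lists rather than DataFrames.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinMaxModel:
    """Transformer: holds already-learned state and only applies it."""

    lo: float
    hi: float

    def transform(self, values: List[float]) -> List[float]:
        span = self.hi - self.lo
        if span == 0:
            # Choice made for this sketch only: a constant column maps to 0.0.
            return [0.0 for _ in values]
        return [(v - self.lo) / span for v in values]


class MinMaxEstimator:
    """Estimator: learns parameters from data in fit() and returns a fitted model."""

    def fit(self, values: List[float]) -> MinMaxModel:
        return MinMaxModel(lo=min(values), hi=max(values))


train = [10.0, 20.0, 30.0]
model = MinMaxEstimator().fit(train)   # learning step
print(model.transform(train))          # [0.0, 0.5, 1.0]
print(model.transform([15.0, 40.0]))   # [0.25, 1.5]
```

The second call shows why the split matters: the model reuses the range learned from the training data, so unseen values can fall outside [0, 1].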

Expert

  1. Explain Spark performance tuning.
  2. How to optimize Spark jobs?
  3. Explain partitioning strategies.
  4. How to reduce shuffle operations?
  5. How to use broadcast variables?
  6. How to use accumulators?
  7. How to handle skewed data?
  8. How to configure a Spark cluster?
  9. Explain Spark deployment modes (Standalone, YARN, Mesos, Kubernetes).
  10. How to integrate Spark with Hadoop?
  11. How to integrate Spark with Hive?
  12. How to integrate Spark with HBase?
  13. How to integrate Spark with Cassandra?
  14. Explain Spark with Python (PySpark).
  15. Explain Spark with Scala.
  16. Explain Spark with Java.
  17. Explain SparkR.
  18. How to implement an end-to-end Spark project?
  19. How to monitor and log Spark jobs?
  20. How to secure a Spark cluster (SSL, Kerberos)?
  21. How to use checkpointing for fault tolerance?
  22. How to implement streaming with Kafka?
  23. How to deploy Spark in the cloud (AWS EMR, Databricks)?
  24. Explain advanced GraphX algorithms.
  25. Best practices for production-ready Spark applications.
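
Question 7 above asks how to handle skewed data. One commonly cited approach is key salting with a two-stage aggregation, sketched here in plain Python rather than Spark. `salted_sum`, `num_salts`, and `seed` are hypothetical names for illustration; in a real job each `(key, salt)` pair would be aggregated on a different partition.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Tuple


def salted_sum(
    records: Iterable[Tuple[str, int]],
    num_salts: int,
    seed: int = 0,
) -> Tuple[Dict[Tuple[str, int], int], Dict[str, int]]:
    """Sum values per key in two stages so one hot key is spread over several groups."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt); a hot key is split into up to num_salts groups.
    partial: Counter = Counter()
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += value

    # Stage 2: drop the salt and combine the much smaller set of partial sums.
    final: Counter = Counter()
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal

    return dict(partial), dict(final)


records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
partial, final = salted_sum(records, num_salts=4)
print(final["hot"], final["cold"])  # 1000 10
```

Salting trades one extra aggregation step for a more even first stage, so it tends to pay off only when a few keys dominate the data.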

Related Topics