08 November 2020

#Apache Spark

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Spark | Overview, Features, Spark vs Hadoop, Components, Use cases |
| 2 | Spark Architecture | Driver, Executors, Cluster Manager, DAG, SparkContext |
| 3 | RDD Basics | Resilient Distributed Dataset, Creation, Transformations, Actions, Persistence |
| 4 | RDD Advanced | Partitioning, Caching, Checkpointing, Narrow vs Wide transformations, Lineage |
| 5 | DataFrames Basics | Creation, Schema, Columns, show(), printSchema() |
| 6 | DataFrames Operations | Select, Filter, GroupBy, Aggregate, Join |
| 7 | DataFrames Advanced | UDF, Window Functions, Pivot, Explode, Sorting |
| 8 | Spark SQL | SQLContext, SparkSession, Running SQL Queries, Caching tables, Temp views |
| 9 | Datasets Basics | Typed API, Creation, Transformations, Actions, Interoperability with DataFrames |
| 10 | Data Loading | CSV, JSON, Parquet, Avro, ORC |
| 11 | Data Writing | saveAsTable, write.parquet, write.json, Overwrite, Append |
| 12 | Spark Streaming Basics | DStream, StreamingContext, Window operations, Input sources, Output operations |
| 13 | Spark Structured Streaming | SparkSession, readStream, writeStream, Triggers, Watermarking |
| 14 | Streaming Advanced | Stateful processing, Aggregations, Joins, Checkpointing, Fault tolerance |
| 15 | Spark MLlib Basics | Machine Learning Pipeline, Transformers, Estimators, Features, Label encoding |
| 16 | MLlib Algorithms | Linear Regression, Logistic Regression, Decision Trees, Random Forest, Clustering |
| 17 | MLlib Feature Engineering | StandardScaler, MinMaxScaler, OneHotEncoder, VectorAssembler, PCA |
| 18 | MLlib Model Evaluation | Train/Test Split, Cross-validation, Metrics (Accuracy, RMSE), ParamGrid, Evaluator |
| 19 | Spark GraphX Basics | Graphs, Vertices, Edges, Pregel, Graph operators |
| 20 | GraphX Advanced | PageRank, Connected Components, Triangle Count, Shortest Path, Graph Algorithms |
| 21 | Spark Performance Tuning | Partitioning, Caching, Serialization, Shuffle, Broadcast variables |
| 22 | Spark Deployment | Standalone Mode, YARN, Mesos, Kubernetes, Cluster configuration |
| 23 | Spark with Python (PySpark) | RDD, DataFrame API, UDF, MLlib, Integration with Pandas |
| 24 | Spark with Scala | RDD API, DataFrame API, Typed Datasets, MLlib, Spark SQL |
| 25 | Spark with Java | RDD API, DataFrame API, Datasets, MLlib, JavaRDD |
| 26 | Spark with R (SparkR) | DataFrames, SparkR SQL, Machine Learning, Integration with RStudio, Visualization |
| 27 | Debugging & Logging | Logs, Spark UI, DAG visualization, Event logs, Executor monitoring |
| 28 | Security in Spark | Authentication, Authorization, SSL, Encryption, Kerberos |
| 29 | Spark Ecosystem | Hive, HBase, Kafka, Cassandra, Hadoop integration |
| 30 | End-to-End Project | Data ingestion, Cleaning, Transformations, Machine Learning, Deployment |

Interview question


Related Topics


Spark-Queries
Spark Core
RDD
DataFrames & Datasets
Spark SQL
Transformations & Actions
Joins & Aggregations
Spark Streaming
Hadoop vs Spark