08 November 2020

#Apache_Spark

What is Piping in Spark?
What is Apache Spark? What are the features of Apache Spark?
What is RDD?
What does DAG refer to in Apache Spark?
What is Client Mode?
What is Cluster Mode?
What are receivers in Apache Spark Streaming?
What is the difference between repartition and coalesce?
What is Repartition?
What is Coalesce?
What are the data formats supported by Spark?
What do you understand by Shuffling in Spark?
What is YARN in Spark?
What is MapReduce?
How does the DAG work in Spark?
What is Spark Streaming and how is it implemented in Spark?
What are Spark Datasets?
What are Spark DataFrames?
What is Executor Memory in Spark?
What are the functions of Spark Core?
What is a worker node?
What is SparkContext?
What is a cluster manager?
What are some of the demerits of using Spark in applications?
What is SchemaRDD in Spark RDD?
What module is used for implementing SQL in Apache Spark?
What are the different persistence levels in Apache Spark?
What are the steps to calculate the executor memory?
What are Sparse Vectors? How are they different from dense vectors?
What API is used for Graph Implementation in Spark?
Explain the working of Spark with the help of its architecture.
Why do we need broadcast variables in Spark?
How is Apache Spark different from MapReduce?
How can the data transfers be minimized while working with Spark?
How are automatic clean-ups triggered in Spark for handling the accumulated metadata?
How is Caching relevant in Spark Streaming?
How can you achieve machine learning in Spark?
Can Apache Spark be used along with Hadoop? If yes, then how?
Differentiate between Spark Datasets, Dataframes and RDDs.
List the types of Deploy Modes in Spark.
Under what scenarios do you use Client and Cluster modes for deployment?
Spark
  • Parallel, distributed computing framework
  • Integrates tightly with Hadoop/HDFS, and can also run standalone or on other cluster managers
  • Provides a powerful API for transforming and manipulating data (a minimal sketch follows this list)
  • Open-source
  • Fast and flexible
  • Great for distributed SQL-like applications
  • Easy to install and use
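As a rough illustration of that API, here is a minimal word-count sketch in Scala; the app name and input path are placeholders, and local[*] simply runs Spark on all local cores:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // local[*] is for running locally; on a cluster the master
        // would come from the cluster manager instead.
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Transformations (flatMap, map, reduceByKey) are lazy;
        // take() is the action that actually triggers the job.
        val counts = sc.textFile("input.txt") // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }
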
Spark Core
  • Spark Core is the heart of Spark and provides its core functionality.
  • It holds the components for task scheduling, fault recovery, interacting with storage systems, and memory management (the caching sketch below touches the memory-management side).
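To make the memory-management point concrete, here is a small caching sketch, assuming an existing SparkContext named sc (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // persist() asks Spark Core to keep the RDD in executor memory, so the
    // second action reuses the cached data instead of recomputing the lineage.
    val nums = sc.textFile("numbers.txt").map(_.trim.toLong) // placeholder path
    nums.persist(StorageLevel.MEMORY_ONLY)

    val total = nums.sum() // first action: computes and caches the RDD
    val max   = nums.max() // second action: served from the cache
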
Spark SQL
  • Spark SQL is built on top of Spark Core. It provides support for structured data.
  • It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
  • It supports JDBC and ODBC connections, which let existing databases, data warehouses, and business intelligence tools talk to Spark.
  • It also supports various data sources such as Hive tables, Parquet, and JSON (see the sketch after this list).
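A minimal Spark SQL sketch, assuming an existing SparkSession named spark; people.json and the output path are placeholders:

    // Read a JSON source into a DataFrame and query it with plain SQL.
    val people = spark.read.json("people.json") // placeholder path
    people.createOrReplaceTempView("people")

    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    // The supported formats work for writing too, e.g. Parquet:
    adults.write.parquet("adults.parquet") // placeholder output path
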
Spark Streaming
  • Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
  • It uses Spark Core's fast scheduling capability to perform streaming analytics.
  • It accepts data in mini-batches and performs RDD transformations on that data (a minimal sketch follows this list).
  • Its design ensures that applications written for streaming data can be reused to analyze batches of historical data with little modification.
  • The log files generated by web servers can be considered as a real-time example of a data stream.
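A minimal sketch of that mini-batch model: counting words arriving on a socket, with the host, port, and batch interval chosen purely for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: one thread for the receiver, at least one for processing.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second mini-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
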
MLlib
  • MLlib is Spark's Machine Learning library and contains various machine learning algorithms.
  • These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis (a small clustering sketch follows this list).
  • It is nine times faster than the disk-based implementation used by Apache Mahout.
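For concreteness, a tiny clustering sketch against MLlib's DataFrame-based API, assuming an existing SparkSession named spark (the data points are made up):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    // Four 2-D points forming two obvious clusters; "features" is the
    // input column name KMeans looks for by default.
    val df = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(1.0, 1.0)),
      Tuple1(Vectors.dense(9.0, 8.0)),
      Tuple1(Vectors.dense(8.0, 9.0))
    )).toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(df)
    model.clusterCenters.foreach(println)
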
GraphX
  • GraphX is a library used to manipulate graphs and perform graph-parallel computations.
  • It lets you create a directed graph with arbitrary properties attached to each vertex and edge.
  • To manipulate graphs, it supports fundamental operators such as subgraph, joinVertices, and aggregateMessages (see the sketch after this list).
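A sketch of those ideas, assuming an existing SparkContext named sc: build a small directed property graph, then use aggregateMessages to compute each vertex's in-degree:

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices carry a name, edges carry a relationship label.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")
    ))
    val graph = Graph(vertices, edges)

    // Send the value 1 along each edge to its destination, then sum per
    // vertex: the result is every vertex's in-degree.
    val inDegrees = graph.aggregateMessages[Int](
      ctx => ctx.sendToDst(1),
      _ + _
    )
    inDegrees.collect().foreach(println)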
