Question
Problem 5. Hadoop and Spark both follow the shared-nothing paradigm, but they do support sharing immutable data structures among all workers (compute nodes) in the cluster. Please enumerate two techniques to achieve this goal and fill them in the following table.
Hadoop: ____________
Spark: ____________
In Spark, Rob needs to create an accumulator with initial value “3.14” of double type. Please tell him how to do that:
Answer:
$
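A minimal sketch in Scala, assuming a live SparkContext named sc and the Spark 2.x+ accumulator API (the accumulator name "piAcc" is illustrative):

    // Create a named double accumulator on the driver, then seed it
    // with the initial value 3.14 (doubleAccumulator starts at 0.0).
    val piAcc = sc.doubleAccumulator("piAcc")
    piAcc.add(3.14)

In the legacy Spark 1.x API this was a single call, sc.accumulator(3.14), but that method was deprecated in Spark 2.0 and later removed.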
Rob wants to increase the value of this accumulator by “1.1” of double type in each executor. Please tell him how to do that.
Answer: In the source code that runs on each executor, add the following two lines:
$
$
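A sketch under the same assumptions as above; someRDD is an illustrative placeholder for whatever RDD the executors are processing:

    // Tasks running on the executors add 1.1 to the shared accumulator.
    someRDD.foreach { _ =>
      piAcc.add(1.1)
    }

Tasks should only add to an accumulator; the meaningful total is read afterwards on the driver via piAcc.value.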
When Rob implements the join algorithm for joining the phone book and the country-code lookup table, each executor needs to be able to access the lookup table. Suppose that the lookup table is stored in an RDD named “LookupTable1”. Please tell Rob how to replicate this RDD's data to all the executors from the driver (master) program.
Answer:
$
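A sketch, assuming LookupTable1 is a pair RDD of (country, code) entries; an RDD itself cannot be broadcast, so its contents are first collected to the driver:

    // Pull the lookup table down to the driver as a local Map,
    // then broadcast that Map once to every executor.
    val lookupBc = sc.broadcast(LookupTable1.collectAsMap())

Broadcasting the collected map ships one read-only copy to each executor instead of re-sending it with every task, which is what a map-side (broadcast) join needs.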
How can each executor get the broadcast data?
Answer:
$
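Continuing the sketch, with phoneBook as an illustrative RDD of (name, country) pairs:

    // Executor-side code reads the broadcast data through .value.
    val joined = phoneBook.map { case (name, country) =>
      (name, lookupBc.value.getOrElse(country, "unknown"))
    }

Each executor dereferences lookupBc.value locally; no further network traffic is needed once the broadcast has been delivered.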
For each executor, the accumulators are ____-only variables; the broadcast variables are ____-only variables. Please choose from “read” or “write” when you fill in the previous two blanks.
Explanation / Answer
Hadoop: Hadoop provides the MapReduce API, which can be used to provide the core functionality for data stored in Hadoop.
Spark: A Resilient Distributed Dataset (RDD) is a collection of immutable objects computed across the nodes of the cluster. There are three ways to create an RDD in Spark (see the sketch after this answer):
1. from data in stable storage;
2. from other RDDs;
3. by parallelizing an already existing collection.
With Hadoop, all records written are immutable because random writes are not supported in Hadoop. This can be a disadvantage, but it scales really well. Some newer languages are also bringing back this concept of immutable objects. Spark RDDs can additionally be cached and partitioned.
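A minimal Scala sketch of the three RDD-creation routes (the path and the sample data are placeholders):

    // 1. From data in stable storage, e.g. a file in HDFS.
    val fromStorage = sc.textFile("hdfs:///data/phonebook.txt")
    // 2. From another RDD, via a transformation.
    val fromOther = fromStorage.map(_.toUpperCase)
    // 3. By parallelizing an already existing local collection.
    val fromCollection = sc.parallelize(Seq(("US", 1), ("UK", 44)))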