Problem 4. Rob designs two algorithms for solving the Word Counting problem. The

Question

Problem 4. Rob designs two algorithms for solving the Word Counting problem. The two algorithms are shown in the following table.

Algorithm A

book = sc.textFile("/home/rob/data/peterpan.txt")
book.count()
book.first()
wordCount = book.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)
wordCount.collect()

Algorithm B

book = sc.textFile("/home/rob/data/peterpan.txt").persist()
book.count()
book.first()
wordCount = book.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)
wordCount.collect()

The only difference between Algorithm A and B is that we add “.persist()” at the end of the first line in Algorithm B. Which one (Algorithm A or B) runs faster and why?

Answer:

Instead of persist(), we can also use cache(). What is the difference between persist() and cache()?

Answer:

In Algorithm A, how many RDDs are there? Give the type of each RDD (standard string RDD or key-value pair RDD) and explain what its elements mean.

Answer:

Explanation / Answer

Solution:

Which one (Algorithm A or B) runs faster and why?

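Algorithm B runs faster, because book is persisted and then reused.

Explanation:

Spark evaluates RDDs lazily and, by default, recomputes an RDD from its lineage every time an action runs on it. Here three actions depend on book: count(), first(), and collect() (through wordCount). In Algorithm A each of them goes back to peterpan.txt. In Algorithm B, persist() marks book for storage, so the first action that computes its partitions (count()) keeps them in executor memory, and the later actions reuse them instead of re-reading the file. Persisting pays off only when an RDD is reused, as it is here.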
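
The effect of reuse can be illustrated without Spark. The following is only a toy sketch in plain Python: LazyDataset, make_loader and the sample lines are made up for illustration and are not part of any Spark API. It counts how often the "file" is loaded when three actions run with and without persist(). Unlike real Spark, this toy first() loads everything rather than a single partition.

```python
class LazyDataset:
    """Toy stand-in for an RDD: reloads from source on every action unless persisted."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # hypothetical function that "reads the file"
        self._persist = False
        self._cached = None

    def persist(self):
        # Like Spark, this only sets a flag; nothing is loaded yet.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached       # reuse the stored data
        data = self._load_fn()        # otherwise go back to the source
        if self._persist:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]

    def collect(self):
        return list(self._materialize())


def make_loader(lines, counter):
    """Return a loader that records how many times it was called."""
    def load():
        counter["reads"] += 1
        return list(lines)
    return load


LINES = ["all children grow up", "except one"]  # made-up sample data

# "Algorithm A": no persist, three actions
reads_a = {"reads": 0}
book_a = LazyDataset(make_loader(LINES, reads_a))
book_a.count()
book_a.first()
book_a.collect()

# "Algorithm B": persist, same three actions
reads_b = {"reads": 0}
book_b = LazyDataset(make_loader(LINES, reads_b)).persist()
book_b.count()
book_b.first()
book_b.collect()

print(reads_a["reads"], reads_b["reads"])  # prints: 3 1
```

Without persist() the source is loaded once per action; with it, only the first action pays for the load.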

Instead of persist(), we can also use cache(). What is the difference between persist() and cache()?

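cache() is shorthand for persist() with the default storage level, which for RDDs is MEMORY_ONLY. Called with no arguments, the two behave the same, so neither is faster than the other. The difference is that persist() also accepts a StorageLevel argument, such as MEMORY_AND_DISK or DISK_ONLY, to control where the data is kept.

With MEMORY_ONLY, if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. In Scala and Java this level stores the RDD as deserialized Java objects in the JVM; in PySpark, as used here, stored objects are always serialized with pickle.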

Please repost part (c) as a separate question.
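
Although part (c) is left unanswered above, the shape of the data at each step of Algorithm A can be traced with plain Python lists. This is only a sketch: the lists stand in for RDDs, the sample lines are made up, and none of the variable names except the idea of book and wordCount come from the question. Each step corresponds to one transformation in the pipeline (textFile, flatMap, map, reduceByKey), and each transformation in Spark yields a new RDD.

```python
from collections import defaultdict

# Stands in for `book`: one string per line of the file.
lines = ["to live would be", "to die would be"]  # made-up sample data

# flatMap(lambda line: line.split(" ")): one string per word.
words = [w for line in lines for w in line.split(" ")]

# map(lambda word: (word, 1)): key-value pairs, key = word, value = 1.
pairs = [(w, 1) for w in words]

# reduceByKey(lambda x, y: x + y): key-value pairs, key = word,
# value = number of occurrences. Stands in for `wordCount`.
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
word_count = sorted(totals.items())

print(words[:4])
print(pairs[:2])
print(word_count)
```

The first two stages hold plain strings (lines, then words); the last two hold (key, value) tuples, first (word, 1) and then (word, total).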

I hope this helps. If you find any problem, please comment below.
