


Question

Problem 1. Rob is installing Spark on Ubuntu 16.04. Please help him with the installation.

Step 1: Since Spark needs Scala, Rob must install Scala first. He downloads Scala (scala-2.12.4.tgz), extracts the archive in the “/home/rob/” directory, and renames the folder “scala”, so “/home/rob/scala” is the root directory for Scala. After that, please tell Rob how to update the environment variables “SCALA_HOME” and “PATH”.

Answer: Open the “.bashrc” file by using the following command (nano is used here, but any text editor works):

$ nano ~/.bashrc

Add the following two lines to the end of the file to update the environment variables “SCALA_HOME” and “PATH”.

1: export SCALA_HOME=/home/rob/scala

2: export PATH=$PATH:$SCALA_HOME/bin

Step 2: He downloads Spark (spark-2.2.1-bin-hadoop2.7.tgz), extracts the archive in the “/home/rob/” directory, and renames the folder “spark”, so “/home/rob/spark” is the root directory for Spark. Now Rob needs to update the environment variables “SPARK_HOME” and “PATH”.

Answer: Open the “.bashrc” file again and add the following two lines to the end of the file.

1: export SPARK_HOME=/home/rob/spark

2: export PATH=$PATH:$SPARK_HOME/bin

After saving the file, run “source ~/.bashrc” (or open a new terminal) so that the updated variables take effect in the current session.
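As a quick sanity check (optional, and not part of the required answer), Rob can confirm from Python that the new variables are visible to freshly started processes. This short sketch assumes the shell has been re-sourced as described above; the file name “env_check.py” is just an illustrative choice.

# env_check.py — optional check that the exported variables are set
import os

# If either line prints "None", .bashrc was not re-sourced
# or one of the export lines contains a typo.
print(os.environ.get("SCALA_HOME"))  # expected: /home/rob/scala
print(os.environ.get("SPARK_HOME"))  # expected: /home/rob/spark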

Step 3: After the installation, we can use the following commands to verify that Spark works. To start the Python Spark shell, we should type:

$ pyspark

To start the Scala Spark shell, we should type:

$ spark-shell

(Both commands are found because “$SPARK_HOME/bin” was added to PATH in Step 2.)
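If both shells start without errors, a one-line job is a quick way to confirm that Spark can actually run tasks. The line below is typed at the PySpark prompt; “sc” is the SparkContext that the pyspark shell creates automatically, and the sample numbers are purely illustrative.

>>> sc.parallelize([1, 2, 3, 4]).sum()
10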

Step 4: Rob wants to run the WordCount example in batch mode. Suppose that the Python source code is in the file “WordCount.py”; please give the command for running this Python Spark source file. Since the input file name and output directory are hard coded in the source code, no parameters need to be passed on the command line.

$ spark-submit WordCount.py
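For reference, a minimal “WordCount.py” of the kind Step 4 describes might look like the sketch below. The input path “/home/rob/input.txt” and output directory “/home/rob/wordcount_output” are hypothetical placeholders; the question only says the paths are hard coded, not what they are.

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Create the Spark context; the application name appears in the Spark UI.
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Hypothetical hard-coded paths, per the Step 4 assumption.
    input_path = "file:///home/rob/input.txt"
    output_dir = "file:///home/rob/wordcount_output"

    # Split each line into words, pair each word with 1, then sum the counts.
    counts = (sc.textFile(input_path)
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    # Write the (word, count) pairs as text files into the output directory.
    counts.saveAsTextFile(output_dir)

    sc.stop()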

Explanation / Answer

The following PySpark program counts the words in a text file and then reports character counts for the words that occur at least a threshold number of times:

import sys

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Create the Spark context with the Spark configuration.
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)

    # Get the occurrence threshold from the second command-line argument.
    threshold = int(sys.argv[2])

    # Read in the text file (first argument) and split each line into words.
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))

    # Count the occurrences of each word.
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    # Keep only the words with at least `threshold` occurrences.
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)

    # Count the occurrences of each character in the remaining words.
    charCounts = (filtered.flatMap(lambda pair: pair[0])
                          .map(lambda c: (c, 1))
                          .reduceByKey(lambda v1, v2: v1 + v2))

    # Collect the (character, count) pairs to the driver and print them.
    results = charCounts.collect()
    print(repr(results)[1:-1])
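Note that this explanation script is more general than the one Step 4 describes: it takes the input path and a minimum-occurrence threshold on the command line (sys.argv[1] and sys.argv[2]), so it would be launched as, for example, “spark-submit WordCount.py input.txt 2”, and it prints per-character counts for the frequent words instead of writing word counts to an output directory.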
