


Question

Problem 1. Rob is installing Spark on Ubuntu 16.04. Please help him with the installation.

Step 1: Since Spark needs Scala, Rob must install Scala first. He downloads Scala (scala-2.12.4.tgz), extracts the archive in the “/home/rob/” directory, and renames the folder “scala”, so “/home/rob/scala” is the root directory for Scala. After that, please tell Rob how to update the environment variables “SCALA_HOME” and “PATH”.

Answer: Open the “.bashrc” file by using the following command (nano is used here, but any text editor works):

$ nano ~/.bashrc

Add the following two lines to the end of the file to update the environment variables “SCALA_HOME” and “PATH”.

1: export SCALA_HOME=/home/rob/scala

2: export PATH=$PATH:$SCALA_HOME/bin

Step 2: He downloads Spark (spark-2.2.1-bin-hadoop2.7.tgz), extracts the archive in the “/home/rob/” directory, and renames the folder “spark”, so “/home/rob/spark” is the root directory for Spark. Now Rob needs to update the environment variables “SPARK_HOME” and “PATH”.

Answer: Open the “.bashrc” file again and add the following two lines to the end of the file.

1: export SPARK_HOME=/home/rob/spark

2: export PATH=$PATH:$SPARK_HOME/bin

After saving the file, run “source ~/.bashrc” (or open a new terminal) so that the updated variables take effect in the current session.
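As a quick sanity check (optional, and not part of the required answer), Rob can confirm from Python that the new variables are visible to freshly started processes. This short sketch assumes the shell has been re-sourced as described above; the file name “env_check.py” is just an illustrative choice.

# env_check.py — optional check that the exported variables are set
import os

# If either line prints "None", .bashrc was not re-sourced
# or one of the export lines contains a typo.
print(os.environ.get("SCALA_HOME"))  # expected: /home/rob/scala
print(os.environ.get("SPARK_HOME"))  # expected: /home/rob/spark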

Step 3: After the installation, we can use the following commands to verify that Spark works. To start the Python Spark shell, we should type:

$ pyspark

To start the Scala Spark shell, we should type:

$ spark-shell

(Both commands are found because “$SPARK_HOME/bin” was added to PATH in Step 2.)
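If both shells start without errors, a one-line job is a quick way to confirm that Spark can actually run tasks. The line below is typed at the PySpark prompt; “sc” is the SparkContext that the pyspark shell creates automatically, and the sample numbers are purely illustrative.

>>> sc.parallelize([1, 2, 3, 4]).sum()
10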

Step 4: Rob wants to run the WordCount example in batch mode. Suppose that the Python source code is in the file “WordCount.py”; please give the command for running this Python Spark source file. Since the input file name and output directory are hard coded in the source code, no parameters need to be passed on the command line.

$ spark-submit WordCount.py
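For reference, a minimal “WordCount.py” of the kind Step 4 describes might look like the sketch below. The input path “/home/rob/input.txt” and output directory “/home/rob/wordcount_output” are hypothetical placeholders; the question only says the paths are hard coded, not what they are.

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Create the Spark context; the application name appears in the Spark UI.
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Hypothetical hard-coded paths, per the Step 4 assumption.
    input_path = "file:///home/rob/input.txt"
    output_dir = "file:///home/rob/wordcount_output"

    # Split each line into words, pair each word with 1, then sum the counts.
    counts = (sc.textFile(input_path)
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    # Write the (word, count) pairs as text files into the output directory.
    counts.saveAsTextFile(output_dir)

    sc.stop()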

Explanation / Answer

The following PySpark program counts the words in a text file and then reports character counts for the words that occur at least a threshold number of times:

import sys

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Create the Spark context with the Spark configuration.
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)

    # Get the occurrence threshold from the second command-line argument.
    threshold = int(sys.argv[2])

    # Read in the text file (first argument) and split each line into words.
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))

    # Count the occurrences of each word.
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    # Keep only the words with at least `threshold` occurrences.
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)

    # Count the occurrences of each character in the remaining words.
    charCounts = (filtered.flatMap(lambda pair: pair[0])
                          .map(lambda c: (c, 1))
                          .reduceByKey(lambda v1, v2: v1 + v2))

    # Collect the (character, count) pairs to the driver and print them.
    results = charCounts.collect()
    print(repr(results)[1:-1])
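Note that this explanation script is more general than the one Step 4 describes: it takes the input path and a minimum-occurrence threshold on the command line (sys.argv[1] and sys.argv[2]), so it would be launched as, for example, “spark-submit WordCount.py input.txt 2”, and it prints per-character counts for the frequent words instead of writing word counts to an output directory.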
