Spark shell word count
Introduction
Nowadays, everyone is talking about big data, but how do you actually implement a big data application? A growing number of frameworks have emerged for manipulating large datasets, and the first that come to mind are usually Hadoop and Spark. Further discussion of other frameworks can be found in the article Top Big Data Processing Frameworks. Today I'm going to demonstrate a simple use of the Spark shell to solve the word count problem.
Prerequisites
- Spark installation & environment (Spark overview)
- Word count text file (http://www.gutenberg.org/files/4300/4300.zip)
- Hadoop file system (optional)
Hadoop file system
Check that the text file is present in the Hadoop file system (HDFS).
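For example, assuming the unzipped Gutenberg file is named 4300.txt and is uploaded to an HDFS directory called word_count_input (both names are just examples), it can be uploaded and listed with:

    hdfs dfs -put 4300.txt word_count_input/4300.txt
    hdfs dfs -ls word_count_input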
Spark shell
Start the Spark shell with the command spark-shell. If the environment hasn't been set up yet, please see the Spark environment variables.
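For reference, the shell is started as below; once it is up, a SparkContext is already available as the variable sc.

    spark-shell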
Word count
First, use the SparkContext object, which represents a connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster, to read a text file from HDFS (or the local file system) and return it as an RDD of Strings.
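A minimal sketch in the Spark shell, assuming the file sits at the example HDFS path word_count_input/4300.txt:

    // Read the file from HDFS (or a local path) as an RDD[String], one element per line
    val lines = sc.textFile("word_count_input/4300.txt")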
Second, transform the lines of String into the individual words by splitting on " ". If only a map function is used to split the RDD of Strings, a two-layer (nested) collection is produced, since each line yields its own array of words, for example [[Zoo, Zoo, Yes], [Yes, Water, Water]]. Using flatMap instead flattens these per-line arrays into a single collection of words.
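A sketch of this step with flatMap:

    // Split each line on spaces and flatten the per-line arrays into one RDD[String] of words
    val words = lines.flatMap(line => line.split(" "))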
Third, map each word to a pair indicating that the word has one occurrence, i.e. (word, 1).
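For example:

    // Pair each word with an initial count of 1, giving an RDD[(String, Int)]
    val pairs = words.map(word => (word, 1))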
Fourth, reduce all the pairs by key, adding up the number of occurrences for each word.
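For example:

    // Sum the counts for each distinct word
    val counts = pairs.reduceByKey((a, b) => a + b)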
Finally, save the count of words to a text file.
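For example, writing to an HDFS directory named word_count_output (the name used in the next step):

    // Save the (word, count) pairs as text files under word_count_output
    counts.saveAsTextFile("word_count_output")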
To see the output, use the HDFS command to list the files in the directory named word_count_output. To see the content of an output file, use the HDFS command to copy it into the current directory; the list of word occurrences can then be viewed with a cat command.
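For example (Spark writes the result as part-xxxxx files inside the output directory; part-00000 below is just the first of them):

    hdfs dfs -ls word_count_output
    hdfs dfs -get word_count_output/part-00000 .
    cat part-00000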