- Read streaming data into Spark
- Create and apply computations over a sliding window of data
Step 1. Open Jupyter Python Notebook for Spark Streaming. Open a web browser by clicking on the web browser icon at the top of the toolbar:
Navigate to localhost:8889/tree/Downloads/big-data-3/spark-streaming:
Open the Spark Streaming Notebook by clicking on Spark-Streaming.ipynb:
Step 2. Look at sensor format and measurement types. The first cell in the notebook gives an example of the streaming measurements coming from the weather station:
Each line contains a timestamp and a set of measurements. Each measurement has an abbreviation, and for this exercise, we are interested in the average wind direction, which is Dm. The next cell lists the abbreviations used for each type of measurement:
The third cell defines a function that parses each line and returns the average wind direction (Dm). Run this cell:
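A minimal sketch of what such a parsing function could look like is shown below. It assumes the average wind direction appears in each line as `Dm=` followed by digits; the sample line is hypothetical, and the actual notebook cell may differ. Returning a list (rather than a bare number) lets lines without a Dm field produce no output at all.

```python
import re

# Assumed field layout: "Dm=" followed by one or more digits.
DM_PATTERN = re.compile(r"Dm=(\d+)")

def parse(line):
    """Return [Dm] as a one-element list, or [] if the line has no Dm field."""
    match = DM_PATTERN.search(line)
    if match:
        return [int(match.group(1))]
    return []

# Hypothetical line in the assumed format: a timestamp, then measurements.
sample = "1419408015\t0R1,Dn=059D,Dm=066D,Dx=080D,Sn=8.5M,Sm=9.5M,Sx=10.3M"
print(parse(sample))                # [66]
print(parse("no wind field here"))  # []
```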
Step 3. Import and create streaming context. Next, we will import and create a new instance of Spark's StreamingContext:
Similar to the SparkContext, the StreamingContext provides an interface to Spark's streaming capabilities. The argument sc is the SparkContext, and 1 specifies a batch interval of one second.
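As a rough sketch, the import and construction might look like the following. The helper name `make_streaming_context` is hypothetical; in the notebook, `sc` already exists, so the two statements inside the function can be run directly in a cell.

```python
def make_streaming_context(sc, batch_seconds=1):
    """Create a StreamingContext on top of an existing SparkContext."""
    # Imported here so the sketch can be loaded even where PySpark is absent.
    from pyspark.streaming import StreamingContext

    # The second argument is the batch interval in seconds.
    return StreamingContext(sc, batch_seconds)
```

In a notebook cell this would typically be used as `ssc = make_streaming_context(sc)`.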
Step 4. Create DStream of weather data. Let's open a connection to the streaming weather data:
If port 12028 does not work, try port 12020. This creates a new variable, lines, which is a Spark DStream that streams the lines of output from the weather station.
Step 5. Read measurement. Next, let's read the average wind direction from each line and store it in a new DStream vals:
This line uses flatMap() to iterate over the lines DStream, and calls the parse() function we defined above to get the average wind direction.
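To see why flatMap() is used rather than map(), the plain-Python sketch below mimics the behavior without Spark. The names `flat_map` and `digits_or_nothing` are hypothetical stand-ins for illustration: a function that returns a list of zero or one values, once flattened, silently drops inputs that could not be parsed.

```python
from itertools import chain

def flat_map(func, items):
    """Plain-Python stand-in for flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))

def digits_or_nothing(text):
    """Return [int] for numeric text, or [] for anything else."""
    return [int(text)] if text.isdigit() else []

inputs = ["310", "n/a", "321"]

mapped = [digits_or_nothing(t) for t in inputs]
flattened = flat_map(digits_or_nothing, inputs)

print(mapped)     # [[310], [], [321]]
print(flattened)  # [310, 321]
```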
Step 6. Create sliding window of data. We can create a sliding window over the measurements by calling the window() method:
This creates a new DStream called window that combines ten seconds' worth of data and slides forward every five seconds.
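Steps 4 through 6 can be sketched together as one small function. The helper name `build_window` and its parameters are hypothetical; `host` must be the weather station address used in the notebook, which is not given here, and `parse` is the parsing function from Step 2. The calls `socketTextStream()`, `flatMap()`, and `window()` are the standard PySpark streaming methods.

```python
def build_window(ssc, host, port, parse, length=10, slide=5):
    """Connect to a text socket, parse each line, and window the results."""
    # DStream of raw text lines from the socket
    lines = ssc.socketTextStream(host, port)
    # DStream of parsed values; lines that yield [] contribute nothing
    vals = lines.flatMap(parse)
    # Window of `length` seconds that advances every `slide` seconds
    return vals.window(length, slide)
```

A call would look like `window = build_window(ssc, host, 12028, parse)`, with the window length and slide interval both being multiples of the one-second batch interval.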
Step 7. Define and call analysis function. We would like to find the minimum and maximum values in our window. Let's define a function that prints these values for an RDD:
This function first prints the entire contents of the RDD by calling the collect() method. This is done to demonstrate the sliding window and would not be practical if the RDD contained a large amount of data. Next, we check that the size of the RDD is greater than zero before printing the maximum and minimum values.
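One possible shape for this function is sketched below. It relies only on the RDD methods `collect()`, `count()`, `max()`, and `min()`. The helper `window_extremes` and the `ListRDD` class are hypothetical additions: the first separates the computation from the printing, and the second is a tiny stand-in so the sketch can be tried without Spark.

```python
def window_extremes(rdd):
    """Return (max, min) for a non-empty RDD, or None if it is empty."""
    if rdd.count() > 0:
        return rdd.max(), rdd.min()
    return None

def stats(rdd):
    """Print the RDD contents, then its maximum and minimum if any exist."""
    print(rdd.collect())  # fine for a demo; avoid on large RDDs
    extremes = window_extremes(rdd)
    if extremes is not None:
        print("max = {}, min = {}".format(*extremes))

class ListRDD:
    """Hypothetical stand-in exposing the four RDD methods used above."""
    def __init__(self, items):
        self._items = list(items)
    def collect(self):
        return list(self._items)
    def count(self):
        return len(self._items)
    def max(self):
        return max(self._items)
    def min(self):
        return min(self._items)

stats(ListRDD([310, 321, 323, 325, 326]))
stats(ListRDD([]))
```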
Next, we call the stats() function for each RDD in our sliding window:
This line calls the stats() function defined above for each RDD in the DStream window.
Step 8. Start the stream processing. We call start() on the StreamingContext to begin the processing:
The sliding window contains ten seconds' worth of data and slides every five seconds. In the beginning, the number of values in each window increases as data accumulates, and after Window 3, the size stays (approximately) the same. Since the slide interval is half the window length, the second half of a window becomes the first half of the next window. For example, the second half of Window 5 is 310, 321, 323, 325, 326, which becomes the first half of Window 6.
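The overlap between consecutive windows can be reproduced without Spark. The sketch below is an idealized simulation that assumes exactly one reading arrives per second; the readings are made up, and the function name `sliding_windows` is hypothetical. Because arrivals are perfectly regular here, the window size settles by the second window, whereas the live stream fills up less evenly.

```python
def sliding_windows(values, length=10, slide=5):
    """Yield each window's contents, assuming one value arrives per second."""
    for end in range(slide, len(values) + 1, slide):
        start = max(0, end - length)
        yield values[start:end]

# Hypothetical wind directions, one per second for 20 seconds
readings = list(range(300, 320))
windows = list(sliding_windows(readings))

for number, contents in enumerate(windows, start=1):
    print("Window {}: {}".format(number, contents))

# The second half of each full window is the first half of the next one
print(windows[1][5:] == windows[2][:5])  # True
```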
When we are done, call stop() on the StreamingContext:
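For use outside an interactive notebook, starting and stopping can be wrapped in one helper, sketched below. The name `run_for` is hypothetical. Passing `stopSparkContext=False` is an optional choice made here so that the underlying SparkContext remains usable afterwards; a plain `ssc.stop()` also stops the SparkContext by default.

```python
def run_for(ssc, seconds):
    """Start the streaming computation, let it run briefly, then stop it."""
    ssc.start()
    # Block for up to `seconds` so that several windows are processed
    ssc.awaitTerminationOrTimeout(seconds)
    # Stop streaming but keep the SparkContext alive
    ssc.stop(stopSparkContext=False)
```

Note that a stopped StreamingContext cannot be restarted; a new one has to be created to run the stream again.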