Monday, July 16, 2018

Exploring Spark SQL and Spark DataFrames

By the end of this activity, you will be able to:
  1. Access Postgres database tables with Spark SQL
  2. Filter rows and columns of a Spark DataFrame
  3. Group and perform aggregate functions on columns in a Spark DataFrame
  4. Join two Spark DataFrames on a single column

Step 1. Open Jupyter Python Notebook for SparkSQL. First, open the Terminal and enter the command "pyspark" to set up the server. Next, open a web browser by clicking on the web browser icon at the top of the toolbar:
Navigate to localhost:8889/tree/Downloads/big-data-3/spark-sql:
Open the SparkSQL Notebook by clicking on SparkSQL.ipynb:

Step 2. Connect to Postgres Table. This notebook already contains three lines of code, so you do not have to enter them. Run these three lines. The first line imports the SQLContext module, which is needed to access SQL databases in Spark:
The second line creates a new SQLContext from the SparkContext sc:
The third line creates a new Spark DataFrame in the variable df for the Postgres table gameclicks:
The format("jdbc") call specifies that the source of the DataFrame will be a Java Database Connectivity (JDBC) connection, the url option is the connection string used to access the Postgres database, and the dbtable option specifies the gameclicks table.
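A sketch of what those three cells might look like is below; note that the JDBC URL shown is a placeholder and will depend on how Postgres is configured on your machine:

    from pyspark.sql import SQLContext

    # Create a SQLContext from the existing SparkContext sc
    sqlsc = SQLContext(sc)

    # Load the Postgres table gameclicks over JDBC into a Spark DataFrame
    # (the url below is a placeholder connection string)
    df = sqlsc.read.format("jdbc") \
        .option("url", "jdbc:postgresql://localhost/cloudera?user=cloudera") \
        .option("dbtable", "gameclicks") \
        .load()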

Step 3. View Spark DataFrame schema and count rows. We can call the printSchema() method to view the schema of the DataFrame:
The description lists the name and data type of each column.
We can also call the count() method to count the number of rows in the DataFrame:
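In code, the two calls look like this:

    # Print the name and data type of each column
    df.printSchema()

    # Count the number of rows in the DataFrame
    df.count()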

Step 4. View contents of DataFrame. We can call the show() method to view the contents of the DataFrame. The argument specifies how many rows to display:
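For example, to display the first five rows:

    # Show the first 5 rows of the DataFrame
    df.show(5)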

Step 5. Filter columns in DataFrame. We can select one or more columns of the DataFrame by calling the select() method:
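A short sketch, using the userid and teamlevel columns from the gameclicks schema:

    # Keep only the userid and teamlevel columns and show the first 5 rows
    df.select("userid", "teamlevel").show(5)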

Step 6. Filter rows based on criteria. We can also filter for rows that match a specific criterion using filter():
The argument to filter() is a Boolean Column expression, in this case df["teamlevel"] > 1, which keeps only the rows whose teamlevel is greater than 1. The remainder of the command selects only the userid and teamlevel columns and shows the first five rows.
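Putting that together:

    # Keep only rows with teamlevel greater than 1, then select
    # the userid and teamlevel columns and show the first 5 rows
    df.filter(df["teamlevel"] > 1).select("userid", "teamlevel").show(5)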

Step 7. Group by a column and count. The groupBy() method groups the values of column(s). The ishit column only has values 0 and 1. We can calculate how many times each occurs by grouping the ishit column and counting the result:
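For example:

    # Count how many rows have ishit = 0 and how many have ishit = 1
    df.groupBy("ishit").count().show()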

Step 8. Calculate average and sum. Aggregate operations can be performed on columns of DataFrames. First, let's import the Python libraries for the aggregate operations. Next, we can calculate the average and total values by calling the mean() and sum() methods, respectively:
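A sketch of those two cells, assuming the aggregates are computed over the ishit column:

    # mean and sum here are the Spark SQL aggregate functions,
    # not the Python built-ins
    from pyspark.sql.functions import mean, sum

    # Compute the average and the total of the ishit column
    df.select(mean("ishit"), sum("ishit")).show()
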
Step 9. Join two DataFrames. We can merge or join two DataFrames on a single column. First, let's create a DataFrame for the adclicks table in the Postgres database by copying the third cell in this notebook, changing gameclicks to adclicks, and storing the result in a new variable df2:
Let's view the columns in df2 by calling printSchema():
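A sketch of those cells (again, the JDBC URL is a placeholder):

    # Load the adclicks table into a second DataFrame
    df2 = sqlsc.read.format("jdbc") \
        .option("url", "jdbc:postgresql://localhost/cloudera?user=cloudera") \
        .option("dbtable", "adclicks") \
        .load()

    # View the columns of adclicks
    df2.printSchema()
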
We can see that the adclicks df2 DataFrame also has a column called userid. Next, we will combine the gameclicks and adclicks DataFrames by calling the join() method and saving the resulting DataFrame in a variable called merge:
We are calling the join() method on the gameclicks DataFrame; the first argument is the DataFrame to join with, i.e., the adclicks DataFrame, and the second argument is the column name in both DataFrames to join on.
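In code:

    # Join gameclicks and adclicks on their shared userid column
    merge = df.join(df2, "userid")
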
Let's view the schema of merge:
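    # Print the schema of the merged DataFrame
    merge.printSchema()
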
We can see that the merged DataFrame has all the columns of both gameclicks and adclicks.
Finally, let's look at the contents of merge:
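    # Show the first 5 rows of the merged DataFrame
    merge.show(5)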
