- Access Postgres database tables with SparkSQL
- Filter rows and columns of a Spark DataFrame
- Group and perform aggregate functions on columns in a Spark DataFrame
- Join two Spark DataFrames on a single column
Step 1. Open Jupyter Python Notebook for SparkSQL. First, open the Terminal and enter the command "pyspark" to set up the server. Next, open a web browser by clicking on the web browser icon at the top of the toolbar:



Step 2. Connect to Postgres Table. This notebook already contains three lines of code, so you do not have to enter them. Run these three lines. The first line imports the SQLContext module, which is needed to access SQL databases in Spark:
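A minimal sketch of what these lines might look like, assuming the SparkContext is already available as sc; the JDBC URL, user, and table name below are placeholders, so substitute the values for your own environment:

from pyspark.sql import SQLContext

# Create an SQLContext from the existing SparkContext (sc)
sqlsc = SQLContext(sc)

# Load a Postgres table into a Spark DataFrame over JDBC
# (URL, user, and table name are placeholders)
df = sqlsc.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost/mydatabase?user=myuser") \
    .option("dbtable", "mytable") \
    .load()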



Step 3. View Spark DataFrame schema and count rows. We can call the printSchema() method to view the schema of the DataFrame:
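Assuming the DataFrame from Step 2 is named df, this looks like:

# Print the column names and types of the DataFrame
df.printSchema()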

We can also call the count() method to count the number of rows in the DataFrame:
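For example:

# Return the number of rows in the DataFrame
df.count()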

Step 4. View contents of DataFrame. We can call the show() method to view the contents of the DataFrame. The argument specifies how many rows to display:
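For example, to display the first five rows:

# Show the first 5 rows of the DataFrame
df.show(5)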

Step 5. Filter columns in DataFrame. We can filter for one or more columns by calling the select() method:
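A short sketch; the column names here are illustrative, so use columns that exist in your own table:

# Keep only two columns of interest and display the first 5 rows
df.select("userid", "teamlevel").show(5)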

Step 6. Filter rows based on criteria. We can also filter for rows that match specific criteria using filter():
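A sketch of a row filter; the column name and threshold are illustrative:

# Keep only rows where teamlevel is greater than 1, then display a few of them
df.filter(df["teamlevel"] > 1).select("userid", "teamlevel").show(5)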

Step 7. Group by a column and count. The groupBy() method groups the rows by the values of one or more columns. The ishit column only has values 0 and 1. We can calculate how many times each occurs by grouping on the ishit column and counting the result:
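For example:

# Count how many rows have ishit = 0 and how many have ishit = 1
df.groupBy("ishit").count().show()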

Step 8. Calculate average and sum. Aggregate operations can be performed on columns of DataFrames. First, let's import the Python libraries for the aggregate operations. Next, we can calculate the average and total values by calling the mean() and sum() methods, respectively:
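A sketch of both operations on the ishit column:

from pyspark.sql.functions import mean, sum

# Compute the average and the total of the ishit column in a single pass
df.select(mean("ishit"), sum("ishit")).show()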




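The next lines refer to a DataFrame called merge, which, per the objectives above, is the result of joining two DataFrames on a single column. A minimal sketch, assuming a second table is loaded into a DataFrame named df2 and both tables share a userid column (the second table name and the join column are assumptions):

# Load a second table the same way as in Step 2 (table name is a placeholder)
df2 = sqlsc.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost/mydatabase?user=myuser") \
    .option("dbtable", "mytable2") \
    .load()

# Join the two DataFrames on their shared userid column
merge = df.join(df2, "userid")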
Let's view the schema of merge:
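# The joined DataFrame contains columns from both source tables
merge.printSchema()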

Finally, let's look at the contents of merge:
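# Display the first 5 rows of the joined DataFrame
merge.show(5)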
