Graph Analysis with GraphFrames

This notebook walks through basic graph analysis using the GraphFrames package available on spark-packages.org. You'll be working with Bay Area bike share data from Kaggle.

Graph Theory and Graph Processing

Graph processing is an important aspect of analysis that applies to a lot of use cases. Fundamentally, graph theory and processing are about defining relationships between entities. Nodes (or vertices) are the entities, and edges are the relationships defined between them.

Some business use cases include finding the central people in social networks (identifying who is most popular in a group of friends), measuring the importance of papers in bibliographic networks (determining which papers are most referenced), and ranking web pages.

Graphs and Bike Trip Data

As mentioned, in this example you'll be using Bay Area bike share data. You'll orient your analysis by making every station a vertex and every trip an edge connecting two stations. Because each trip has a start and an end, this creates a directed graph.
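To make the modeling concrete, here is a minimal plain-Scala sketch of the idea, with no Spark involved. The `Station` and `Trip` case classes and all of the values are hypothetical and are not taken from the bike share dataset.

```scala
// Hypothetical toy model: names and values are illustrative only.
final case class Station(id: Int, name: String)
final case class Trip(src: String, dst: String, durationSec: Int)

// Vertices: one entry per station.
val stations = Seq(Station(1, "A"), Station(2, "B"))

// Edges: one entry per trip. Direction matters, so A -> B and B -> A
// are different edges.
val trips = Seq(
  Trip("A", "B", 300),
  Trip("A", "B", 420),
  Trip("B", "A", 360)
)

val outbound = trips.count(_.src == "A") // trips leaving station A
val inbound  = trips.count(_.dst == "A") // trips arriving at station A
println(s"Station A: $outbound outbound, $inbound inbound")
```

Keeping direction on each edge is what later allows inbound and outbound traffic to be counted separately for each station.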

Requirements

This notebook requires Databricks Runtime for Machine Learning.

Further Reference:

Graph Theory on Wikipedia

Create DataFrames

bikeStations: org.apache.spark.sql.DataFrame = [id: int, name: string ... 5 more fields]
tripData: org.apache.spark.sql.DataFrame = [id: int, duration: int ... 9 more fields]

Imports

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._

Build the graph

Now that you've imported your data, you need to build your graph. This takes two steps: building the structure of the vertices (or nodes) and building the structure of the edges. What's awesome about GraphFrames is that this process is incredibly simple. All you need to do is get the distinct id values for the vertices table and rename the start and end station columns to src and dst for the edges table. GraphFrames requires these column names: id for vertices, and src and dst for edges.

stationVertices: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, name: string ... 5 more fields]
tripEdges: org.apache.spark.sql.DataFrame = [id: int, duration: int ... 9 more fields]

stationGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: int, name: string ... 5 more fields], e:[src: string, dst: string ... 9 more fields])
res7: stationVertices.type = [id: int, name: string ... 5 more fields]

Total Number of Stations: 70
Total Number of Trips in Graph: 669959
Total Number of Trips in Original Data: 669959
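The build step described above might look roughly like the following sketch. It runs on a tiny in-memory dataset rather than the real tables, so the rows and the names `stations`, `trips`, `toyVertices`, `toyEdges`, and `toyGraph` are all hypothetical. The column names `start_station_id` and `end_station_id` are assumptions as well. The toy edges use station ids for src and dst so that they match the id column of the vertices. The sketch assumes an environment where Spark and GraphFrames are already available, such as the runtime named under Requirements.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// In a notebook this returns the already running session.
val session = SparkSession.builder().getOrCreate()

// Hypothetical station table: one row per station.
val stations = session.createDataFrame(Seq(
  (1, "A"), (2, "B"), (3, "C")
)).toDF("id", "name")

// Hypothetical trip table: one row per trip (column names are assumptions).
val trips = session.createDataFrame(Seq(
  (101, 300, 1, 2),
  (102, 420, 1, 2),
  (103, 360, 2, 1),
  (104, 900, 2, 3)
)).toDF("trip_id", "duration", "start_station_id", "end_station_id")

// Vertices need an "id" column; edges need "src" and "dst" columns.
val toyVertices = stations.distinct()
val toyEdges = trips
  .withColumnRenamed("start_station_id", "src")
  .withColumnRenamed("end_station_id", "dst")

val toyGraph = GraphFrame(toyVertices, toyEdges)

// Sanity check: the graph should contain every trip from the input.
println(s"Stations: ${toyGraph.vertices.count()}")
println(s"Trips in graph: ${toyGraph.edges.count()}")
println(s"Trips in original data: ${trips.count()}")
```

Comparing the edge count with the row count of the original trip table is a cheap way to confirm that no trips were dropped while building the graph.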

Trips From station to station

One question you might ask is what the most common destinations in the dataset are for a given starting location. You can answer this by grouping the edges by start and end station and counting the rows in each group. The result is a new set of edges in which each edge carries the count of all the semantically identical edges. Think about it this way: you have a number of trips that all go from station A to station B, and you want to count those up.

The following query identifies the most common station-to-station trips and prints out the top 10.
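A grouping query of this kind could be sketched as follows. The edge rows are made up for illustration, and the names `edges`, `graph`, and `topTrips` are hypothetical. The sketch again assumes an environment with Spark and GraphFrames available.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val session = SparkSession.builder().getOrCreate()

// Hypothetical trips: three from A to B, two from B to A, one from B to C.
val edges = session.createDataFrame(Seq(
  ("A", "B"), ("A", "B"), ("A", "B"),
  ("B", "A"), ("B", "A"),
  ("B", "C")
)).toDF("src", "dst")

// Build a graph from the edges alone; vertices are inferred from src and dst.
val graph = GraphFrame.fromEdges(edges)

// Collapse identical (src, dst) pairs into one row with a trip count,
// then keep the ten most frequent pairs.
val topTrips = graph.edges
  .groupBy("src", "dst")
  .count()
  .orderBy(desc("count"))
  .limit(10)

topTrips.show()
```

Because the grouping key is the ordered pair (src, dst), trips from A to B and trips from B to A are counted separately.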

In degrees and out degrees

Remember that in this instance you've got a directed graph, which means that your trips are directional: from one location to another. Because direction is recorded, you can count the trips that arrive at a specific station (its in-degree) and the trips that leave from it (its out-degree).

You can sort this information and find the stations with lots of inbound and outbound trips. Check out this definition of Vertex Degrees for more information.

Now that you've defined that process, go ahead and find the stations that have lots of inbound and outbound traffic.
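One way to find the busiest stations is with the `inDegrees` and `outDegrees` DataFrames that a GraphFrame exposes. The sketch below uses made-up edges, and the names `edges`, `graph`, `inDeg`, `outDeg`, `topInbound`, and `topOutbound` are hypothetical. It assumes an environment with Spark and GraphFrames available.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val session = SparkSession.builder().getOrCreate()

// Hypothetical trips between three stations.
val edges = session.createDataFrame(Seq(
  ("A", "B"), ("A", "B"), ("A", "B"),
  ("B", "A"),
  ("C", "B"),
  ("B", "C")
)).toDF("src", "dst")

val graph = GraphFrame.fromEdges(edges)

// inDegrees has columns (id, inDegree); outDegrees has (id, outDegree).
val inDeg = graph.inDegrees
val outDeg = graph.outDegrees

// Stations with the most inbound and the most outbound trips.
val topInbound = inDeg.orderBy(desc("inDegree")).limit(5)
val topOutbound = outDeg.orderBy(desc("outDegree")).limit(5)

topInbound.show()
topOutbound.show()
```

In this toy data, station B receives the most trips and station A sends the most, which shows that the two rankings do not have to agree.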

