graph-analysis-graphframes(Scala)

Graph Analysis with GraphFrames

This notebook walks through basic graph analysis with the GraphFrames package, available on spark-packages.org. The goal is to show you how to use GraphFrames to perform graph analysis, using Bay Area bike share data from Kaggle.

Graph Theory and Graph Processing

Graph processing is an important aspect of analysis that applies to a lot of use cases. Fundamentally, graph theory and processing are about defining relationships between entities. Nodes, or vertices, are the entities, while edges are the relationships defined between them.

Typical business use cases include finding the central people in social networks (identifying who is most popular in a group of friends), measuring the importance of papers in bibliographic networks (determining which papers are most referenced), and ranking web pages.

Graphs and Bike Trip Data

As mentioned, this example uses Bay Area bike share data. To frame the analysis, every station becomes a vertex and every trip becomes an edge connecting two stations. This creates a directed graph.

Requirements

This notebook requires Databricks Runtime for Machine Learning.

Create DataFrames
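
The cell that loads the data is not shown here. One possible way to get the two DataFrames, assuming the Kaggle CSV files have been uploaded somewhere Spark can read them, is sketched below. The helper name loadCsv and any paths passed to it are placeholders, not part of the original notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: read a CSV file that has a header row,
// letting Spark infer the column types from the data.
def loadCsv(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path)
```

In a notebook where `spark` is already defined, `val bikeStations = loadCsv(spark, "/path/to/station.csv")` and `val tripData = loadCsv(spark, "/path/to/trip.csv")` would produce DataFrames like the ones printed in this section, provided the inferred types match.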

bikeStations: org.apache.spark.sql.DataFrame = [id: int, name: string ... 5 more fields]
tripData: org.apache.spark.sql.DataFrame = [id: int, duration: int ... 9 more fields]

      It is often helpful to look at the exact schema to ensure that you have the right types associated with the right columns.

      root
       |-- id: integer (nullable = true)
       |-- name: string (nullable = true)
       |-- lat: double (nullable = true)
       |-- long: double (nullable = true)
       |-- dock_count: integer (nullable = true)
       |-- city: string (nullable = true)
       |-- installation_date: string (nullable = true)

      root
       |-- id: integer (nullable = true)
       |-- duration: integer (nullable = true)
       |-- start_date: string (nullable = true)
       |-- start_station_name: string (nullable = true)
       |-- start_station_id: integer (nullable = true)
       |-- end_date: string (nullable = true)
       |-- end_station_name: string (nullable = true)
       |-- end_station_id: integer (nullable = true)
       |-- bike_id: integer (nullable = true)
       |-- subscription_type: string (nullable = true)
       |-- zip_code: string (nullable = true)

      Imports

      You need to import a few things before you can continue: a variety of SQL functions that make working with DataFrames much easier, and everything you need from GraphFrames.

      import org.apache.spark.sql._
      import org.apache.spark.sql.functions._
      import org.graphframes._

      Build the graph

      Now that you've imported your data, you need to build your graph. To do so, you build two things: the structure of the vertices (or nodes) and the structure of the edges. What's awesome about GraphFrames is that this process is incredibly simple. All you need to do is keep the distinct stations, identified by their id column, as your vertices table, and rename the start and end station columns to src and dst respectively for your edges table. These column names are required conventions for vertices and edges in GraphFrames.
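
The two steps described above can be sketched as a pair of small helpers. The helper names toVertices and toEdges are hypothetical, and the columns to rename are left as parameters because the original cell is not shown.

```scala
import org.apache.spark.sql.DataFrame

// Vertices: one row per station. GraphFrames expects a column named "id",
// which the stations table already has.
def toVertices(stations: DataFrame): DataFrame =
  stations.distinct()

// Edges: GraphFrames expects columns named "src" and "dst".
// srcCol and dstCol are parameters because the column choice is an assumption here.
def toEdges(trips: DataFrame, srcCol: String, dstCol: String): DataFrame =
  trips
    .withColumnRenamed(srcCol, "src")
    .withColumnRenamed(dstCol, "dst")
```

For example, `val stationVertices = toVertices(bikeStations)` and `val tripEdges = toEdges(tripData, "start_station_name", "end_station_name")`. The graph output later in this notebook shows src and dst as strings, which suggests the station name columns were renamed. Since the vertex id is the integer station id, algorithms that match edges to vertices would likely need start_station_id and end_station_id renamed instead.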

      stationVertices: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, name: string ... 5 more fields]
      tripEdges: org.apache.spark.sql.DataFrame = [id: int, duration: int ... 9 more fields]

          Now you can build your graph.

          You're also going to cache the input DataFrames to your graph.
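
A minimal sketch of this step follows. The helper names buildGraph and graphSize are hypothetical; GraphFrame(vertices, edges) is the GraphFrames constructor, and it expects an id column on the vertices and src and dst columns on the edges.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

// Cache both inputs so repeated graph queries do not recompute them,
// then wrap them in a GraphFrame.
def buildGraph(vertices: DataFrame, edges: DataFrame): GraphFrame = {
  vertices.cache()
  edges.cache()
  GraphFrame(vertices, edges)
}

// Return (vertex count, edge count) as a quick sanity check.
def graphSize(graph: GraphFrame): (Long, Long) =
  (graph.vertices.count(), graph.edges.count())
```

With `val stationGraph = buildGraph(stationVertices, tripEdges)`, comparing the edge count from `graphSize(stationGraph)` with `tripData.count()` confirms that no trips were lost while building the graph.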

          stationGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: int, name: string ... 5 more fields], e:[src: string, dst: string ... 9 more fields])
          res7: stationVertices.type = [id: int, name: string ... 5 more fields]

          Total Number of Stations: 70
          Total Number of Trips in Graph: 669959
          Total Number of Trips in Original Data: 669959

          Trips from station to station

          One question you might ask is what the most common destinations in the dataset are for a given starting location. You can answer it by grouping the edges by source and destination and counting the edges in each group. Conceptually, this yields a new graph in which each edge carries the count of all the trips that share the same start and end station. Think about it this way: you have a number of identical trips from station A to station B, and you want to count them up.

          The following query identifies the most common station to station trips and prints out the top 10.
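
One way to write such a query is sketched below. The helper name topTrips is hypothetical; it only assumes that the graph's edges have the src and dst columns that GraphFrames requires.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// Count the trips for each (src, dst) pair and keep the n busiest routes.
def topTrips(graph: GraphFrame, n: Int): DataFrame =
  graph.edges
    .groupBy("src", "dst")
    .count()
    .orderBy(desc("count"))
    .limit(n)
```

Calling `topTrips(stationGraph, 10).show(false)` prints the ten most common station-to-station trips without truncating the station names.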

          The results suggest that a given vertex being a Caltrain station is significant. This makes sense, because train riders might need a way to get to their final destination after riding the train.

          In degrees and out degrees

          Remember that in this instance you've got a directed graph, which means that your trips are directional, going from one location to another. That gives you access to a wealth of analysis: you can find the number of trips that go into a specific station (its in-degree) and the number that leave from it (its out-degree).

          You can sort this information and find the stations with lots of inbound and outbound trips. Check out this definition of Vertex Degrees for more information.

          Now that you've defined that process, go ahead and find the stations that have lots of inbound and outbound traffic.
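
A sketch of both rankings, using the inDegrees and outDegrees DataFrames that GraphFrames exposes on a graph. The helper names busiestArrivals and busiestDepartures are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// inDegrees has columns (id, inDegree): the number of edges ending at each vertex.
def busiestArrivals(graph: GraphFrame, n: Int): DataFrame =
  graph.inDegrees.orderBy(desc("inDegree")).limit(n)

// outDegrees has columns (id, outDegree): the number of edges starting at each vertex.
def busiestDepartures(graph: GraphFrame, n: Int): DataFrame =
  graph.outDegrees.orderBy(desc("outDegree")).limit(n)
```

For the five-row results shown in this section, the calls would be `busiestArrivals(stationGraph, 5)` and `busiestDepartures(stationGraph, 5)`.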

          [Bar chart: inDegree by station id, 5 rows. Legible labels: San Francisco Caltrain (Townsend at 4th), Harry Bridges Plaza (Ferry Building), 2nd at Townsend.]

          [Bar chart: outDegree by station id, 5 rows. Legible labels: San Francisco Caltrain (Townsend at 4th), Embarcadero at Sansome.]

          One interesting follow-up question is which station has the highest ratio of in-degree to out-degree. In other words, which station acts as almost a pure trip sink: a station where trips end but rarely start?

          You can do something similar by finding the stations with the lowest in-degree to out-degree ratios, meaning that trips start from those stations but don't end there as often. This is essentially the opposite of what you have above.
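
Both rankings, the sink-like stations and the source-like ones, can be derived from a single ratio table. The sketch below uses a hypothetical helper named degreeRatio and an output column named degreeRatio; neither name is confirmed by the original notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

// Join in- and out-degrees on the vertex id and compute inDegree / outDegree as a double.
// The inner join drops vertices that have no inbound or no outbound edges.
def degreeRatio(graph: GraphFrame): DataFrame =
  graph.inDegrees
    .join(graph.outDegrees, "id")
    .select(
      col("id"),
      (col("inDegree").cast("double") / col("outDegree").cast("double")).as("degreeRatio")
    )
```

Sorting with `degreeRatio(stationGraph).orderBy(col("degreeRatio").desc).limit(10)` lists the stations that behave most like sinks, while `degreeRatio(stationGraph).orderBy("degreeRatio").limit(10)` lists the ones where trips mostly start.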

            The conclusions from this analysis are relatively straightforward. A higher ratio means that many more trips come into that station than leave it, and a lower ratio means that many more trips leave that station than come into it.