decision-trees-sfo-airport-survey-example(Python)


SFO Survey Machine Learning Example

Each year, San Francisco Airport (SFO) conducts a customer satisfaction survey to find out what they are doing well and where they can improve. The survey gauges satisfaction with SFO facilities, services, and amenities. SFO compares results to previous surveys to discover elements of the guest experience that are not satisfactory.

The 2013 SFO Survey Results consist of customer responses to survey questions and an overall satisfaction rating for the airport. We investigated whether we could use machine learning to predict a customer's overall rating given their responses to the individual questions. That in itself is not very useful, because the customer has already provided an overall rating as well as individual ratings for various aspects of the airport such as parking, food quality, and restroom cleanliness. However, we didn't stop at prediction. Instead, we asked the question:

What factors drove the customer to give the overall rating?

Here is an outline of our data flow:

  • Load data: Load the data as a DataFrame.
  • Understand the data: Compute statistics and create visualizations to get a better understanding of the data and to see whether basic statistics can answer the question above.
  • Create Model: Build a decision tree model on the training dataset.
  • Evaluate the model: Now look at the test dataset. Compare the initial model with the tuned model to see the benefit of tuning parameters.
  • Feature Importance: Determine the importance of each of the individual ratings in determining the customer's overall rating.

This dataset is available as a public dataset from https://catalog.data.gov/dataset/2013-sfo-customer-survey-d3541.

Load the Data

    Use display() to confirm the data has been loaded.
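A minimal sketch of the load step, assuming the survey has been downloaded as a CSV file with a header row. The function name `load_survey` and the path in the usage comment are hypothetical; `spark` is assumed to be the SparkSession that a Databricks notebook provides.

```python
def load_survey(spark, path):
    """Read the survey CSV into a Spark DataFrame.

    `spark` is an existing SparkSession; `path` is wherever the downloaded
    file was stored (hypothetical, not given in this notebook).
    """
    # header=True takes column names from the first row;
    # inferSchema=True asks Spark to detect integer/double/string types.
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage in a notebook (the path is an assumption):
# survey = load_survey(spark, "/path/to/2013_SFO_Customer_Survey.csv")
# display(survey)
```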

      Understand the Data

      Gain a basic understanding of the data by looking at the schema.

        root
         |-- RESPNUM: integer (nullable = true)
         |-- CCGID: string (nullable = true)
         |-- RUN: integer (nullable = true)
         |-- INTDATE: integer (nullable = true)
         |-- GATE: integer (nullable = true)
         |-- STRATA: integer (nullable = true)
         |-- PEAK: integer (nullable = true)
         |-- METHOD: integer (nullable = true)
         |-- AIRLINE: integer (nullable = true)
         |-- FLIGHT: integer (nullable = true)
         |-- DEST: integer (nullable = true)
         |-- DESTGEO: integer (nullable = true)
         |-- DESTMARK: integer (nullable = true)
         |-- ARRTIME: string (nullable = true)
         |-- DEPTIME: string (nullable = true)
         |-- Q2PURP1: integer (nullable = true)
         |-- Q2PURP2: integer (nullable = true)
         |-- Q2PURP3: integer (nullable = true)
         |-- Q2PURP4: integer (nullable = true)
         |-- Q2PURP5: integer (nullable = true)
         |-- Q2PURP6: string (nullable = true)
         |-- Q3GETTO1: integer (nullable = true)
         |-- Q3GETTO2: integer (nullable = true)
         |-- Q3GETTO3: integer (nullable = true)
         |-- Q3GETTO4: integer (nullable = true)
         |-- Q3GETTO5: string (nullable = true)
         |-- Q3GETTO6: string (nullable = true)
         |-- Q3PARK: integer (nullable = true)
         |-- Q4BAGS: integer (nullable = true)
         |-- Q4BUY: integer (nullable = true)
         |-- Q4FOOD: integer (nullable = true)
         |-- Q4WIFI: integer (nullable = true)
         |-- Q5FLYPERYR: integer (nullable = true)
         |-- Q6TENURE: double (nullable = true)
         |-- SAQ: integer (nullable = true)
         |-- Q7A_ART: integer (nullable = true)
         |-- Q7B_FOOD: integer (nullable = true)
         |-- Q7C_SHOPS: integer (nullable = true)
         |-- Q7D_SIGNS: integer (nullable = true)
         |-- Q7E_WALK: integer (nullable = true)
         |-- Q7F_SCREENS: integer (nullable = true)
         |-- Q7G_INFOARR: integer (nullable = true)
         |-- Q7H_INFODEP: integer (nullable = true)
         |-- Q7I_WIFI: integer (nullable = true)
         |-- Q7J_ROAD: integer (nullable = true)
         |-- Q7K_PARK: integer (nullable = true)
         |-- Q7L_AIRTRAIN: integer (nullable = true)
         |-- Q7M_LTPARK: integer (nullable = true)
         |-- Q7N_RENTAL: integer (nullable = true)
         |-- Q7O_WHOLE: integer (nullable = true)
         |-- Q8COM1: integer (nullable = true)
         |-- Q8COM2: integer (nullable = true)
         |-- Q8COM3: integer (nullable = true)
         |-- Q9A_CLNBOARD: integer (nullable = true)
         |-- Q9B_CLNAIRTRAIN: integer (nullable = true)
         |-- Q9C_CLNRENT: integer (nullable = true)
         |-- Q9D_CLNFOOD: integer (nullable = true)
         |-- Q9E_CLNBATH: integer (nullable = true)
         |-- Q9F_CLNWHOLE: integer (nullable = true)
         |-- Q9COM1: integer (nullable = true)
         |-- Q9COM2: integer (nullable = true)
         |-- Q9COM3: string (nullable = true)
         |-- Q10SAFE: integer (nullable = true)
         |-- Q10COM1: integer (nullable = true)
         |-- Q10COM2: integer (nullable = true)
         |-- Q10COM3: integer (nullable = true)
         |-- Q11A_USEWEB: integer (nullable = true)
         |-- Q11B_USESFOAPP: integer (nullable = true)
         |-- Q11C_USEOTHAPP: integer (nullable = true)
         |-- Q11D_USESOCMED: integer (nullable = true)
         |-- Q11E_USEWIFI: integer (nullable = true)
         |-- Q12COM1: integer (nullable = true)
         |-- Q12COM2: integer (nullable = true)
         |-- Q12COM3: integer (nullable = true)
         |-- Q13_WHEREDEPART: integer (nullable = true)
         |-- Q13_RATEGETTO: integer (nullable = true)
         |-- Q14A_FIND: integer (nullable = true)
         |-- Q14B_SECURITY: integer (nullable = true)
         |-- Q15_PROBLEMS: integer (nullable = true)
         |-- Q15COM1: integer (nullable = true)
         |-- Q15COM2: integer (nullable = true)
         |-- Q15COM3: string (nullable = true)
         |-- Q16_REGION: integer (nullable = true)
         |-- Q17_CITY: string (nullable = true)
         |-- Q17_ZIP: integer (nullable = true)
         |-- Q17_COUNTRY: string (nullable = true)
         |-- HOME: integer (nullable = true)
         |-- Q18_AGE: integer (nullable = true)
         |-- Q19_SEX: integer (nullable = true)
         |-- Q20_INCOME: integer (nullable = true)
         |-- Q21_HIFLYER: integer (nullable = true)
         |-- Q22A_USESJC: integer (nullable = true)
         |-- Q22B_USEOAK: integer (nullable = true)
         |-- LANG: integer (nullable = true)
         |-- WEIGHT: double (nullable = true)

        As you can see above, the survey contains many questions, including which airline the customer flew on and where they live. To answer the question above, focus on the Q7A, Q7B, Q7C .. Q7O questions, since they relate directly to customer satisfaction, which is what you want to measure. If you drill down on those variables you get the following:

        Column Name|Data Type|Description
        ---|---|---
        Q7B_FOOD|INTEGER|Restaurants
        Q7C_SHOPS|INTEGER|Retail shops and concessions
        Q7D_SIGNS|INTEGER|Signs and Directions inside SFO
        Q7E_WALK|INTEGER|Escalators / elevators / moving walkways
        Q7F_SCREENS|INTEGER|Information on screens and monitors
        Q7G_INFOARR|INTEGER|Information booth near arrivals area
        Q7H_INFODEP|INTEGER|Information booth near departure areas
        Q7I_WIFI|INTEGER|Airport WiFi
        Q7J_ROAD|INTEGER|Signs and directions on SFO airport roadways
        Q7K_PARK|INTEGER|Airport parking facilities
        Q7L_AIRTRAIN|INTEGER|AirTrain
        Q7M_LTPARK|INTEGER|Long term parking lot shuttle
        Q7N_RENTAL|INTEGER|Airport rental car center
        Q7O_WHOLE|INTEGER|SFO Airport as a whole

        The possible values for the above are:

        0 = no answer, 1 = Unacceptable, 2 = Below Average, 3 = Average, 4 = Good, 5 = Outstanding, 6 = Not visited or not applicable

        Select only the fields you are interested in.
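One way to express this selection is sketched below. The column names come from the schema above; the names `RATING_COLS`, `OVERALL_COL`, and `select_ratings` are hypothetical, and `df` is assumed to be the Spark DataFrame holding the survey.

```python
# The 14 individual Q7 ratings, in schema order, plus the overall rating.
RATING_COLS = [
    "Q7A_ART", "Q7B_FOOD", "Q7C_SHOPS", "Q7D_SIGNS", "Q7E_WALK",
    "Q7F_SCREENS", "Q7G_INFOARR", "Q7H_INFODEP", "Q7I_WIFI", "Q7J_ROAD",
    "Q7K_PARK", "Q7L_AIRTRAIN", "Q7M_LTPARK", "Q7N_RENTAL",
]
OVERALL_COL = "Q7O_WHOLE"


def select_ratings(df):
    """Keep only the Q7 rating columns of a Spark DataFrame (hypothetical helper)."""
    return df.select(*RATING_COLS, OVERALL_COL)
```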

          Let's get some basic statistics such as looking at the average of each column.

          "'missingValues(Q7A_ART) Q7A_ART', 'missingValues(Q7B_FOOD) Q7B_FOOD', 'missingValues(Q7C_SHOPS) Q7C_SHOPS', 'missingValues(Q7D_SIGNS) Q7D_SIGNS', 'missingValues(Q7E_WALK) Q7E_WALK', 'missingValues(Q7F_SCREENS) Q7F_SCREENS', 'missingValues(Q7G_INFOARR) Q7G_INFOARR', 'missingValues(Q7H_INFODEP) Q7H_INFODEP', 'missingValues(Q7I_WIFI) Q7I_WIFI', 'missingValues(Q7J_ROAD) Q7J_ROAD', 'missingValues(Q7K_PARK) Q7K_PARK', 'missingValues(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'missingValues(Q7M_LTPARK) Q7M_LTPARK', 'missingValues(Q7N_RENTAL) Q7N_RENTAL', 'missingValues(Q7O_WHOLE) Q7O_WHOLE'"

          Let's start with the overall rating.

          [Row(Q7O_WHOLE=3.8776130748764728)]

          The overall rating is only about 3.88, so slightly above average. Let's get the averages of the constituent ratings:
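The per-column averages could be computed by generating one SQL expression per column, as in this sketch. The helper name `avg_expressions` is hypothetical, and the commented call assumes a Spark DataFrame named `df`. Note that raw averages include the 0 and 6 codes unless those are filtered or replaced first.

```python
def avg_expressions(columns):
    """Build one Spark SQL expression per column: avg(col) AS col."""
    return [f"avg({c}) AS {c}" for c in columns]


exprs = avg_expressions(["Q7A_ART", "Q7B_FOOD"])
print(exprs)

# With a Spark DataFrame `df` (assumed), the expressions can be passed to
# selectExpr to get a single row of averages:
# df.selectExpr(*avg_expressions(columns))
```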

          Looking at the averages above, the overall average rating of the airport is actually lower than all of the individual ratings. This is clearer in the bar chart visualization below.

            [Bar chart: rating per question on a 0 to 5.5 axis, for Q7A_ART, Q7B_FOOD, Q7C_SHOPS, Q7D_SIGNS, Q7G_INFOARR, Q7E_WALK, Q7F_SCREENS, Q7N_RENTAL, Q7H_INFODEP, Q7I_WIFI, Q7J_ROAD, Q7K_PARK, Q7L_AIRTRAIN, Q7M_LTPARK; 2,631 rows]

            So basic statistics can't seem to answer the question: What factors drove the customer to give the overall rating?

            So let's try to use a predictive algorithm to see if these individual ratings can be used to predict an overall rating.

            Create Model

            Here, use a decision tree algorithm to create a predictive model.

            First, you need to treat responses of 0 = No Answer and 6 = Not Visited or Not Applicable as missing values. One way to do this is a technique called mean imputation, which uses the mean of the column as a replacement for the missing value. You can use a replace function to set all values of 0 or 6 to the average rating of 3. You also need a label column of type double, so create that as well.
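The replacement rule can be illustrated in plain Python, as a sketch on a list of ratings instead of a DataFrame. The names `MISSING_CODES`, `FILL_VALUE`, `impute_ratings`, and `to_label` are hypothetical. In Spark, `DataFrame.replace` with a `subset` of columns should be able to apply the same substitution.

```python
MISSING_CODES = {0, 6}  # 0 = no answer, 6 = not visited or not applicable
FILL_VALUE = 3          # the "Average" rating, used as the replacement


def impute_ratings(ratings):
    """Replace the missing-value codes with the fill value."""
    return [FILL_VALUE if r in MISSING_CODES else r for r in ratings]


def to_label(overall_rating):
    """Convert the integer overall rating into a double label."""
    return float(overall_rating)


print(impute_ratings([0, 4, 6, 5, 1]))  # [3, 4, 3, 5, 1]
print(to_label(4))                      # 4.0
```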

                Define the machine learning pipeline.
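A possible shape for this pipeline is sketched below. It assumes the model is a `DecisionTreeRegressor` (a regressor fits the RMSE metric used later, but the notebook's actual stages are not shown) and that the label column is named `label`. The helper `build_pipeline` is hypothetical.

```python
def build_pipeline(feature_cols, label_col="label"):
    """Assemble the rating columns into a vector and attach a decision tree.

    Must be called where a Spark session is active, because the pyspark.ml
    stages are backed by JVM objects.
    """
    # Imported here so the definition itself does not require Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import DecisionTreeRegressor

    # Combine the individual rating columns into one "features" vector.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    # Regress the overall rating on that vector (assumed model choice).
    tree = DecisionTreeRegressor(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, tree])
```

Keeping the assembler and the tree in one pipeline means the same feature preparation is applied at training and prediction time.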

                Now call fit to build the model.


                  Now call display to view the tree.

                    [Decision tree visualization: each internal node splits on a feature index at a threshold of 1.5, 2.5, 3.5, or 4.5; each leaf holds a predicted rating]

                    Call transform to view the predictions.
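The fit and transform steps can be sketched together as follows. The helper `fit_and_predict` and the names `train_df` and `test_df` are hypothetical; `pipeline` is assumed to be a Spark ML `Pipeline`.

```python
def fit_and_predict(pipeline, train_df, test_df):
    """Train on one DataFrame and score another."""
    # fit() runs every stage on the training data and returns a fitted model.
    model = pipeline.fit(train_df)
    # transform() appends the model's output column (named "prediction"
    # by default in Spark ML) to the rows of test_df.
    predictions = model.transform(test_df)
    return model, predictions
```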

                    Evaluate the model

                    Validate the model using root mean squared error (RMSE).

                    0.5470732399238778
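For reference, RMSE is the square root of the mean squared difference between predictions and labels. The sketch below computes it in plain Python on made-up numbers, not the survey data. In Spark ML, `RegressionEvaluator` with `metricName="rmse"` computes the same metric on a predictions DataFrame.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: one of three predictions is off by 1.
print(rmse([4.0, 5.0, 3.0], [4.0, 4.0, 3.0]))
```

An RMSE of about 0.55 on a 1 to 5 scale means a typical prediction is off by roughly half a rating point.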

                    Save the model.

                    Feature Importance

                    Now let's look at the feature importances. The variable is shown below.

                      SparseVector(14, {0: 0.0546, 1: 0.1245, 3: 0.5302, 5: 0.2296, 7: 0.0109, 8: 0.0046, 9: 0.0119, 13: 0.0337})

                      Feature importance is a measure of information gain. It is scaled from 0.0 to 1.0. As an example, feature 1 in the output above is rated as 0.1245, or 12.45% of the total importance for all the features.

                      Let's map the features to their proper names to make them easier to read.
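A sketch of that mapping in plain Python, using the rounded values from the SparseVector output. It assumes the features were assembled in schema order, Q7A_ART through Q7N_RENTAL. The names `FEATURE_NAMES` and `name_importances` are hypothetical.

```python
# Assumed feature order: the 14 individual ratings in schema order.
FEATURE_NAMES = [
    "Q7A_ART", "Q7B_FOOD", "Q7C_SHOPS", "Q7D_SIGNS", "Q7E_WALK",
    "Q7F_SCREENS", "Q7G_INFOARR", "Q7H_INFODEP", "Q7I_WIFI", "Q7J_ROAD",
    "Q7K_PARK", "Q7L_AIRTRAIN", "Q7M_LTPARK", "Q7N_RENTAL",
]

# Non-zero entries of the SparseVector (index -> importance).
importances = {0: 0.0546, 1: 0.1245, 3: 0.5302, 5: 0.2296,
               7: 0.0109, 8: 0.0046, 9: 0.0119, 13: 0.0337}


def name_importances(names, sparse_values):
    """Pair each name with its importance (0.0 if absent), largest first."""
    pairs = [(name, sparse_values.get(i, 0.0)) for i, name in enumerate(names)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = name_importances(FEATURE_NAMES, importances)
for name, value in ranked[:3]:
    print(name, value)
# Q7D_SIGNS 0.5302
# Q7F_SCREENS 0.2296
# Q7B_FOOD 0.1245
```

Printing a bare `zip` object shows only its memory address, so the sketch materializes the pairs into a list.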

                        <zip at 0x7f15841111c0>

                        Let's convert this to a DataFrame so you can view it and save it so other users can rely on this information.

                              As you can see below, the 3 most important features are:

                              1. Signs
                              2. Screens
                              3. Food

                              This is useful information for the airport management. It means that people want to first know where they are going. Second, they check the airport screens and monitors so they can find their gate and be on time for their flight. Third, they like to have good quality food.

                              This is especially interesting considering that taking the average of these feature variables told us nothing about the importance of the variables in determining the overall rating by the survey responder.

                              These 3 features combine to make up about 88% of the total feature importance.

                                [Row(sum(Importance)=0.884304077071294)]
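The sum above can be checked by hand from the rounded importances in the SparseVector output. This is a small arithmetic sketch; the variable names are hypothetical.

```python
# Rounded importances of the three top-ranked features.
top_three = {"Q7D_SIGNS": 0.5302, "Q7F_SCREENS": 0.2296, "Q7B_FOOD": 0.1245}

share = sum(top_three.values())
print(f"{share:.1%}")  # 88.4%
```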

                                You can also show this graphically.

                                  [Chart: share of importance by Feature. Q7D_SIGNS 53%, Q7F_SCREENS 23%, Q7B_FOOD 12%, Q7A_ART 5%, Q7N_RENTAL 3%, Q7J_ROAD 1%, Q7H_INFODEP 1%; Q7I_WIFI, Q7K_PARK, Q7M_LTPARK, Q7L_AIRTRAIN, Q7E_WALK, Q7C_SHOPS, Q7G_INFOARR 0% each; 14 rows]

                                    [Bar chart: Importance (axis 0.00 to 0.50) by Feature for Q7D_SIGNS, Q7F_SCREENS, Q7B_FOOD, Q7A_ART, Q7N_RENTAL; 5 rows]

                                    So if you run SFO, artwork and shopping are nice-to-haves but signs, monitors, and food are what keep airport customers happy!
