Quickstart-Scala: H3 NYC Taxi LaGuardia to Manhattan (Scala)


NYC Taxi Trips between LaGuardia and Manhattan's Financial District (Scala)

This notebook demonstrates the Databricks built-in H3 API, available in DBR 11.2+ with Scala, Python, and SQL bindings on Spark clusters.

This example persists tables in database nyc_taxi under catalog geospatial_docs.

This notebook uses the following public datasets:

  1. NYC Yellow Taxi pickup and dropoff data from Databricks public datasets, available under DBFS at /databricks-datasets/nyctaxi/tables/nyctaxi_yellow
  2. Zip code boundaries from NYC Data, used in the analysis

NOTE: Run the setup_uc_nyc_taxi_trips.py notebook once to set up the taxi tables within Unity Catalog (UC). If you are not using UC, you will have to adjust how you reference the tables.

Setup

Import Databricks functions to get H3; also, import spark functions.

import com.databricks.sql.functions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

lazy val formatter = java.text.NumberFormat.getIntegerInstance
formatter: java.text.NumberFormat = <lazy>

This example shows catalog geospatial_docs within Unity Catalog; please adjust to your preferred catalog.

The example assumes a usage grant on the catalog and usage and create grants within the database; more at https://www.databricks.com/product/unity-catalog

spark.catalog.setCurrentCatalog("geospatial_docs")
spark.catalog.setCurrentDatabase("nyc_taxi")
// display(spark.catalog.listTables)

dfYellowTrip DataFrame

There are 1.6B rows in this data (see note above for setup).

lazy val dfYellowTrip = spark.table("yellow_trip")
println(f"count? ${formatter.format(dfYellowTrip.count)}")
// println(dfYellowTrip.printSchema)
count? 1,611,611,035 dfYellowTrip: org.apache.spark.sql.DataFrame = <lazy>
display(dfYellowTrip)

| vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code_id |
|---|---|---|---|---|---|---|---|
| VTS | 2009-11-19T10:40:00.000+0000 | 2009-11-19T10:49:00.000+0000 | 1 | 1.64 | -73.988087 | 40.731822 | null |
| VTS | 2009-11-20T11:33:00.000+0000 | 2009-11-20T11:59:00.000+0000 | 3 | 3 | -73.987065 | 40.724667 | null |
| VTS | 2009-11-20T22:04:00.000+0000 | 2009-11-20T22:14:00.000+0000 | 1 | 1.63 | -73.983477 | 40.726132 | null |
| VTS | 2009-11-21T02:12:00.000+0000 | 2009-11-21T02:17:00.000+0000 | 4 | 0.76 | -73.98197 | 40.728307 | null |
| VTS | 2009-11-23T10:19:00.000+0000 | 2009-11-23T10:33:00.000+0000 | 1 | 2.42 | -73.982778 | 40.735485 | null |
| VTS | 2009-11-21T12:26:00.000+0000 | 2009-11-21T12:41:00.000+0000 | 1 | 4.94 | -73.985323 | 40.731768 | null |
| VTS | 2009-11-20T01:29:00.000+0000 | 2009-11-20T01:39:00.000+0000 | 1 | 2.25 | -73.987977 | 40.728437 | null |
| VTS | 2009-11-21T02:29:00.000+0000 | 2009-11-21T02:39:00.000+0000 | 1 | 4.47 | -73.98878 | 40.722638 | null |
| VTS | 2009-11-20T19:44:00.000+0000 | 2009-11-20T19:53:00.000+0000 | 1 | 4.28 | -73.987592 | 40.732983 | null |
| VTS | 2009-11-21T04:52:00.000+0000 | 2009-11-21T05:05:00.000+0000 | 1 | 3.04 | -73.986622 | 40.69995 | null |
| VTS | 2009-11-17T14:13:00.000+0000 | 2009-11-17T14:23:00.000+0000 | 5 | 1.12 | -73.985165 | 40.728428 | null |
| VTS | 2009-11-18T09:19:00.000+0000 | 2009-11-18T09:29:00.000+0000 | 1 | 1.31 | -73.986143 | 40.722267 | null |
| VTS | 2009-11-18T21:49:00.000+0000 | 2009-11-18T21:59:00.000+0000 | 1 | 2.23 | -73.987738 | 40.729457 | null |
| VTS | 2009-11-19T17:50:00.000+0000 | 2009-11-19T18:27:00.000+0000 | 1 | 6.57 | -73.9807 | 40.720483 | null |
| VTS | 2009-11-18T22:10:00.000+0000 | 2009-11-18T22:28:00.000+0000 | 1 | 4.28 | -73.97901 | 40.723945 | null |
| VTS | 2009-11-21T12:55:00.000+0000 | 2009-11-21T13:01:00.000+0000 | 2 | 1.18 | -73.980677 | 40.72572 | null |
| VTS | 2009-11-19T01:13:00.000+0000 | 2009-11-19T01:19:00.000+0000 | 1 | 1.49 | -73.9881 | 40.732112 | null |

1,000 rows | Truncated data

dfNycZipcode DataFrame

The yellow_trip table tells us place, time, passenger count, and cost, but we also want to know the zip codes of pickups and dropoffs. Verify the zip code data is set up (see the instructions at the start of this notebook).

lazy val dfNycZipcode = spark.table("nyc_zipcode")
println(f"count? ${formatter.format(dfNycZipcode.count)}")
// println(dfNycZipcode.printSchema)
count? 263 dfNycZipcode: org.apache.spark.sql.DataFrame = <lazy>
display(dfNycZipcode)

| ZIPCODE | BLDGZIP | PO_NAME | POPULATION | AREA | STATE | COUNTY | ST_FIPS | CTY_FIPS | URL | SHAPE_AREA |
|---|---|---|---|---|---|---|---|---|---|---|
| 11214 | 0 | Brooklyn | 89061 | 61096539.3891465 | NY | Kings | 36 | 047 | http://www.usps.com/ | 0 |
| 11691 | 0 | Far Rockaway | 60267 | 83927815.1123475 | NY | Queens | 36 | 081 | http://www.usps.com/ | 0 |
| 11096 | 0 | Inwood | 147 | 111973.73396184 | NY | Queens | 36 | 081 | http://www.usps.com/ | 0 |
| 11223 | 0 | Brooklyn | 79864 | 58702923.4662996 | NY | Kings | 36 | 047 | http://www.usps.com/ | 0 |
| 11693 | 0 | Far Rockaway | 11052 | 1527599.16446935 | NY | Queens | 36 | 081 | http://www.usps.com/ | 0 |
| 11692 | 0 | Arverne | 18018 | 23992395.3811918 | NY | Queens | 36 | 081 | http://www.usps.com/ | 0 |
| 11235 | 0 | Brooklyn | 76130 | 69053803.0317134 | NY | Kings | 36 | 047 | http://www.usps.com/ | 0 |
| 11693 | 0 | Far Rockaway | 11052 | 12270916.5305379 | NY | Queens | 36 | 081 | http://www.usps.com/ | 0 |
| 10306 | 0 | Staten Island | 55602 | 174102872.04866 | NY | Richmond | 36 | 085 | http://www.usps.com/ | 0 |
| 11694 | 0 | Rockaway Park | 20163 | 48101606.6144704 | NY | Queens | 36 | 081 | http://www.usps.com/ | 0 |
| 11224 | 0 | Brooklyn | 49600 | 46701983.7536391 | NY | Kings | 36 | 047 | http://www.usps.com/ | 0 |
| 10471 | 0 | Bronx | 23477 | 89651407.0583499 | NY | Bronx | 36 | 005 | http://www.usps.com/ | 0 |
| 10470 | 0 | Bronx | 14740 | 21543461.581388 | NY | Bronx | 36 | 005 | http://www.usps.com/ | 0 |
| 10466 | 0 | Bronx | 68942 | 55262490.6124597 | NY | Bronx | 36 | 005 | http://www.usps.com/ | 0 |

17 rows | Truncated data

Generate H3 Tables

While these could be defined as views, we persist them as tables for a one-time setup cost.

dfNycZipcodeH3_12 -- Write as Table

Uses the h3_polyfillash3 function to fill each zip code polygon with H3 cells at resolution 12.

dfNycZipcode
  .select(
    explode(h3_polyfillash3($"geom_wkt", lit(12))).alias("cell"),
    $"zipcode", $"po_name", $"county"
  )
  .write
    .mode("ignore") // <-- "overwrite" to replace, "ignore" to skip if exists
    .option("mergeSchema", "true")
  .saveAsTable("nyc_zipcode_h3_12")

// sql("optimize nyc_zipcode_h3_12 zorder by (cell)") // <-- optional: z-order by 'cell'

lazy val dfNycZipcodeH3_12 = spark.table("nyc_zipcode_h3_12")
display(dfNycZipcodeH3_12)

| cell | zipcode | po_name | county |
|---|---|---|---|
| 631243922750381000 | 10170 | New York | New York |
| 631243922750317000 | 10170 | New York | New York |
| 631243922750317600 | 10170 | New York | New York |
| 631243922750318100 | 10170 | New York | New York |
| 631243922750310400 | 10170 | New York | New York |
| 631243922750318600 | 10170 | New York | New York |
| 631243922750423600 | 10170 | New York | New York |
| 631243922750319600 | 10170 | New York | New York |
| 631243922750424600 | 10170 | New York | New York |
| 631243922750425600 | 10170 | New York | New York |
| 631243922750426100 | 10170 | New York | New York |
| 631243922750426600 | 10170 | New York | New York |
| 631243922750322200 | 10170 | New York | New York |
| 631243922750379000 | 10170 | New York | New York |
| 631243922750322700 | 10170 | New York | New York |
| 631243922750395400 | 10170 | New York | New York |
| 631243922750380500 | 10170 | New York | New York |

1,000 rows | Truncated data
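h3_polyfillash3 returns the IDs of all resolution-12 cells covering a polygon. As a rough illustration of the idea (not the H3 algorithm), here is a pure-Python sketch on a square grid; the polyfill_rect helper is hypothetical and only covers axis-aligned rectangles:

```python
# Square-grid stand-in for polygon fill: collect the cells whose centers
# fall inside an axis-aligned rectangle. Real H3 polyfill does the same
# kind of coverage test with hexagonal cells and arbitrary polygons.

def polyfill_rect(min_x, min_y, max_x, max_y, cell_size):
    """Return (i, j) grid-cell indices whose centers lie in the rectangle."""
    cells = []
    i = int(min_x // cell_size)
    while i * cell_size <= max_x:
        j = int(min_y // cell_size)
        while j * cell_size <= max_y:
            cx, cy = (i + 0.5) * cell_size, (j + 0.5) * cell_size
            if min_x <= cx <= max_x and min_y <= cy <= max_y:
                cells.append((i, j))
            j += 1
        i += 1
    return cells

# a 2x2-unit rectangle at a 1-unit cell size covers 4 cell centers
print(len(polyfill_rect(0.0, 0.0, 2.0, 2.0, 1.0)))  # → 4
```

A finer cell size (higher resolution) covers the same polygon with more, smaller cells, which is exactly the resolution trade-off the lit(12) argument controls above.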

dfYellowTripH3_12 -- Write as Table

Uses h3_longlatash3 to identify the resolution 12 cell containing each pickup and dropoff point. We recommend running this on an autoscaling cluster with ~2-8 nodes.

dfYellowTrip
  .select(
    h3_longlatash3($"pickup_longitude", $"pickup_latitude", lit(12)).as("pickup_cell"),
    h3_longlatash3($"dropoff_longitude", $"dropoff_latitude", lit(12)).as("dropoff_cell"),
    $"*"
  )
  .drop(
    "rate_code_id",
    "store_and_fwd_flag"
  )
  .write
    .mode("ignore") // <-- "overwrite" to replace, "ignore" to skip if exists
    .option("mergeSchema", "true")
  .saveAsTable("yellow_trip_h3_12")

// sql("optimize yellow_trip_h3_12 zorder by (pickup_cell)") // <-- optional: z-order by 'pickup_cell'

lazy val dfYellowTripH3_12 = spark.table("yellow_trip_h3_12")
dfYellowTripH3_12: org.apache.spark.sql.DataFrame = <lazy>
display(dfYellowTripH3_12.filter($"pickup_cell".isNotNull))

| pickup_cell | dropoff_cell | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | payment_type |
|---|---|---|---|---|---|---|---|
| 631243922710233600 | 631243921534609400 | VTS | 2009-04-29T01:26:00.000+0000 | 2009-04-29T01:38:00.000+0000 | 1 | 5.7 | Credit |
| 631243922708223000 | 631243922752071200 | VTS | 2009-04-18T22:46:00.000+0000 | 2009-04-18T23:30:00.000+0000 | 5 | 6.62 | Credit |
| 631243922709959200 | 631243922751907300 | VTS | 2009-04-06T07:33:00.000+0000 | 2009-04-06T07:49:00.000+0000 | 1 | 3.72 | CASH |
| 631243922710227500 | 631243949902408700 | VTS | 2009-04-25T07:44:00.000+0000 | 2009-04-25T07:55:00.000+0000 | 5 | 3.53 | CASH |
| 631243922708280300 | 631243922649177100 | VTS | 2009-04-25T04:14:00.000+0000 | 2009-04-25T04:19:00.000+0000 | 3 | 1.29 | CASH |
| 631243922708227600 | 631243951195053600 | VTS | 2009-04-03T01:32:00.000+0000 | 2009-04-03T01:57:00.000+0000 | 1 | 5.34 | CASH |
| 631243922708763600 | 631243951175787500 | VTS | 2009-04-08T00:44:00.000+0000 | 2009-04-08T00:58:00.000+0000 | 5 | 6.72 | CASH |
| 631243922708223500 | 631243922890304500 | DDS | 2009-04-16T23:44:42.000+0000 | 2009-04-16T23:51:08.000+0000 | 1 | 1 | CASH |
| 631243922708274700 | 631243922650848300 | CMT | 2009-04-16T00:06:48.000+0000 | 2009-04-16T00:11:11.000+0000 | 1 | 1.5 | Cash |
| 631243922708182000 | 631243949905357300 | CMT | 2009-04-19T00:11:11.000+0000 | 2009-04-19T00:31:24.000+0000 | 3 | 6.1 | Cash |
| 631243922708205000 | 631243949787778000 | CMT | 2009-04-11T02:53:27.000+0000 | 2009-04-11T03:10:57.000+0000 | 1 | 4.8 | Cash |
| 631243922708673500 | 631243922850475500 | CMT | 2009-04-14T23:54:49.000+0000 | 2009-04-15T00:07:50.000+0000 | 2 | 2.9 | Credit |
| 631243922708272600 | 631243949967776300 | CMT | 2009-04-19T21:26:18.000+0000 | 2009-04-19T21:41:47.000+0000 | 1 | 3.9 | Cash |
| 631243922708810200 | 631243949971284500 | CMT | 2009-04-19T01:41:47.000+0000 | 2009-04-19T01:50:06.000+0000 | 1 | 2.5 | Credit |
| 631243922708289000 | 631243922891279400 | CMT | 2009-04-10T20:14:19.000+0000 | 2009-04-10T20:23:13.000+0000 | 1 | 1.1 | Cash |
| 631243922708675600 | 631243922702490600 | CMT | 2009-04-15T01:43:45.000+0000 | 2009-04-15T01:52:35.000+0000 | 2 | 2.6 | Cash |
| 631243922708810200 | 631243951190789100 | CMT | 2009-04-15T01:25:10.000+0000 | 2009-04-15T01:39:59.000+0000 | 1 | 8 | Credit |

1,000 rows | Truncated data
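h3_longlatash3 maps each (longitude, latitude) pair to the integer ID of the cell containing it. The core idea of bucketing continuous coordinates into discrete cell IDs can be sketched on a plain square grid in pure Python; the longlat_to_cell helper below is hypothetical, a stand-in for H3's hexagonal indexing, not its algorithm:

```python
import math

# Square-grid stand-in for h3_longlatash3: bucket a continuous coordinate
# pair into a single integer cell ID at a given "resolution" (cells per
# degree). Real H3 uses a hexagonal hierarchy on an icosahedron instead.

def longlat_to_cell(lon, lat, cells_per_degree=1000):
    col = math.floor((lon + 180.0) * cells_per_degree)
    row = math.floor((lat + 90.0) * cells_per_degree)
    return row * (360 * cells_per_degree) + col  # flatten to one integer ID

# nearby points share a cell; distant points do not
a = longlat_to_cell(-73.988087, 40.731822)
b = longlat_to_cell(-73.988090, 40.731825)  # a few cm away
c = longlat_to_cell(-74.010000, 40.710000)  # another neighborhood
print(a == b, a == c)  # → True False
```

Once every point carries a cell ID like this, proximity questions become cheap integer comparisons, which is what makes the joins below fast.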

Analyze Trips between LaGuardia (LGA) and Manhattan's Financial District Zip Codes

LGA lies entirely within zip code 11371; the Financial District zip codes are 10004, 10005, 10006, 10007, and 10038.
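The analysis below matches trips to zip codes by plain equality of cell IDs: a trip endpoint lands in a zip code exactly when its cell appears in that zip code's polyfill. A toy pure-Python sketch of this cell-equality join (made-up cell IDs, not real H3 values):

```python
# Toy illustration of a cell-equality spatial join: a point matches a zone
# when its grid cell ID appears in the zone's set of covering cells.
# Cell IDs here are invented; on Databricks they come from h3_longlatash3
# (points) and h3_polyfillash3 (zone polygons).

zipcode_cells = {
    "11371": {101, 102, 103},  # cells covering LGA's zip code
    "10004": {201, 202},       # cells covering a Financial District zip
}

# invert to a cell -> zipcode lookup (cells are disjoint across zip codes)
cell_to_zip = {cell: z for z, cells in zipcode_cells.items() for cell in cells}

trips = [
    {"pickup_cell": 102, "dropoff_cell": 202},
    {"pickup_cell": 999, "dropoff_cell": 201},  # pickup outside any zone
]

for t in trips:
    t["pickup_zip"] = cell_to_zip.get(t["pickup_cell"])
    t["dropoff_zip"] = cell_to_zip.get(t["dropoff_cell"])

print(trips[0]["pickup_zip"], trips[0]["dropoff_zip"])  # → 11371 10004
```

This is the same logic as the inner joins of dfYellowTripH3_12 against dfNycZipcodeH3_12 below, just on two rows instead of 1.6B.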

dfLgaPickupH3_12

All 25M pickups from the LGA zip code. You can persist this as a table to speed up subsequent queries.

lazy val dfLgaPickupH3_12 = dfYellowTripH3_12
    .join(dfNycZipcodeH3_12.filter("zipcode = '11371'"), $"pickup_cell" === $"cell", "inner")

println(f"count? ${formatter.format(dfLgaPickupH3_12.count)}")
// println(dfLgaPickupH3_12.printSchema)
count? 25,250,015 dfLgaPickupH3_12: org.apache.spark.sql.DataFrame = <lazy>
display(dfLgaPickupH3_12)

| pickup_cell | dropoff_cell | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | payment_type |
|---|---|---|---|---|---|---|---|
| 631243923274357200 | 631243922689974800 | CMT | 2012-07-16T12:34:30.000+0000 | 2012-07-16T12:51:37.000+0000 | 1 | 8.2 | CRD |
| 631243923274357200 | 631243922702096900 | VTS | 2012-07-16T21:14:00.000+0000 | 2012-07-16T21:31:00.000+0000 | 1 | 9.75 | CSH |
| 631243923274349000 | 631243922682342900 | VTS | 2012-07-16T17:21:00.000+0000 | 2012-07-16T17:59:00.000+0000 | 2 | 8.65 | CRD |
| 631243923274357200 | 631243921466988000 | CMT | 2012-07-16T19:06:00.000+0000 | 2012-07-16T19:20:32.000+0000 | 1 | 7.9 | CRD |
| 631243923270849500 | 631243922748122600 | VTS | 2012-07-16T19:26:00.000+0000 | 2012-07-16T19:45:00.000+0000 | 2 | 10.78 | CRD |
| 631243923274355200 | 631243949854745600 | VTS | 2012-07-16T19:49:00.000+0000 | 2012-07-16T20:10:00.000+0000 | 1 | 9.03 | CSH |
| 631243923270576100 | 631243922750606300 | CMT | 2012-07-16T19:49:01.000+0000 | 2012-07-16T20:02:02.000+0000 | 1 | 8.1 | CRD |
| 631243923274357200 | 631243922753099300 | VTS | 2012-07-16T13:41:00.000+0000 | 2012-07-16T14:30:00.000+0000 | 1 | 10.76 | CRD |
| 631243923274357200 | 631243949906247200 | VTS | 2012-07-16T20:05:00.000+0000 | 2012-07-16T20:41:00.000+0000 | 3 | 13.41 | CSH |
| 631243923274356200 | 631243949902011900 | VTS | 2012-07-17T10:33:00.000+0000 | 2012-07-17T11:08:00.000+0000 | 5 | 10.31 | CSH |
| 631243923274357200 | 631243922825833500 | CMT | 2012-07-16T12:37:15.000+0000 | 2012-07-16T12:56:42.000+0000 | 1 | 9.1 | CRD |
| 631243923274357200 | 631243921440408000 | VTS | 2012-07-17T00:50:00.000+0000 | 2012-07-17T01:09:00.000+0000 | 1 | 8.88 | CRD |
| 631243923274356200 | 631243922754569700 | VTS | 2012-07-16T20:34:00.000+0000 | 2012-07-16T20:53:00.000+0000 | 1 | 8.89 | CSH |
| 631243923270834700 | 631243922750918100 | VTS | 2012-07-16T20:15:00.000+0000 | 2012-07-16T20:34:00.000+0000 | 4 | 7.38 | CRD |
| 631243923270841300 | 631243922923005400 | VTS | 2012-07-17T08:55:00.000+0000 | 2012-07-17T09:03:00.000+0000 | 6 | 3.59 | CSH |
| 631243923274355200 | 631243935103638500 | CMT | 2012-07-16T23:33:22.000+0000 | 2012-07-16T23:50:05.000+0000 | 1 | 9.9 | CSH |
| 631243923270851600 | 631243949903370200 | VTS | 2012-07-16T22:48:00.000+0000 | 2012-07-16T23:07:00.000+0000 | 1 | 10.99 | CRD |

1,000 rows | Truncated data

dfFdDropoffH3_12

All 34M dropoffs in Manhattan's Financial District.

lazy val dfFdDropoffH3_12 = dfYellowTripH3_12
    .join(dfNycZipcodeH3_12.filter($"zipcode".isin("10004", "10005", "10006", "10007", "10038")), $"dropoff_cell" === $"cell", "inner")

println(f"count? ${formatter.format(dfFdDropoffH3_12.count)}")
// println(dfFdDropoffH3_12.printSchema)
count? 34,306,823 dfFdDropoffH3_12: org.apache.spark.sql.DataFrame = <lazy>
display(dfFdDropoffH3_12)

| pickup_cell | dropoff_cell | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | payment_type |
|---|---|---|---|---|---|---|---|
| 631243922710227500 | 631243949902408700 | VTS | 2009-04-25T07:44:00.000+0000 | 2009-04-25T07:55:00.000+0000 | 5 | 3.53 | CASH |
| 631243922708182000 | 631243949905357300 | CMT | 2009-04-19T00:11:11.000+0000 | 2009-04-19T00:31:24.000+0000 | 3 | 6.1 | Cash |
| 631243922708205000 | 631243949966991900 | CMT | 2009-04-04T22:20:44.000+0000 | 2009-04-04T22:39:22.000+0000 | 2 | 4.2 | Cash |
| 631243922708270100 | 631243949901952500 | CMT | 2009-04-21T00:45:27.000+0000 | 2009-04-21T01:03:32.000+0000 | 2 | 4.1 | Dispute |
| 631243922708280300 | 631243949902491100 | CMT | 2009-04-29T01:31:09.000+0000 | 2009-04-29T01:43:34.000+0000 | 1 | 4.2 | Cash |
| 631243922710248000 | 631243949904313900 | CMT | 2009-04-01T07:55:03.000+0000 | 2009-04-01T07:59:25.000+0000 | 1 | 3 | Credit |
| 631243922708291100 | 631243949902489600 | DDS | 2009-04-16T23:28:20.000+0000 | 2009-04-16T23:53:27.000+0000 | 1 | 2.6 | CREDIT |
| 631243922708159500 | 631243949902192100 | CMT | 2009-04-04T00:15:34.000+0000 | 2009-04-04T00:33:02.000+0000 | 2 | 4.2 | Cash |
| 631243922709962200 | 631243949902660100 | VTS | 2009-04-16T09:39:00.000+0000 | 2009-04-16T09:47:00.000+0000 | 1 | 3.17 | CASH |
| 631243922708675600 | 631243949966952400 | VTS | 2009-04-30T19:53:00.000+0000 | 2009-04-30T20:17:00.000+0000 | 1 | 3.93 | CASH |
| 631243922708268500 | 631243949905384000 | DDS | 2009-08-28T08:05:38.000+0000 | 2009-08-28T08:30:37.000+0000 | 1 | 4.6 | CASH |
| 631243922709993000 | 631243949894735400 | DDS | 2009-08-27T08:03:12.000+0000 | 2009-08-27T08:09:30.000+0000 | 1 | 3.4 | CASH |
| 631243922708546600 | 631243949902050300 | VTS | 2009-08-23T23:54:00.000+0000 | 2009-08-24T00:14:00.000+0000 | 3 | 5.84 | CASH |
| 631243922708224000 | 631243949906237400 | CMT | 2009-08-23T03:23:57.000+0000 | 2009-08-23T03:45:26.000+0000 | 2 | 4.5 | Cash |
| 631243922708177900 | 631243949966892500 | CMT | 2009-08-02T02:18:33.000+0000 | 2009-08-02T02:29:32.000+0000 | 1 | 5.6 | Credit |
| 631243922708223500 | 631243949902029800 | CMT | 2009-08-19T03:43:31.000+0000 | 2009-08-19T03:56:32.000+0000 | 1 | 5.2 | Cash |
| 631243922710412300 | 631243949905308200 | CMT | 2009-08-30T06:11:14.000+0000 | 2009-08-30T06:16:55.000+0000 | 1 | 3.5 | Credit |

1,000 rows | Truncated data

dfLgaFdDropoffH3_12

Filter dfFdDropoffH3_12 by the distinct pickup cells from dfLgaPickupH3_12 to get all 827K dropoffs in the Financial District that were picked up around LGA.

lazy val lgaPickupCells = for (r <- dfLgaPickupH3_12.select($"pickup_cell").distinct.collect) yield r.getLong(0)
println(f"len? ${formatter.format(lgaPickupCells.size)}")
println(lgaPickupCells.take(1))
len? 9,010 [J@653749a1 lgaPickupCells: Array[Long] = <lazy>
// lgaPickupCells
val dfLgaFdDropoffH3_12 = dfFdDropoffH3_12
    .filter($"pickup_cell".isin(lgaPickupCells:_*))

println(f"count? ${formatter.format(dfLgaFdDropoffH3_12.count)}")
// println(dfLgaFdDropoffH3_12.printSchema)
count? 827,797 dfLgaFdDropoffH3_12: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [pickup_cell: bigint, dropoff_cell: bigint ... 20 more fields]
display(dfLgaFdDropoffH3_12)

| pickup_cell | dropoff_cell | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | payment_type |
|---|---|---|---|---|---|---|---|
| 631243923274515500 | 631243949901990900 | VTS | 2011-04-28T22:29:00.000+0000 | 2011-04-28T23:01:00.000+0000 | 5 | 12.81 | CRD |
| 631243923274515500 | 631243949903223800 | VTS | 2011-04-27T15:11:00.000+0000 | 2011-04-27T15:37:00.000+0000 | 1 | 1.06 | CSH |
| 631243923274517000 | 631243949894571500 | CMT | 2011-04-26T12:03:50.000+0000 | 2011-04-26T12:34:42.000+0000 | 1 | 6.1 | CSH |
| 631243923274517500 | 631243949905300500 | VTS | 2011-04-07T08:29:00.000+0000 | 2011-04-07T09:14:00.000+0000 | 1 | 12.12 | CSH |
| 631243923274525700 | 631243949893708800 | VTS | 2011-04-06T00:32:00.000+0000 | 2011-04-06T01:06:00.000+0000 | 1 | 13.82 | CRD |
| 631243923274516000 | 631243949903028700 | VTS | 2011-04-11T07:57:00.000+0000 | 2011-04-11T08:48:00.000+0000 | 1 | 1.06 | CRD |
| 631243923274524200 | 631243949902213600 | CMT | 2011-04-10T18:14:41.000+0000 | 2011-04-10T18:43:36.000+0000 | 1 | 16.4 | CSH |
| 631243923274515500 | 631243949894543400 | VTS | 2011-04-14T23:52:00.000+0000 | 2011-04-15T00:23:00.000+0000 | 1 | 13.01 | CRD |
| 631243923274517500 | 631243949967567400 | VTS | 2011-04-14T21:56:00.000+0000 | 2011-04-14T22:20:00.000+0000 | 1 | 11.96 | CRD |
| 631243923274501600 | 631243949901953000 | VTS | 2011-04-16T14:23:00.000+0000 | 2011-04-16T14:46:00.000+0000 | 1 | 15.01 | CRD |
| 631243923274524200 | 631243949905300000 | CMT | 2011-04-12T11:05:00.000+0000 | 2011-04-12T11:43:06.000+0000 | 1 | 14.9 | CRD |
| 631243923274521100 | 631243949966950900 | CMT | 2011-04-17T20:17:15.000+0000 | 2011-04-17T20:40:41.000+0000 | 2 | 11.8 | CSH |
| 631243923274524200 | 631243949967420900 | VTS | 2011-04-21T08:58:00.000+0000 | 2011-04-21T09:55:00.000+0000 | 1 | 17.3 | CRD |
| 631243923274517500 | 631243949903099900 | CMT | 2010-12-27T13:37:01.000+0000 | 2010-12-27T14:04:01.000+0000 | 2 | 14.1 | CSH |
| 631243923274524200 | 631243949902216700 | CMT | 2010-12-05T10:51:53.000+0000 | 2010-12-05T11:16:38.000+0000 | 1 | 13.2 | CSH |
| 631243923274509800 | 631243949901710800 | VTS | 2010-12-05T18:49:00.000+0000 | 2010-12-05T19:17:00.000+0000 | 1 | 12.23 | CRD |
| 631243923274524200 | 631243949902324700 | VTS | 2010-12-06T21:04:00.000+0000 | 2010-12-06T21:25:00.000+0000 | 1 | 12.15 | CRD |

1,000 rows | Truncated data
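Collecting the distinct LGA pickup cells to the driver and filtering with isin amounts to a semi-join against a small key set. The same pattern in plain Python with toy values:

```python
# Semi-join via a collected key set: keep only dropoff rows whose pickup
# cell appears in the (small) set of LGA pickup cells. This mirrors
# .filter($"pickup_cell".isin(lgaPickupCells:_*)) with made-up cell IDs.

lga_pickup_cells = {101, 102, 103}  # distinct cells collected to the driver

fd_dropoffs = [
    {"pickup_cell": 102, "trip_distance": 9.7},
    {"pickup_cell": 555, "trip_distance": 2.1},  # not an LGA pickup
    {"pickup_cell": 103, "trip_distance": 11.2},
]

lga_fd = [r for r in fd_dropoffs if r["pickup_cell"] in lga_pickup_cells]
print(len(lga_fd))  # → 2
```

This works because the key set is small (~9K cells here); for a large key set you would keep the semi-join inside Spark instead of collecting to the driver.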

Dropoffs per Financial District Zip Code

display(
  dfLgaFdDropoffH3_12
    .groupBy("zipcode")
      .count()
    .orderBy("zipcode")
)

| zipcode | count |
|---|---|
| 10004 | 149686 |
| 10005 | 184103 |
| 10006 | 108879 |
| 10007 | 197813 |
| 10038 | 187316 |

5 rows

Here is a reduced-size screenshot of the bar chart rendered in the imported notebook.


Render Dropoffs with Python

Kepler needs string values for H3 cells, so we convert from the BIGINT representation used by Databricks with h3_h3tostring.
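Under the hood, the string form of an H3 cell is, to our understanding, just the lowercase hexadecimal rendering of its 64-bit integer ID. A pure-Python sketch of the round trip (the helper names here are ours, not a Databricks API):

```python
# An H3 cell ID is a 64-bit integer; its string form is the lowercase
# hexadecimal rendering of that integer (leading zeros dropped).

def cell_to_string(cell: int) -> str:
    return format(cell, "x")

def string_to_cell(s: str) -> int:
    return int(s, 16)

cell = 631243922750381000  # a resolution-12 cell from the tables above
s = cell_to_string(cell)
assert string_to_cell(s) == cell  # lossless round trip
print(s)
```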

Since this is a Scala notebook, we only need to install KeplerGL on the driver node.

%sh /databricks/python3/bin/pip3 install keplergl --quiet

Create a temp view from the Scala DataFrame dfLgaFdDropoffH3_12

This allows Python and other languages to access it.

dfLgaFdDropoffH3_12.createOrReplaceTempView("lga_fd_dropoff_h3_12")
%python
from pyspark.sql import functions as F
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *

df_dropoff = (
  spark
    .table("lga_fd_dropoff_h3_12")
      .selectExpr("zipcode", "h3_h3tostring(dropoff_cell) as dropoff_cell")
    .groupBy("zipcode","dropoff_cell")
      .count()
      .selectExpr("*","format_number(count,0) as count_disp")
    .orderBy("zipcode", F.desc("count"))
)
pdf_dropoff = df_dropoff.toPandas() # <-- convert to pandas for Kepler
print(f"count? {df_dropoff.count():,}")
display(df_dropoff.limit(5))
count? 5,608

| zipcode | dropoff_cell | count | count_disp |
|---|---|---|---|
| 10004 | 8c2a10728a963ff | 1836 | 1,836 |
| 10004 | 8c2a10728a905ff | 1682 | 1,682 |
| 10004 | 8c2a1072815cdff | 1575 | 1,575 |
| 10004 | 8c2a1072815c9ff | 1541 | 1,541 |
| 10004 | 8c2a10728a907ff | 1468 | 1,468 |

5 rows

Note: you can adjust the color scale; by default, red renders lower values and yellow renders higher values.

%python
from keplergl import KeplerGl

map_1 = KeplerGl(height=600, config={'mapState': {'latitude': 40.71, 'longitude': -74.01, 'zoom': 14}})
map_1.add_data(data=pdf_dropoff, name="LGA Financial District Dropoffs")
displayHTML(map_1._repr_html_().decode("utf-8"))
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter

Here is a reduced-size screenshot of what is rendered with Kepler in the imported notebook.
