setup_uc_nyc_taxi_trips.py(Python)

Loading...

Setup Data in Unity Catalog

This assumes / requires that you have usage grant on the target catalog (e.g. geospatial_docs) and create and usage grant on the target database (nyc_taxi), more here.

After copying (aka running this notebook), any future access to the prepared tables will be as simple as the following:

%sql 
-- Example assumes usage grant on catalog
-- and usage and create grant within database
-- More at https://www.databricks.com/product/unity-catalog
use catalog geospatial_docs;
use database nyc_taxi;

!!! NOTE: This must be run outside of DBSQL, e.g. from within Databricks Spark Clusters!!!

%pip install geopandas fsspec --quiet
Python interpreter will be restarted. Python interpreter will be restarted.

Table yellow_trip

We will use existing NYC Yellow Taxi trips dataset already available to all Databricks customers. Because it is in DBFS, this is a one time call to set it up in Unity Catalog

  • DBFS Sample Dataset: /databricks-datasets/nyctaxi/tables/nyctaxi_yellow (1.6B trips)

The data will be registered as temp view yellow_trip_view in the hive metastore by pointing to the DBFS location, then will be referenced from catalog geospatial_docs to create the table within database nyc_taxi.

df_yellow_trip = spark.table("delta.`/databricks-datasets/nyctaxi/tables/nyctaxi_yellow`")
print(f"count? {df_yellow_trip.count():,}")
count? 1,611,611,035

Register as a temporary view in hive metastore in order to copy as table into unity catalog.

df_yellow_trip.createOrReplaceTempView("yellow_trip_view")

Create table in unity catalog.

%sql -- 1x operation to copy over from hive_metastore into unity catalog
-- drop table if exists geospatial_docs.nyc_taxi.yellow_trip;
create table if not exists geospatial_docs.nyc_taxi.yellow_trip as
  select * from yellow_trip_view
Query returned no results
%sql select * from geospatial_docs.nyc_taxi.yellow_trip
 
vendor_id
pickup_datetime
dropoff_datetime
passenger_count
trip_distance
pickup_longitude
pickup_latitude
rate_code_id
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
VTS
2009-11-19T10:40:00.000+0000
2009-11-19T10:49:00.000+0000
1
1.64
-73.988087
40.731822
null
VTS
2009-11-20T11:33:00.000+0000
2009-11-20T11:59:00.000+0000
3
3
-73.987065
40.724667
null
VTS
2009-11-20T22:04:00.000+0000
2009-11-20T22:14:00.000+0000
1
1.63
-73.983477
40.726132
null
VTS
2009-11-21T02:12:00.000+0000
2009-11-21T02:17:00.000+0000
4
0.76
-73.98197
40.728307
null
VTS
2009-11-23T10:19:00.000+0000
2009-11-23T10:33:00.000+0000
1
2.42
-73.982778
40.735485
null
VTS
2009-11-21T12:26:00.000+0000
2009-11-21T12:41:00.000+0000
1
4.94
-73.985323
40.731768
null
VTS
2009-11-20T01:29:00.000+0000
2009-11-20T01:39:00.000+0000
1
2.25
-73.987977
40.728437
null
VTS
2009-11-21T02:29:00.000+0000
2009-11-21T02:39:00.000+0000
1
4.47
-73.98878
40.722638
null
VTS
2009-11-20T19:44:00.000+0000
2009-11-20T19:53:00.000+0000
1
4.28
-73.987592
40.732983
null
VTS
2009-11-21T04:52:00.000+0000
2009-11-21T05:05:00.000+0000
1
3.04
-73.986622
40.69995
null
VTS
2009-11-17T14:13:00.000+0000
2009-11-17T14:23:00.000+0000
5
1.12
-73.985165
40.728428
null
VTS
2009-11-18T09:19:00.000+0000
2009-11-18T09:29:00.000+0000
1
1.31
-73.986143
40.722267
null
VTS
2009-11-18T21:49:00.000+0000
2009-11-18T21:59:00.000+0000
1
2.23
-73.987738
40.729457
null
VTS
2009-11-19T17:50:00.000+0000
2009-11-19T18:27:00.000+0000
1
6.57
-73.9807
40.720483
null
VTS
2009-11-18T22:10:00.000+0000
2009-11-18T22:28:00.000+0000
1
4.28
-73.97901
40.723945
null
VTS
2009-11-21T12:55:00.000+0000
2009-11-21T13:01:00.000+0000
2
1.18
-73.980677
40.72572
null
VTS
2009-11-19T01:13:00.000+0000
2009-11-19T01:19:00.000+0000
1
1.49
-73.9881
40.732112
null
Truncated results, showing first 1,000 rows.

The yellow_trips tables has a lot of good information but we also want to understand zip codes of the pickups and dropoffs from NYC Data.

%sh wget -O nyc_zip_codes.zip https://data.cityofnewyork.us/download/i8iw-xf4u/application%2Fzip
ls
--2022-09-28 14:55:04-- https://data.cityofnewyork.us/download/i8iw-xf4u/application%2Fzip Resolving data.cityofnewyork.us (data.cityofnewyork.us)... 52.206.68.26, 52.206.140.199, 52.206.140.205 Connecting to data.cityofnewyork.us (data.cityofnewyork.us)|52.206.68.26|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://data.cityofnewyork.us/api/views/i8iw-xf4u/files/YObIR0MbpUVA0EpQzZSq5x55FzKGM2ejSeahdvjqR20?filename=ZIP_CODE_040114.zip [following] --2022-09-28 14:55:05-- https://data.cityofnewyork.us/api/views/i8iw-xf4u/files/YObIR0MbpUVA0EpQzZSq5x55FzKGM2ejSeahdvjqR20?filename=ZIP_CODE_040114.zip Reusing existing connection to data.cityofnewyork.us:443. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/octet-stream] Saving to: ‘nyc_zip_codes.zip’ 0K .......... .......... .......... .......... .......... 841K 50K .......... .......... .......... .......... .......... 844K 100K .......... .......... .......... .......... .......... 422K 150K .......... .......... .......... .......... .......... 842K 200K .......... .......... .......... .......... .......... 844K 250K .......... .......... .......... .......... .......... 840K 300K .......... .......... .......... .......... .......... 840K 350K .......... .......... .......... .......... .......... 423K 400K .......... .......... .......... .......... .......... 836K 450K .......... .......... .......... .......... .......... 92.7M 500K .......... .......... .......... .......... .......... 839K 550K .......... .......... .......... .......... .......... 846K 600K .......... .......... .......... .......... .......... 91.5M 650K .......... .......... .......... .......... .......... 864K 700K .......... .......... .......... .......... .......... 60.7M 750K .......... .......... .......... .......... .......... 84.0M 800K .......... .......... .......... .......... .......... 108M 850K .......... .......... .......... .......... .......... 119M 900K .......... .......... .......... .......... .......... 870K 950K .......... .......... .......... .......... .......... 70.0M 1000K .......... .......... .......... .......... .......... 103M 1050K .......... .......... .......... .......... .......... 99.4M 1100K .......... .......... .......... .......... .......... 123M 1150K .......... .......... .......... .......... .......... 102M 1200K .......... .......... .......... .......... .......... 124M 1250K .......... .......... .......... .......... .......... 137M 1300K .......... .......... .......... .......... .......... 132M 1350K .......... .......... .......... .......... .......... 113M 1400K .......... .......... .......... .......... .......... 904K 1450K .......... .......... ........ 109M=0.9s 2022-09-28 14:55:06 (1.52 MB/s) - ‘nyc_zip_codes.zip’ saved [1514401] azure conf eventlogs ganglia hadoop_accessed_config.lst logs nyc_zip_codes.zip preload_class.lst
import geopandas
import fsspec

with fsspec.open('nyc_zip_codes.zip') as file:
    gdf = geopandas.read_file(file)

Notice: CRS is 2263, want to standardize to 4326 for H3

gdf.crs
Out[8]: <Derived Projected CRS: EPSG:2263> Name: NAD83 / New York Long Island (ftUS) Axis Info [cartesian]: - X[east]: Easting (US survey foot) - Y[north]: Northing (US survey foot) Area of Use: - name: United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk. - bounds: (-74.26, 40.47, -71.8, 41.3) Coordinate Operation: - name: SPCS83 New York Long Island zone (US Survey feet) - method: Lambert Conic Conformal (2SP) Datum: North American Datum 1983 - Ellipsoid: GRS 1980 - Prime Meridian: Greenwich
gdf.to_crs(4326, inplace=True)
gdf

Verify new CRS is 4326

gdf.crs
Out[10]: <Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World. - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 ensemble - Ellipsoid: WGS 84 - Prime Meridian: Greenwich

Get the geometry data WKT which is more portable than the shapely geometry used by geopandas.

gdf['geom_wkt'] = [geom.wkt for geom in gdf['geometry']]

Convert to Spark DataFrame

df_zip = spark.createDataFrame(gdf.drop(['geometry'], axis=1, inplace=False))
print(f"count? {df_zip.count()}")
print(df_zip.printSchema())
display(df_zip)
count? 263 root |-- ZIPCODE: string (nullable = true) |-- BLDGZIP: string (nullable = true) |-- PO_NAME: string (nullable = true) |-- POPULATION: double (nullable = true) |-- AREA: double (nullable = true) |-- STATE: string (nullable = true) |-- COUNTY: string (nullable = true) |-- ST_FIPS: string (nullable = true) |-- CTY_FIPS: string (nullable = true) |-- URL: string (nullable = true) |-- SHAPE_AREA: double (nullable = true) |-- SHAPE_LEN: double (nullable = true) |-- geom_wkt: string (nullable = true) None
 
ZIPCODE
BLDGZIP
PO_NAME
POPULATION
AREA
STATE
COUNTY
ST_FIPS
CTY_FIPS
URL
SHAPE_AREA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
11436
0
Jamaica
18681
22699295.4594145
NY
Queens
36
081
http://www.usps.com/
0
11213
0
Brooklyn
62426
29631004.437939
NY
Kings
36
047
http://www.usps.com/
0
11212
0
Brooklyn
83866
41972104.0714198
NY
Kings
36
047
http://www.usps.com/
0
11225
0
Brooklyn
56527
23698630.1249252
NY
Kings
36
047
http://www.usps.com/
0
11218
0
Brooklyn
72280
36868798.9897562
NY
Kings
36
047
http://www.usps.com/
0
11226
0
Brooklyn
106132
39408598.2790117
NY
Kings
36
047
http://www.usps.com/
0
11219
0
Brooklyn
92561
42002738.4672304
NY
Kings
36
047
http://www.usps.com/
0
11210
0
Brooklyn
67067
47887022.9921604
NY
Kings
36
047
http://www.usps.com/
0
11230
0
Brooklyn
80857
49926703.3933172
NY
Kings
36
047
http://www.usps.com/
0
11204
0
Brooklyn
77354
43555184.613285
NY
Kings
36
047
http://www.usps.com/
0
10471
0
Bronx
23477
89651407.0583499
NY
Bronx
36
005
http://www.usps.com/
0
10470
0
Bronx
14740
21543461.581388
NY
Bronx
36
005
http://www.usps.com/
0
10466
0
Bronx
68942
55262490.6124597
NY
Bronx
36
005
http://www.usps.com/
0
10467
0
Bronx
97932
69336166.3726341
NY
Bronx
36
005
http://www.usps.com/
0
Truncated results, showing first 26 rows.

Register as a temporary view in hive metastore in order to copy as table into unity catalog.

df_zip.createOrReplaceTempView("zip_view")

Create table in unity catalog.

%sql -- 1x operation to copy over from hive_metastore into unity catalog
-- drop table if exists geospatial_docs.nyc_taxi.nyc_zipcode;
create table if not exists geospatial_docs.nyc_taxi.nyc_zipcode as
  select * from zip_view;
Query returned no results
%sql select * from geospatial_docs.nyc_taxi.nyc_zipcode
 
ZIPCODE
BLDGZIP
PO_NAME
POPULATION
AREA
STATE
COUNTY
ST_FIPS
CTY_FIPS
URL
SHAPE_AREA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
10303
0
Staten Island
27048
81629773.5174369
NY
Richmond
36
085
http://www.usps.com/
0
11234
0
Brooklyn
85853
206201040.087973
NY
Kings
36
047
http://www.usps.com/
0
10302
0
Staten Island
18123
29555323.0233747
NY
Richmond
36
085
http://www.usps.com/
0
11693
0
Far Rockaway
11052
46471727.1709791
NY
Queens
36
081
http://www.usps.com/
0
11209
0
Brooklyn
69255
57842313.9612272
NY
Kings
36
047
http://www.usps.com/
0
10304
0
Staten Island
41502
98226679.797328
NY
Richmond
36
085
http://www.usps.com/
0
10314
0
Staten Island
85430
473985727.130054
NY
Richmond
36
085
http://www.usps.com/
0
11228
0
Brooklyn
43080
43893024.7317224
NY
Kings
36
047
http://www.usps.com/
0
11096
0
Inwood
147
1512446.30548939
NY
Queens
36
081
http://www.usps.com/
0
10305
0
Staten Island
41746
111441829.317708
NY
Richmond
36
085
http://www.usps.com/
0
11229
0
Brooklyn
79608
60119326.1442695
NY
Kings
36
047
http://www.usps.com/
0
11370
0
East Elmhurst
40591
23600231.4880305
NY
Queens
36
081
http://www.usps.com/
0
10021
0
New York
45299
10495133.4084987
NY
New York
36
061
http://www.usps.com/
0
11358
0
Flushing
37743
54622541.0857367
NY
Queens
36
081
http://www.usps.com/
0
Truncated results, showing first 34 rows.