Ingest Google Drive files into Databricks
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
The standard Google Drive connector in Lakeflow Connect allows you to use Databricks Spark and SQL functions (read_files, spark.read, COPY INTO, and Auto Loader) to create Spark DataFrames, materialized views, and streaming tables directly from files in Google Drive.
With this approach, you can create custom pipelines for common file ingestion use cases:
- Streaming file ingestion (unstructured): Ingesting many source files (for example, PDFs, Google Docs, and Google Slides) into a single target table as binary data, ideal for RAG pipelines.
- Streaming file ingestion (structured): Merging many source files (for example, CSV and JSON files) into a single, structured target table.
- Batch file ingestion: Ingesting a single, specific file (such as a Google Sheet) or a batch of files into one target table.
These interfaces are supported:
- Declarative Automation Bundles
- Databricks APIs
- Databricks SDKs
- Databricks CLI
Limitations
- The connector is API-only and does not support pipeline creation in the Databricks UI.
- You can use the pathGlobFilter option to filter files by name (for example, pathGlobFilter => '*.csv'). Built-in Google formats (for example, Google Docs or Google Sheets) can't be filtered using this option. Folder path filtering is also not supported.
- Unsupported formats include Google Forms, Google Sites, Google Jams, and Google Vids. The ingestion process skips unsupported formats.
- Setting recursiveFileLookup=false is not supported for Spark batch reads. Using recursiveFileLookup=false behaves the same as recursiveFileLookup=true.
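The pathGlobFilter option matches glob patterns against file names, which is why built-in Google formats (whose names typically carry no file extension) can't be selected with extension patterns. As a rough local illustration only, Python's fnmatch uses similar glob semantics; this models the matching behavior, not the connector itself:

```python
from fnmatch import fnmatch

# Illustrative file names: two regular files and a Google Doc,
# which has no file extension in Drive.
names = ["sales_2024.csv", "report.pdf", "Quarterly Planning Notes"]

# A pattern such as '*.csv' selects only names ending in .csv,
# so the extension-less Google Doc is never matched.
csv_matches = [n for n in names if fnmatch(n, "*.csv")]
print(csv_matches)  # ['sales_2024.csv']
```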
Requirements
Before you begin, verify that you have:
- A Unity Catalog-enabled workspace.
- CREATE CONNECTION permissions.
- Databricks Runtime 17.3 or above.
- The Excel file format Beta feature enabled, if you want to ingest Google Sheets or Excel files. See Read Excel files.
- A Google account with the necessary permissions to create a Google Cloud project.
Configure OAuth 2.0
Set up a Google Cloud project and enable the Google Drive API
- Go to the Google Cloud console.
- Create a new project. You might be prompted to set up two-factor authentication.
- Go to APIs & Services > Library.
- Search for "Google Drive API".
- Select Google Drive API.
- Select Enable.
Set up the OAuth consent screen for your project
- On the Google Cloud console home screen, go to APIs & Services > OAuth Consent Screen. You will see a message that says "Google Auth Platform not configured yet".
- Select Get Started.
- Fill in the App Information section. Enter any name for the app name (for example, Databricks connection). The support email can be any email in your organization.
- Select Next.
- In the Audience section, select External, then select Next.
- Fill in the Contact Information section, then select Next.
- Review the Google API Services User Data Policy, then select Create.
- Go back to Google Auth Platform > Data Access.
- Select Add or remove scopes.
- Add the following scope in the Manually add scopes section, select Add to table, then select Update: https://www.googleapis.com/auth/drive.readonly
- Select Save.
Create OAuth 2.0 client credentials
- On the Google Cloud console home screen, go to APIs & Services > Credentials.
- Select Create Credentials > OAuth Client ID.
- Choose Web Application and set a custom name.
- In Authorized Redirect URIs, select Add URI.
- Add a redirect URI of <databricks-instance-url>/login/oauth/google.html, replacing <databricks-instance-url> with your Databricks instance URL. For example: https://instance-name.databricks.com/login/oauth/google.html
- Select Create. A dialog with your credentials appears.
- Record the following values. Alternatively, you can download the OAuth Client JSON file, which contains this information:
  - Client ID (format: 0123******-********************************.apps.googleusercontent.com)
  - Client secret (format: ABCD**-****************************)
Add test users to your project
- Go to Google Auth Platform > Audience.
- Under Test users, select Add users.
- Add the email address of the Google account you will use to create the connection.
Create a connection
- In the Databricks workspace, select Catalog > External locations > Connections > Create connection.
- On the Connection basics page of the Set up connection wizard, specify a unique Connection name.
- In the Connection type dropdown menu, search for and select Google Drive.
- (Optional) Add a comment.
- Select Next.
- On the Authentication page, enter the following:
  - OAuth scope: https://www.googleapis.com/auth/drive.readonly
  - Client secret: The client secret from Create OAuth 2.0 client credentials.
  - Client ID: The client ID from Create OAuth 2.0 client credentials.
- Select Sign in with Google and sign in with the Google account from Add test users to your project.
- Select Continue, then select Continue again.
- After you're redirected back to the Databricks workspace, select Create connection.
Ingest files from Google Drive
Ingest files from Google Drive using the connector and read_files (Databricks SQL), Auto Loader (.readStream with cloudFiles), COPY INTO, and spark.read. You must provide the following values:
- The Google Drive URL as the path. The URL can refer to a specific file, a folder, or an entire drive. For example:
  - https://docs.google.com/spreadsheets/d/12345/edit?random_query_params_here
  - https://drive.google.com/drive/u/0/folders/12345
  - https://docs.google.com/document/d/12345/edit
  - https://drive.google.com/file/d/1kiXnHmU4Y8X66ijULky5EPDNCGtT14Ps/view?usp=drive_link
  - https://drive.google.com/drive/
  - https://drive.google.com/drive/my-drive
  - https://drive.google.com/
- The Unity Catalog connection using the databricks.connection data source option.
Stream Google Drive files using Auto Loader
Auto Loader is the most efficient way to incrementally ingest structured files from Google Drive. It automatically detects new files and processes them as they arrive. It can also ingest structured and semi-structured files such as CSV and JSON with automatic schema inference and evolution. For details about Auto Loader usage, see Common data loading patterns.
# Incrementally ingest new PDF files
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("databricks.connection", "my_gdrive_conn")
    .option("cloudFiles.schemaLocation", <path to a schema location>)
    .option("pathGlobFilter", "*.pdf")
    .load("https://drive.google.com/drive/folders/1a2b3c4d...")
    .select("*", "_metadata")
)

# Incrementally ingest CSV files with automatic schema inference and evolution
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("databricks.connection", "my_gdrive_conn")
    .option("pathGlobFilter", "*.csv")
    .option("inferColumnTypes", True)
    .option("header", True)
    .load("https://drive.google.com/drive/folders/1a2b3c4d...")
)
Read Google Drive files using Spark batch read
The following example shows how to ingest Google Drive files in Python using the spark.read function. For a list of supported file formats and Spark reader options, see Generic File Source Options in the Apache Spark documentation.
Setting recursiveFileLookup=false is not supported for Spark batch reads. Using recursiveFileLookup=false will behave the same as recursiveFileLookup=true.
# Read unstructured data as binary files
df = (spark.read
    .format("binaryFile")
    .option("databricks.connection", "my_gdrive_conn")
    .option("recursiveFileLookup", True)
    .option("pathGlobFilter", "*.pdf")
    .load("https://drive.google.com/drive/folders/1a2b3c4d..."))

# Read a batch of CSV files, infer the schema, and load the data into a DataFrame
df = (spark.read
    .format("csv")
    .option("databricks.connection", "my_gdrive_conn")
    .option("pathGlobFilter", "*.csv")
    .option("recursiveFileLookup", True)
    .option("inferSchema", True)
    .option("header", True)
    .load("https://drive.google.com/drive/folders/1a2b3c4d..."))
Read a single structured Google Drive file
The following example reads a single tab of a Google Sheet and loads it into a DataFrame. It demonstrates a few common parsing options.
For a full list of supported parsing options for Excel files and Google Sheets, see Read Excel files. For a full list of all other supported file formats and Spark reader options, see Generic File Source Options in the Apache Spark documentation.
df = (spark.read
    .format("excel")  # use 'excel' for Google Sheets
    .option("databricks.connection", "my_gdrive_conn")
    .option("headerRows", 1)  # optional
    .option("inferColumns", True)  # optional
    .option("dataAddress", "Sheet1!A1:Z10")  # optional
    .load("https://docs.google.com/spreadsheets/d/9k8j7i6f..."))
Read Google Drive files using Spark SQL
The following example shows how to ingest Google Drive files in SQL using the read_files table-valued function. For details about read_files usage, see read_files table-valued function.
-- Read pdf files
CREATE TABLE my_table AS
SELECT * FROM read_files(
"https://drive.google.com/drive/folders/1a2b3c4d...",
`databricks.connection` => "my_gdrive_conn",
format => "binaryFile",
pathGlobFilter => "*.pdf", -- optional. Example: only ingest PDFs
schemaEvolutionMode => "none"
);
-- Read a Google Sheet and range
CREATE TABLE my_sheet_table AS
SELECT * FROM read_files(
"https://docs.google.com/spreadsheets/d/9k8j7i6f...",
`databricks.connection` => "my_gdrive_conn",
format => "excel",
headerRows => 1, -- optional
dataAddress => "Sheet1!A2:D10", -- optional
schemaEvolutionMode => "none"
);
Ingest Google Drive files using Lakeflow Spark Declarative Pipelines
The Google Drive connector requires Databricks Runtime 17.3 or later. To use the connector, set "CHANNEL" = "PREVIEW" in your pipeline settings. For more information on previews, see Pipeline properties reference.
This example shows how to read Google Drive files using Auto Loader in Lakeflow Spark Declarative Pipelines. For more information, see Lakeflow Spark Declarative Pipelines.
- Databricks SQL
- Python
-- Incrementally ingest new PDF files
CREATE OR REFRESH STREAMING TABLE gdrive_pdf_table
AS SELECT * FROM STREAM read_files(
"https://drive.google.com/drive/folders/1a2b3c4d...",
format => "binaryFile",
`databricks.connection` => "my_gdrive_conn",
pathGlobFilter => "*.pdf");
-- Incrementally ingest CSV files with automatic schema inference and evolution
CREATE OR REFRESH STREAMING TABLE gdrive_csv_table
AS SELECT * FROM STREAM read_files(
"https://drive.google.com/drive/folders/1a2b3c4d...",
format => "csv",
`databricks.connection` => "my_gdrive_conn",
pathGlobFilter => "*.csv",
header => "true");
-- Read a specific Excel file from Google Drive in a materialized view
CREATE OR REFRESH MATERIALIZED VIEW gdrive_excel_table
AS SELECT * FROM read_files(
"https://docs.google.com/spreadsheets/d/9k8j7i6f...",
`databricks.connection` => "my_gdrive_conn",
format => "excel",
headerRows => 1, -- optional
dataAddress => "Sheet1!A2:D10", -- optional
schemaEvolutionMode => "none"
);
from pyspark import pipelines as dp

# Incrementally ingest new PDF files
@dp.table
def gdrive_pdf_table():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .option("databricks.connection", "my_gdrive_conn")
        .option("pathGlobFilter", "*.pdf")
        .load("https://drive.google.com/drive/folders/1a2b3c4d...")
    )

# Incrementally ingest CSV files with automatic schema inference and evolution
@dp.table
def gdrive_csv_table():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("databricks.connection", "my_gdrive_conn")
        .option("pathGlobFilter", "*.csv")
        .option("inferColumnTypes", True)
        .option("header", True)
        .load("https://drive.google.com/drive/folders/1a2b3c4d...")
    )

# Read a specific Excel file from Google Drive in a materialized view
@dp.table
def gdrive_excel_table():
    return (spark.read.format("excel")
        .option("databricks.connection", "my_gdrive_conn")
        .option("headerRows", 1)  # optional
        .option("inferColumnTypes", True)  # optional
        .option("dataAddress", "Sheet1!A1:M20")  # optional
        .load("https://docs.google.com/spreadsheets/d/9k8j7i6f..."))
Parse unstructured files with ai_parse_document
When you ingest unstructured files from Google Drive (such as PDFs, Word documents, or PowerPoint files) using the standard Google Drive connector with binaryFile format, the file contents are stored as raw binary data. To prepare these files for AI workloads—such as RAG, search, classification, or document understanding—you can parse the binary content into structured, queryable output using ai_parse_document.
The following example shows how to parse unstructured documents stored in a bronze Delta table named documents, adding a new column with parsed content:
CREATE TABLE documents AS
SELECT *, _metadata FROM read_files(
"https://drive.google.com/drive/folders/1a2b3c4d...",
`databricks.connection` => "my_gdrive_conn",
format => "binaryFile",
pathGlobFilter => "*.{pdf,jpeg}"
);
SELECT *, ai_parse_document(content) AS parsed_content
FROM documents;
The parsed_content column contains extracted text, tables, layout information, and metadata that can be directly used for downstream AI pipelines.
Incremental parsing with Lakeflow Spark Declarative Pipelines
You can also use ai_parse_document within Lakeflow Spark Declarative Pipelines to enable incremental parsing. As new files stream in from Google Drive, they are automatically parsed as your pipeline updates.
For example, you can define a materialized view that continuously parses newly ingested documents:
CREATE OR REFRESH STREAMING TABLE documents
AS SELECT *, _metadata FROM STREAM read_files(
"https://drive.google.com/drive/folders/1a2b3c4d...",
format => "binaryFile",
`databricks.connection` => "my_gdrive_conn",
pathGlobFilter => "*.{pdf,jpeg}");
CREATE OR REFRESH MATERIALIZED VIEW documents_parsed
AS
SELECT *, ai_parse_document(content) AS parsed_content
FROM documents;
This approach ensures that:
- Newly ingested Google Drive files are parsed automatically whenever the materialized view refreshes
- Parsed outputs stay in sync with incoming data
- Downstream AI pipelines always operate on up-to-date document representations
For supported formats and advanced options, see ai_parse_document function.
How built-in Google formats are handled
You do not need to export your built-in Google files (Docs, Sheets, Slides, and Drawings) manually. The connector automatically exports them to an open format during ingestion.
| Google format | Exported as (default) |
|---|---|
| Google Docs | DOCX |
| Google Sheets | XLSX |
| Google Slides | PPTX |
| Google Drawings | PNG |
Google Drive export format configuration
You can configure how Google Drive native files are exported by setting Spark configurations using spark.conf.set(). These configurations determine the MIME type used when exporting Google Docs, Sheets, Slides, and Drawings.
Configuration keys:
- fs.gdrive.format.document.export: Google Docs export format.
- fs.gdrive.format.spreadsheet.export: Google Sheets export format.
- fs.gdrive.format.presentation.export: Google Slides export format.
- fs.gdrive.format.drawing.export: Google Drawings export format.
For a complete list of supported export formats, see Export MIME types for Google Workspace documents in the Google Workspace documentation.
The following example exports a Google Docs file as TXT.
spark.conf.set("fs.gdrive.format.document.export", "text/plain")
df = (spark.read
    .option("databricks.connection", "my_gdrive_conn")
    .text("https://docs.google.com/document/d/1a2b3c4d..."))
The following example exports a Google Sheets file as CSV.
spark.conf.set("fs.gdrive.format.spreadsheet.export", "text/csv")
df = (spark.read
    .option("databricks.connection", "my_gdrive_conn")
    .option("header", "true")
    .csv("https://docs.google.com/spreadsheets/d/1a2b3c4d..."))
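To keep export settings in one place, you could pair each configuration key with the MIME type you want before calling spark.conf.set(). The following is an illustrative convenience sketch, not part of the connector API; the configuration keys are those documented above, and the supported MIME values come from Google's export MIME type documentation:

```python
# Illustrative mapping of Google Workspace types to the Spark conf
# keys documented on this page. Not part of the connector API.
EXPORT_CONF_KEYS = {
    "document": "fs.gdrive.format.document.export",
    "spreadsheet": "fs.gdrive.format.spreadsheet.export",
    "presentation": "fs.gdrive.format.presentation.export",
    "drawing": "fs.gdrive.format.drawing.export",
}

def export_conf(gtype: str, mime: str) -> tuple[str, str]:
    """Return the (conf key, MIME type) pair to pass to spark.conf.set()."""
    return EXPORT_CONF_KEYS[gtype], mime

key, value = export_conf("presentation", "application/pdf")
print(key, value)  # fs.gdrive.format.presentation.export application/pdf
# On a cluster you would then call: spark.conf.set(key, value)
```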
Schemas
binaryFile format
When you use format => 'binaryFile', the resulting table has the following schema:
- path (string): The full URL to the file.
- modificationTime (timestamp): The last modified time of the file.
- length (long): The size of the file in bytes.
- content (binary): The raw binary content of the file.
_metadata column
You can get metadata about input files, such as file_name, file_path, file_size, and file_modification_time, with the _metadata column. The _metadata column is hidden and available for all input file formats. To include it in the returned DataFrame, you must explicitly select it in the read query where you specify the source. For more information, see File metadata column.
For Google Drive native files (such as Google Docs, Google Sheets, and Google Slides), the file_size field refers to the size of the file stored on Google Drive, not the size of the file in the exported format (for example, DOCX, XLSX, or PPTX).
Example of selecting _metadata:
SELECT *, _metadata FROM read_files(
"https://drive.google.com/",
`databricks.connection` => "my_connection",
format => "binaryFile"
);
FAQ
I have a folder of multiple structured files (for example, multiple Google Sheets). How do I load each sheet or file as its own Delta table?
You must create a new ingestion query for each file to ingest into its own Delta table.
My files require custom parsing. How do I provide these parsing parameters to ensure my files are read correctly?
The connector supports all file format options available in Auto Loader, COPY INTO, and Spark. For details, see the file format options documentation for each of those interfaces.
Are files in subfolders recursively ingested?
When you use Auto Loader APIs (spark.readStream and read_files), all subfolders are recursively discovered and ingested. This is also true for batch spark.read, where files in the provided folder path are always read recursively.
My Google Sheet has many irregularities and requires specific parsing and cell-range extraction (for example, multiple tables per sheet). The schema cannot be inferred automatically. How do I handle this?
You can use Excel file format parsing options to parse your Google Sheet into your desired format. See Read Excel files.
Alternatively, you can disable schema inference in Auto Loader, PySpark, or read_files. The resulting table has default column names, all values are read as strings, and the table might be sparse. You can then perform any necessary transformations downstream.
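As a point of reference for what "default column names" means here: Spark's CSV reader assigns positional names when no header is available, and with inference disabled every value is read as a string. A small local sketch of that naming scheme (this mirrors Spark's default convention; it is not connector-specific behavior):

```python
# Spark names headerless CSV columns positionally: _c0, _c1, ...
# This helper just reproduces that naming scheme for illustration.
def default_csv_columns(n: int) -> list[str]:
    return [f"_c{i}" for i in range(n)]

print(default_csv_columns(4))  # ['_c0', '_c1', '_c2', '_c3']
```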
Next steps
- Explore Common data loading patterns
- Review the ai_parse_document function