PySpark custom data sources

PySpark custom data sources are created using the Python (PySpark) DataSource API, which enables reading from custom data sources and writing to custom data sinks in Apache Spark using Python. You can use PySpark custom data sources to define custom connections to data systems and implement additional functionality to build out reusable data sources.

note

PySpark custom data sources require Databricks Runtime 15.4 LTS and above, or serverless environment version 2.

DataSource class

The PySpark DataSource is a base class that provides methods to create data readers and writers.

Implement the data source subclass

Depending on your use case, the following must be implemented by any subclass to make a data source either readable, writable, or both:

Property or Method	Description
`name`	Required. The name of the data source
`schema`	Required. The schema of the data source to be read or written
`reader()`	Must return a `DataSourceReader` to make the data source readable (batch)
`writer()`	Must return a `DataSourceWriter` to make the data sink writable (batch)
`streamReader()` or `simpleStreamReader()`	Must return a `DataSourceStreamReader` (from `streamReader()`) or a `SimpleDataSourceStreamReader` (from `simpleStreamReader()`) to make the data stream readable (streaming)
`streamWriter()`	Must return a `DataSourceStreamWriter` to make the data stream writable (streaming)

note

The user-defined DataSource, DataSourceReader, DataSourceWriter, DataSourceStreamReader, DataSourceStreamWriter, and their methods must be serializable. In other words, any state they hold must be a dictionary or nested dictionary of primitive types.
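
One way to sanity-check this requirement before registering a class is to try pickling the attributes an instance holds. The following is a minimal sketch that assumes a failed `pickle.dumps` call is a reasonable proxy for what Spark would reject. The helper `unpicklable_attributes` and the sample state dictionaries are hypothetical and are not part of the PySpark API.

```python
import pickle
import threading


def unpicklable_attributes(state: dict) -> list:
    """Return the names of the attributes in `state` that fail pickle.dumps."""
    failed = []
    for name, value in state.items():
        try:
            pickle.dumps(value)
        except Exception:
            failed.append(name)
    return failed


# State made of strings, numbers, and nested dictionaries pickles cleanly.
plain_state = {"options": {"numRows": "3"}, "batch_size": 100}

# A live resource such as a lock (or a socket or open connection) does not.
live_state = {"options": {"numRows": "3"}, "lock": threading.Lock()}

print(unpicklable_attributes(plain_state))  # []
print(unpicklable_attributes(live_state))   # ['lock']
```

For a real reader or writer instance, you could pass `vars(instance)` to the helper. Resources such as connections are usually safer to create inside `read()` or `write()` than in `__init__()`.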

Register the data source

After implementing the interface, register the data source. You can then load or otherwise use it, as shown in the following example:

Python
# Register the data source
spark.dataSource.register(MyDataSourceClass)

# Read from a custom data source
spark.read.format("my_datasource_name").load().show()

Example 1: Create a PySpark DataSource for batch query

To demonstrate PySpark DataSource reader capabilities, create a data source that generates example data using the faker Python package. For more information about faker, see the Faker documentation.

Install the faker package using the following command:

Python
%pip install faker

Step 1: Define the example DataSource

First, define your new PySpark DataSource as a subclass of DataSource with a name, schema, and reader. The reader() method must be defined to read from a data source in a batch query.

Python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType

class FakeDataSource(DataSource):
    """
    An example data source for batch query using the `faker` library.
    """

    @classmethod
    def name(cls):
        return "fake"

    def schema(self):
        return "name string, date string, zipcode string, state string"

    def reader(self, schema: StructType):
        return FakeDataSourceReader(schema, self.options)

Step 2: Implement the reader for a batch query

Next, implement the reader logic to generate example data. Use the installed faker library to populate each field in the schema.

Python
class FakeDataSourceReader(DataSourceReader):

    def __init__(self, schema, options):
        self.schema: StructType = schema
        self.options = options

    def read(self, partition):
        # Library imports must be within the method.
        from faker import Faker
        fake = Faker()

        # Every value in this `self.options` dictionary is a string.
        num_rows = int(self.options.get("numRows", 3))
        for _ in range(num_rows):
            row = []
            for field in self.schema.fields:
                value = getattr(fake, field.name)()
                row.append(value)
            yield tuple(row)
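
The row-generation logic in `read()` can be tried without Spark or faker. The following sketch reproduces the two ideas the reader relies on: option values arrive as strings and must be converted, and each schema field name is resolved to a provider method with `getattr`. `StubProvider` and `generate_rows` are hypothetical names used only for illustration, and the stub returns fixed values instead of random ones.

```python
class StubProvider:
    """Hypothetical stand-in for a Faker instance. Returns fixed values."""

    def name(self):
        return "Ada Lovelace"

    def state(self):
        return "Ohio"


def generate_rows(field_names, options, provider):
    """Yield one tuple per row, calling provider.<field_name>() for each field."""
    # Option values arrive as strings, so convert before using them as numbers.
    num_rows = int(options.get("numRows", 3))
    for _ in range(num_rows):
        yield tuple(getattr(provider, name)() for name in field_names)


rows = list(generate_rows(["name", "state"], {"numRows": "2"}, StubProvider()))
print(rows)  # [('Ada Lovelace', 'Ohio'), ('Ada Lovelace', 'Ohio')]
```

When the `numRows` option is absent, the default of 3 applies, which matches the default row count described for the example data source.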

Step 3: Register and use the example data source

To use the data source, register it. By default, the FakeDataSource has three rows, and the schema includes these string fields: name, date, zipcode, state. The following example registers, loads, and outputs the example data source with the defaults:

Python
spark.dataSource.register(FakeDataSource)
spark.read.format("fake").load().show()

Output
+-----------------+----------+-------+----------+
|             name|      date|zipcode|     state|
+-----------------+----------+-------+----------+
|Christine Sampson|1979-04-24|  79766|  Colorado|
|       Shelby Cox|2011-08-05|  24596|   Florida|
|  Amanda Robinson|2019-01-06|  57395|Washington|
+-----------------+----------+-------+----------+

Only string fields are supported, but you can specify a schema with any fields that correspond to faker package providers' fields to generate random data for testing and development. The following example loads the data source with name and company fields:

Python
spark.read.format("fake").schema("name string, company string").load().show()

Output
+---------------------+--------------+
|name                 |company       |
+---------------------+--------------+
|Tanner Brennan       |Adams Group   |
|Leslie Maxwell       |Santiago Group|
|Mrs. Jacqueline Brown|Maynard Inc   |
+---------------------+--------------+
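
Because the reader resolves each field name with `getattr`, a field that has no matching provider method fails only when the read runs. The following sketch shows one way to check field names up front. The helper `unsupported_fields` and the `StubProvider` class are hypothetical. With the real package, you would pass a `Faker()` instance instead of the stub, which is an untested assumption here.

```python
class StubProvider:
    """Hypothetical stand-in for a Faker instance with two provider methods."""

    def name(self):
        return "Ada Lovelace"

    def company(self):
        return "Analytical Engines Inc"


def unsupported_fields(field_names, provider):
    """Return the field names that have no matching callable on the provider."""
    return [
        name for name in field_names
        if not callable(getattr(provider, name, None))
    ]


missing = unsupported_fields(["name", "company", "salary"], StubProvider())
print(missing)  # ['salary']
```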

To load the data source with a custom number of rows, specify the numRows option. The following example specifies 5 rows:

Python
spark.read.format("fake").option("numRows", 5).load().show()

Output
+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|  Pam Mitchell|1988-10-20|  23788|   Tennessee|
|Melissa Turner|1996-06-14|  30851|      Nevada|
|  Brian Ramsey|2021-08-21|  55277|  Washington|
|  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
| Douglas James|2007-01-18|  46226|     Alabama|
+--------------+----------+-------+------------+

Example 2: Create a PySpark GitHub DataSource using variants

To demonstrate the use of variants in a PySpark DataSource, this example creates a data source that reads pull requests from GitHub.

note

Variants are supported with PySpark custom data sources in Databricks Runtime 17.1 and above.

For information about variants, see Query variant data.

Step 1: Define the GitHub DataSource

First, define your new PySpark GitHub DataSource as a subclass of DataSource with a name, schema, and method reader(). The schema includes these fields: id, title, user, created_at, updated_at. The user field is defined as a variant.

Python
import json
import requests

from pyspark.sql import Row
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import VariantVal

class GithubVariantDataSource(DataSource):
    """
    An example data source that reads pull requests from GitHub.
    """

    @classmethod
    def name(cls):
        return "githubVariant"

    def schema(self):
        return "id int, title string, user variant, created_at string, updated_at string"

    def reader(self, schema):
        return GithubVariantPullRequestReader(self.options)

Step 2: Implement the reader to retrieve pull requests

Next, implement the reader logic to retrieve pull requests from the specified GitHub repository.

Python
class GithubVariantPullRequestReader(DataSourceReader):
    def __init__(self, options):
        self.token = options.get("token")
        self.repo = options.get("path")
        if self.repo is None:
            raise Exception("Must specify a repo in `.load()` method.")

    def read(self, partition):
        header = {
            "Accept": "application/vnd.github+json",
        }
        if self.token is not None:
            header["Authorization"] = f"Bearer {self.token}"
        url = f"https://api.github.com/repos/{self.repo}/pulls"
        # Pass the headers so that the Accept type and the optional token are sent.
        response = requests.get(url, headers=header)
        response.raise_for_status()
        prs = response.json()
        for pr in prs:
            yield Row(
                id = pr.get("number"),
                title = pr.get("title"),
                user = VariantVal.parseJson(json.dumps(pr.get("user"))),
                created_at = pr.get("created_at"),
                updated_at = pr.get("updated_at")
            )
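
The header construction and the per-record mapping in the reader can be exercised without a network call or a Spark session. The following sketch keeps the nested `user` object as a JSON string, which is the value the reader above parses into a variant. The helpers `build_headers` and `pr_to_record` and the sample payload are hypothetical and exist only for illustration.

```python
import json


def build_headers(token=None):
    """Build the request headers, adding Authorization only when a token is set."""
    headers = {"Accept": "application/vnd.github+json"}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def pr_to_record(pr):
    """Map one pull request dictionary to the five fields in the schema."""
    return {
        "id": pr.get("number"),
        "title": pr.get("title"),
        # Nested JSON is kept as a string here.
        "user": json.dumps(pr.get("user")),
        "created_at": pr.get("created_at"),
        "updated_at": pr.get("updated_at"),
    }


# Hypothetical payload with only the keys that the schema uses.
sample_pr = {
    "number": 1,
    "title": "Example pull request",
    "user": {"login": "octocat", "id": 1},
    "created_at": "2025-01-01T00:00:00Z",
    "updated_at": "2025-01-02T00:00:00Z",
}

record = pr_to_record(sample_pr)
print(build_headers("abc")["Authorization"])  # Bearer abc
print(record["user"])  # {"login": "octocat", "id": 1}
```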

Step 3: Register and use the data source

To use the data source, register it. The following example registers the data source, then loads it and outputs three rows of the GitHub repository PR data:

Python
spark.dataSource.register(GithubVariantDataSource)
spark.read.format("githubVariant").load("apache/spark").limit(3).display()

Output
+---------+-----------------------------------------------------+---------------------+----------------------+----------------------+
| id      | title                                               | user                | created_at           | updated_at           |
+---------+-----------------------------------------------------+---------------------+----------------------+----------------------+
|   51293 |[SPARK-52586][SQL] Introduce AnyTimeType             |  {"avatar_url":...} | 2025-06-26T09:20:59Z | 2025-06-26T15:22:39Z |
|   51292 |[WIP][PYTHON] Arrow UDF for aggregation              |  {"avatar_url":...} | 2025-06-26T07:52:27Z | 2025-06-26T07:52:37Z |
|   51290 |[SPARK-50686][SQL] Hash to sort aggregation fallback |  {"avatar_url":...} | 2025-06-26T06:19:58Z | 2025-06-26T06:20:07Z |
+---------+-----------------------------------------------------+---------------------+----------------------+----------------------+

Example 3: Create PySpark DataSource for streaming read and write

To demonstrate PySpark DataSource stream reader and writer capabilities, create an example data source that generates two rows in every microbatch using the faker Python package. For more information about faker, see the Faker documentation.

Install the faker package using the following command:

Python
%pip install faker

Step 1: Define the example DataSource

First, define your new PySpark DataSource as a subclass of DataSource with a name, schema, and methods streamReader() and streamWriter().

Python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, SimpleDataSourceStreamReader, DataSourceStreamWriter
from pyspark.sql.types import StructType

class FakeStreamDataSource(DataSource):
    """
    An example data source for streaming read and write using the `faker` library.
    """

    @classmethod
    def name(cls):
        return "fakestream"

    def schema(self):
        return "name string, state string"

    def streamReader(self, schema: StructType):
        return FakeStreamReader(schema, self.options)

    # If you don't need partitioning, you can implement the simpleStreamReader method instead of streamReader.
    # def simpleStreamReader(self, schema: StructType):
    #    return SimpleStreamReader()

    def streamWriter(self, schema: StructType, overwrite: bool):
        return FakeStreamWriter(self.options)

Step 2: Implement the stream reader

Next, implement the example streaming data reader that generates two rows in every microbatch. You can implement DataSourceStreamReader, or if the data source has low throughput and doesn't require partitioning, you can implement SimpleDataSourceStreamReader instead. Either simpleStreamReader() or streamReader() must be implemented, and simpleStreamReader() is only invoked when streamReader() is not implemented.

DataSourceStreamReader implementation

The FakeStreamReader instance has an integer offset that increases by 2 in every microbatch, implemented with the DataSourceStreamReader interface.

Python
from pyspark.sql.datasource import InputPartition
from typing import Iterator, Tuple
import os
import json

class RangePartition(InputPartition):
    def __init__(self, start, end):
        self.start = start
        self.end = end

class FakeStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.current = 0

    def initialOffset(self) -> dict:
        """
        Returns the initial start offset of the reader.
        """
        return {"offset": 0}

    def latestOffset(self) -> dict:
        """
        Returns the current latest offset that the next microbatch will read to.
        """
        self.current += 2
        return {"offset": self.current}

    def partitions(self, start: dict, end: dict):
        """
        Plans the partitioning of the current microbatch defined by start and end offset. It
        needs to return a sequence of :class:`InputPartition` objects.
        """
        return [RangePartition(start["offset"], end["offset"])]

    def commit(self, end: dict):
        """
        This is invoked when the query has finished processing data before end offset. This
        can be used to clean up the resource.
        """
        pass

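    def read(self, partition) -> Iterator[Tuple]:
        """
        Takes a partition as an input and reads an iterator of tuples from the data source.
        """
        # Library imports must be within the method.
        from faker import Faker
        fake = Faker()

        start, end = partition.start, partition.end
        for i in range(start, end):
            # Seed with the offset so that replaying a batch regenerates the same row.
            fake.seed_instance(i)
            yield (fake.name(), fake.state())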
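
To see how the offsets drive each microbatch, the following sketch simulates the call sequence in plain Python. It is a simplified model and not how Spark schedules work: the loop assumes that each batch starts at the previous end offset and reads up to the value returned by `latestOffset()`. `OffsetReader` and `run_microbatches` are hypothetical names, and each row holds only its offset so that the ranges are visible.

```python
class RangePartition:
    """Plain stand-in for the InputPartition subclass defined above."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class OffsetReader:
    """Hypothetical reader with the same method names as FakeStreamReader."""

    def __init__(self):
        self.current = 0

    def initialOffset(self):
        return {"offset": 0}

    def latestOffset(self):
        # Each call advances the latest offset by two rows.
        self.current += 2
        return {"offset": self.current}

    def partitions(self, start, end):
        return [RangePartition(start["offset"], end["offset"])]

    def read(self, partition):
        for i in range(partition.start, partition.end):
            yield (i,)


def run_microbatches(reader, num_batches):
    """Simplified driver loop: read from the previous end offset to the latest."""
    start = reader.initialOffset()
    batches = []
    for _ in range(num_batches):
        end = reader.latestOffset()
        rows = [row for p in reader.partitions(start, end) for row in reader.read(p)]
        batches.append(rows)
        start = end
    return batches


batches = run_microbatches(OffsetReader(), 3)
print(batches)  # [[(0,), (1,)], [(2,), (3,)], [(4,), (5,)]]
```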

SimpleDataSourceStreamReader implementation

The SimpleStreamReader instance is the same as the FakeStreamReader instance that generates two rows in every batch, but implemented with the SimpleDataSourceStreamReader interface without partitioning.

Python
class SimpleStreamReader(SimpleDataSourceStreamReader):
    def initialOffset(self):
        """
        Returns the initial start offset of the reader.
        """
        return {"offset": 0}

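    def _rows(self, start_idx: int, end_idx: int) -> Iterator[Tuple]:
        """
        Generates one (name, state) tuple per offset. Library imports must be within the method.
        """
        from faker import Faker
        fake = Faker()
        rows = []
        for i in range(start_idx, end_idx):
            # Seed with the offset so that a replayed batch regenerates the same rows.
            fake.seed_instance(i)
            rows.append((fake.name(), fake.state()))
        return iter(rows)

    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
        """
        Takes start offset as an input, then returns an iterator of tuples and the start offset of the next read.
        """
        start_idx = start["offset"]
        return (self._rows(start_idx, start_idx + 2), {"offset": start_idx + 2})

    def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]:
        """
        Takes start and end offset as inputs, then reads an iterator of data deterministically.
        This is called when the query replays batches during restart or after a failure.
        """
        return self._rows(start["offset"], end["offset"])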

    def commit(self, end):
        """
        This is invoked when the query has finished processing data before end offset. This can be used to clean up resources.
        """
        pass
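
The contract between `read()` and `readBetweenOffsets()` is that a replay over the same offsets returns the same rows. The following sketch checks that property in plain Python. `OffsetOnlyReader` is a hypothetical stand-in that emits only the offset values, so it needs neither Spark nor faker.

```python
class OffsetOnlyReader:
    """Hypothetical reader with the same method names as SimpleStreamReader."""

    def initialOffset(self):
        return {"offset": 0}

    def read(self, start):
        # Read two rows, then report where the next read should start.
        end = {"offset": start["offset"] + 2}
        return self.readBetweenOffsets(start, end), end

    def readBetweenOffsets(self, start, end):
        # Depends only on the offsets, so a replay returns identical rows.
        return iter([(i,) for i in range(start["offset"], end["offset"])])


reader = OffsetOnlyReader()
start = reader.initialOffset()
rows_iter, end = reader.read(start)
first_pass = list(rows_iter)
replayed = list(reader.readBetweenOffsets(start, end))

print(first_pass)              # [(0,), (1,)]
print(first_pass == replayed)  # True
print(end)                     # {'offset': 2}
```

Routing `read()` through `readBetweenOffsets()` is a design choice in this sketch that keeps the two code paths from drifting apart.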

Step 3: Implement the stream writer

Now implement the streaming writer. This streaming data writer writes the metadata information of each microbatch to a local path.

Python
import json
import os

from pyspark.sql.datasource import DataSourceStreamWriter, WriterCommitMessage

class SimpleCommitMessage(WriterCommitMessage):
    def __init__(self, partition_id: int, count: int):
        self.partition_id = partition_id
        self.count = count

class FakeStreamWriter(DataSourceStreamWriter):
    def __init__(self, options):
        self.options = options
        self.path = self.options.get("path")
        assert self.path is not None

    def write(self, iterator):
        """
        Writes the data and then returns the commit message for that partition. Library imports must be within the method.
        """
        from pyspark import TaskContext
        context = TaskContext.get()
        partition_id = context.partitionId()
        # This example only counts the rows in the partition.
        cnt = 0
        for row in iterator:
            cnt += 1
        return SimpleCommitMessage(partition_id=partition_id, count=cnt)

    def commit(self, messages, batchId) -> None:
        """
        Receives a sequence of :class:`WriterCommitMessage` when all write tasks have succeeded, then decides what to do with it.
        In this FakeStreamWriter, the metadata of the microbatch (number of rows and partitions) is written into a JSON file inside commit().
        """
        status = dict(num_partitions=len(messages), rows=sum(m.count for m in messages))
        with open(os.path.join(self.path, f"{batchId}.json"), "a") as file:
            file.write(json.dumps(status) + "\n")

    def abort(self, messages, batchId) -> None:
        """
        Receives a sequence of :class:`WriterCommitMessage` from successful tasks when some other tasks have failed, then decides what to do with it.
        In this FakeStreamWriter, a failure message is written into a text file inside abort().
        """
        with open(os.path.join(self.path, f"{batchId}.txt"), "w") as file:
            file.write(f"failed in batch {batchId}")
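
The write-then-commit flow can be tried locally without a streaming query. In the following sketch, each partition produces a commit message with its row count, and the commit step aggregates the messages into one JSON line per batch. `CommitMessage`, `write_partition`, and `commit_batch` are hypothetical stand-ins, and a temporary directory replaces the `path` option.

```python
import json
import os
import tempfile


class CommitMessage:
    """Hypothetical stand-in for SimpleCommitMessage."""

    def __init__(self, partition_id, count):
        self.partition_id = partition_id
        self.count = count


def write_partition(partition_id, rows):
    """Count the rows in one partition and return its commit message."""
    count = sum(1 for _ in rows)
    return CommitMessage(partition_id, count)


def commit_batch(path, messages, batch_id):
    """Append a one-line JSON summary of the batch and return the file path."""
    status = {"num_partitions": len(messages), "rows": sum(m.count for m in messages)}
    target = os.path.join(path, f"{batch_id}.json")
    with open(target, "a") as file:
        file.write(json.dumps(status) + "\n")
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Two partitions: one with two rows and one with a single row.
    messages = [write_partition(0, ["a", "b"]), write_partition(1, ["c"])]
    target = commit_batch(tmp, messages, batch_id=0)
    with open(target) as file:
        summary = json.loads(file.readline())

print(summary)  # {'num_partitions': 2, 'rows': 3}
```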

Step 4: Register and use the example data source

To use the data source, register it. After it is registered, you can use it in streaming queries as a source or sink by passing a short name or full name to format(). The following example registers the data source, then starts a query that reads from the example data source and outputs to the console:

Python
spark.dataSource.register(FakeStreamDataSource)
query = spark.readStream.format("fakestream").load().writeStream.format("console").start()

Alternatively, the following example uses the example stream as a sink and specifies an output path:

Python
query = spark.readStream.format("fakestream").load().writeStream.format("fakestream").start("/output_path")

Troubleshooting

If the output is the following error, your compute does not support PySpark custom data sources. You must use Databricks Runtime 15.4 LTS or above, or serverless environment version 2.

Error: [UNSUPPORTED_FEATURE.PYTHON_DATA_SOURCE] The feature is not supported: Python data sources. SQLSTATE: 0A000
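
If you want to check the runtime before registering a data source, a version comparison along the following lines may help. This is a sketch under assumptions: the helper `meets_minimum_runtime` is hypothetical, the minimum is a parameter that defaults to the 15.4 stated in the requirements note at the top of this page, and the version string is assumed to start with `major.minor`. It does not cover serverless environment versions.

```python
def meets_minimum_runtime(version: str, minimum=(15, 4)) -> bool:
    """Compare the leading major.minor of a version string with a minimum."""
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) >= tuple(minimum)


print(meets_minimum_runtime("14.3"))  # False
print(meets_minimum_runtime("15.4"))  # True
print(meets_minimum_runtime("17.1"))  # True
```

On a cluster, the version string might come from the `DATABRICKS_RUNTIME_VERSION` environment variable. Treat that source as an assumption to verify in your own workspace.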
