DataSource
A base class for data sources.
This class represents a custom data source that supports reading, writing, or both. It provides methods to create readers and writers for the data. A subclass must implement at least one of reader() or writer() to make the data source readable or writable (or both).
After implementing this interface, you can load your data source using spark.read.format(...).load() and save data using df.write.format(...).save().
Syntax
```python
from pyspark.sql.datasource import DataSource

class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my_data_source"
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `options` | dict | A case-insensitive dictionary representing the options for this data source. |
Methods
| Method | Description |
|---|---|
| `name()` | Returns a string representing the format name of this data source. By default, returns the class name. Override to provide a customized short name. |
| `schema()` | Returns the schema of the data source as a `StructType` or a DDL-formatted string. |
| `reader(schema)` | Returns a `DataSourceReader` instance for reading data in batch queries. |
| `writer(schema, overwrite)` | Returns a `DataSourceWriter` instance for writing data in batch queries. |
| `streamReader(schema)` | Returns a `DataSourceStreamReader` instance for reading data in streaming queries. |
| `streamWriter(schema, overwrite)` | Returns a `DataSourceStreamWriter` instance for writing data in streaming queries. |
| `simpleStreamReader(schema)` | Returns a `SimpleDataSourceStreamReader` instance, a simpler streaming reader that does not require partition planning. |
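Taken together, the constructor receives the options dictionary and the methods above define the read/write surface. Below is a minimal sketch of how a subclass typically fits these pieces together. The `DataSource` base class is stubbed as a plain Python class so the sketch runs without pyspark installed (in real code it comes from `pyspark.sql.datasource`); the `path` option and `fastcsv` name are illustrative, not part of the API.

```python
# Stand-in for pyspark.sql.datasource.DataSource so this sketch runs
# standalone; the real base class also stores the options dictionary.
class DataSource:
    def __init__(self, options):
        self.options = options  # case-insensitive dict in real Spark

class FastCsvDataSource(DataSource):
    @classmethod
    def name(cls):
        return "fastcsv"  # used as spark.read.format("fastcsv")

    def schema(self):
        # Either a DDL string (shown) or a StructType works.
        return "a INT, b STRING"

    def reader(self, schema):
        raise NotImplementedError  # return a DataSourceReader for batch reads

src = FastCsvDataSource({"path": "/data/input.csv"})
print(src.name(), src.schema())  # fastcsv a INT, b STRING
```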
Examples
Define and register a custom readable data source:
```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my_data_source"

    def schema(self):
        return "a INT, b STRING"

    def reader(self, schema):
        return MyDataSourceReader(schema)

class MyDataSourceReader(DataSourceReader):
    def __init__(self, schema):
        self.schema = schema

    def read(self, partition):
        yield (1, "hello")
        yield (2, "world")

spark.dataSource.register(MyDataSource)
df = spark.read.format("my_data_source").load()
df.show()
```
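To see what the reader contributes without a Spark session, the generator in read() can be consumed directly. This standalone sketch mirrors MyDataSourceReader above, with the base class stubbed out so it runs without pyspark:

```python
class MyDataSourceReader:  # stand-in; would subclass DataSourceReader
    def read(self, partition):
        # Spark calls this once per input partition on executors;
        # each yielded tuple becomes one row of the DataFrame.
        yield (1, "hello")
        yield (2, "world")

rows = list(MyDataSourceReader().read(partition=None))
print(rows)  # [(1, 'hello'), (2, 'world')]
```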
Define a data source with a StructType schema:
```python
from pyspark.sql.datasource import DataSource
from pyspark.sql.types import StructType

class MyDataSource(DataSource):
    def schema(self):
        return StructType().add("a", "int").add("b", "string")
```
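The examples above cover the read path; as the overview notes, a source becomes writable by implementing writer(). Below is a minimal sketch of that side. The pyspark base classes are stubbed as plain classes so it runs standalone, and the row count returned by write() stands in for the WriterCommitMessage a real writer returns; the `path` option is illustrative.

```python
class MyDataSourceWriter:  # stand-in; would subclass DataSourceWriter
    def __init__(self, path):
        self.path = path

    def write(self, iterator):
        # Called once per partition with an iterator of rows. A real writer
        # would persist the rows and return a WriterCommitMessage; here we
        # just count them.
        return sum(1 for _ in iterator)

class MyWritableDataSource:  # stand-in; would subclass DataSource
    def __init__(self, options):
        self.options = options

    def writer(self, schema, overwrite):
        # 'overwrite' is True for mode("overwrite"), False for append.
        return MyDataSourceWriter(self.options.get("path"))

writer = MyWritableDataSource({"path": "/tmp/out"}).writer("a INT, b STRING", overwrite=False)
print(writer.write(iter([(1, "hello"), (2, "world")])))  # 2
```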