The Kinesis connector for Structured Streaming is packaged in Databricks Runtime 3.0 and above and Spark 2.1.1-db5+.
The schema of the records is:
Use DataFrame operations (cast("string"), UDFs) to explicitly deserialize the record data.
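As a minimal sketch of the UDF route, the function below shows the kind of logic such a UDF might wrap. It assumes that each record payload is a UTF-8 encoded JSON object, which this topic does not state, and decode_payload is a hypothetical name chosen for illustration.

```python
import json
from typing import Any, Dict, Optional


def decode_payload(data: Optional[bytes]) -> Optional[Dict[str, Any]]:
    """Decode one binary record payload into a dict.

    Assumption: payloads are UTF-8 encoded JSON objects.
    Returns None for missing or malformed payloads instead of raising,
    so that one bad record does not fail the whole batch.
    """
    if data is None:
        return None
    try:
        parsed = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    # Only JSON objects are accepted; arrays and scalars are treated as malformed.
    return parsed if isinstance(parsed, dict) else None


print(decode_payload(b'{"word": "spark", "count": 3}'))
```

In a notebook, a function like this could be registered with pyspark.sql.functions.udf and applied to the binary payload; for plain-text payloads, cast("string") alone is usually enough.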
Let’s start with a quick example: WordCount. The following notebook demonstrates how to run WordCount using Structured Streaming with Kinesis.
Kinesis WordCount with Structured Streaming
|Option||Value||Default||Description|
|streamName||A comma-separated list of stream names.||None (required param)||The stream names to subscribe to.|
|region||Region for the streams to be specified.||Locally resolved region||The region the streams are defined in.|
|initialPosition||latest, trim_horizon, earliest (alias for trim_horizon)||latest||Where to start reading from in the stream.|
|maxRecordsPerFetch||A positive integer.||10,000||How many records to read per API request to Kinesis. The number of records returned may actually be higher, depending on whether sub-records were aggregated into a single record using the Kinesis Producer Library.|
|maxFetchRate||How fast to fetch data from Kinesis, in MB/s per shard.||1.0||Fetching is rate limited to this value to avoid ProvisionedThroughputExceededExceptions.|
|maxFetchDuration||A duration string, for example, 2m for 2 minutes.||10s||How long to fetch new data for asynchronously per Spark task.|
|fetchBufferSize||A byte string, for example, 2gb or 10mb.||20gb||How much data to buffer for the next trigger. This is used as a stopping condition and not a strict upper bound, therefore more data may be buffered than what’s specified for this value.|
|shardsPerTask||A positive integer.||5||How many Kinesis shards to read from in parallel per Spark task.|
|shardFetchInterval||A duration string, for example, 2m for 2 minutes.||1s||How often to poll Kinesis for resharding.|
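The options in the table are passed to the source as string key-value pairs. The following is a minimal sketch of assembling and validating them before starting a stream. The helper name kinesis_source_options, the stream names, and the region are placeholders for illustration, not part of the connector.

```python
from typing import Dict, Iterable

# Values accepted by initialPosition, per the table above.
VALID_INITIAL_POSITIONS = {"latest", "trim_horizon", "earliest"}


def kinesis_source_options(
    stream_names: Iterable[str],
    region: str,
    initial_position: str = "latest",
    **extra: object,
) -> Dict[str, str]:
    """Build a string-valued option map for the Kinesis source."""
    names = [name.strip() for name in stream_names if name.strip()]
    if not names:
        raise ValueError("streamName is required: pass at least one stream name")
    if initial_position not in VALID_INITIAL_POSITIONS:
        raise ValueError(f"unsupported initialPosition: {initial_position!r}")
    options = {
        "streamName": ",".join(names),  # comma-separated list of streams
        "region": region,
        "initialPosition": initial_position,
    }
    # Tuning options such as maxFetchDuration pass through as strings.
    options.update({key: str(value) for key, value in extra.items()})
    return options


opts = kinesis_source_options(["clicks", "orders"], "us-west-2", maxFetchDuration="30s")
print(opts)
# On a cluster that includes the connector, the map could be used as:
# spark.readStream.format("kinesis").options(**opts).load()
```

Validating initialPosition up front surfaces a typo before the query starts rather than at stream start-up.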
Depending on your use case, here is how you might go about configuring some of these parameters:
- ETL from Kinesis to S3
- When you’re performing ETL into long-term storage, you would prefer to have a small number of large files. In this case, you may want to set a large stream trigger interval, for example, 5-10 minutes. In addition, you may want to increase maxFetchDuration so that you buffer large blocks that will be written out during processing, and increase fetchBufferSize so that you don’t stop fetching too early between triggers and start falling behind in your stream.
- Monitoring and alerting
- When you have an alerting use case, you would want lower latency. To achieve that, you may set maxFetchDuration to a small value so that fetched data becomes available to your stream as fast as possible.
If you have multiple consumers reading from Kinesis, be sure to adjust maxFetchRate accordingly. As you decrease maxFetchRate, you may increase the number of shards read per Spark task to increase the utilization of your resources. For the best performance, we recommend using a cluster with number of CPUs >= (total number of Kinesis shards) / (number of shards read per task).
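The cluster sizing rule of thumb above can be worked through with a short calculation. This sketch assumes that the divisor is the number of shards each Spark task reads in parallel, that is, the shardsPerTask option from the table (default 5). The function name min_cores is hypothetical.

```python
import math


def min_cores(total_shards: int, shards_per_task: int = 5) -> int:
    """Smallest CPU count satisfying cores >= total_shards / shards_per_task."""
    if total_shards < 1 or shards_per_task < 1:
        raise ValueError("both arguments must be positive integers")
    # Round up: a fractional core requirement still needs a whole core.
    return math.ceil(total_shards / shards_per_task)


print(min_cores(64))                     # 64 shards, default of 5 shards per task
print(min_cores(64, shards_per_task=8))  # fewer tasks when each reads more shards
```

With 64 shards, the default of 5 shards per task calls for at least 13 cores, while 8 shards per task brings that down to 8.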
The execute once trigger (Trigger.Once()) is not supported with Kinesis due to rate limiting performed by Kinesis and limitations in the Kinesis API.
For authentication with Kinesis, we use Amazon’s default credential provider chain by default.
We recommend launching your Databricks clusters with an IAM Role that can access Kinesis. If you want to use keys for access, you can provide them as source options.
You can also assume an IAM Role using the roleArn option. You can optionally specify the external ID with roleExternalId and a session name with roleSessionName. In order to assume a role, you can either launch your cluster with permissions to assume the role or provide access keys through options such as awsSecretKey. For cross-account authentication, we recommend using roleArn to hold the assumed role, which can then be assumed through your Databricks AWS account. For more information about cross-account authentication, see Delegate Access Across AWS Accounts Using IAM Roles. The ability to assume roles requires Databricks Runtime 3.5 and above.
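As a minimal sketch, the role-related options can be collected into a map that includes only the values you actually set. The helper name assume_role_options and the ARN, external ID, and session name shown are placeholders, not real resources.

```python
from typing import Dict, Optional


def assume_role_options(
    role_arn: str,
    external_id: Optional[str] = None,
    session_name: Optional[str] = None,
) -> Dict[str, str]:
    """Build the option map for assuming an IAM Role with the Kinesis source."""
    if not role_arn.startswith("arn:"):
        raise ValueError("roleArn should be a full ARN, for example arn:aws:iam::...")
    options = {"roleArn": role_arn}
    # The external ID and session name are optional, so omit them when unset.
    if external_id is not None:
        options["roleExternalId"] = external_id
    if session_name is not None:
        options["roleSessionName"] = session_name
    return options


role_opts = assume_role_options(
    "arn:aws:iam::123456789012:role/kinesis-reader",  # placeholder ARN
    external_id="example-external-id",
    session_name="streaming-job",
)
print(role_opts)
```

The resulting map can be merged with the other source options before they are passed to the stream reader.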
The Kinesis Source requires GetShardIterator permissions. If you hit Amazon: Access Denied exceptions, double-check that your user or profile has these permissions. See Controlling Access to Amazon Kinesis Data Streams Resources Using IAM for more details.