Configure MySQL on Amazon EC2 for ingestion
The MySQL connector is in Public Preview. Contact your Databricks account team to request access.
Learn how to configure MySQL running on Amazon EC2 for ingestion. You must enable binary logging and configure binlog retention to support change data capture.
Requirements
- MySQL version 5.7.44 and later, or 8.0 and later installed on your server.
- Permission to modify MySQL configuration files and restart the MySQL service.
Modify the MySQL configuration file
Binary logging is configured in the MySQL configuration file. The location of this file varies by operating system:
- Linux (most distributions):
/etc/my.cnfor/etc/mysql/my.cnf - Linux (Debian/Ubuntu):
/etc/mysql/mysql.conf.d/mysqld.cnf
Required configuration parameters
Add or modify the following parameters in the [mysqld] section of your MySQL configuration file:
[mysqld]
# Enable binary logging
log-bin=mysql-bin
# Set a unique server ID (required for replication)
server-id=1
# Set binlog format to ROW for change data capture
binlog_format=ROW
# Log all columns, not just changed columns
binlog_row_image=FULL
# For MySQL 8.0 and later: Set binlog retention to 7 days (604800 seconds)
binlog_expire_logs_seconds=604800
# For MySQL 5.7: Use expire_logs_days instead
# expire_logs_days=7
Server ID considerations
The server-id parameter must be a unique integer between 1 and 4294967295. If you have multiple MySQL servers, each must have a different server ID.
For a single server, you can use server-id=1. For multiple servers, you might use a convention like:
- Primary server:
server-id=1 - Replica 1:
server-id=2 - Replica 2:
server-id=3
Restart the MySQL service
After modifying the configuration file, restart the MySQL service to apply the changes:
- Linux (systemd)
- Linux (init.d)
sudo systemctl restart mysql
# or
sudo systemctl restart mysqld
sudo service mysql restart
# or
sudo /etc/init.d/mysql restart
Verify the binlog configuration
After restarting MySQL, verify that binary logging is properly configured. Connect to MySQL and run:
-- Check if binary logging is enabled
SHOW VARIABLES LIKE 'log_bin';
-- Should return:
-- +---------------+-------+
-- | Variable_name | Value |
-- +---------------+-------+
-- | log_bin | ON |
-- +---------------+-------+
-- Check server ID
SHOW VARIABLES LIKE 'server_id';
-- Should return a non-zero value:
-- +---------------+-------+
-- | Variable_name | Value |
-- +---------------+-------+
-- | server_id | 1 |
-- +---------------+-------+
-- Check binlog format
SHOW VARIABLES LIKE 'binlog_format';
-- Should return:
-- +---------------+-------+
-- | Variable_name | Value |
-- +---------------+-------+
-- | binlog_format | ROW |
-- +---------------+-------+
-- Check binlog row image
SHOW VARIABLES LIKE 'binlog_row_image';
-- Should return:
-- +------------------+-------+
-- | Variable_name | Value |
-- +------------------+-------+
-- | binlog_row_image | FULL |
-- +------------------+-------+
-- For MySQL 8.0: Check retention in seconds
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
-- For MySQL 5.7: Check retention in days
SHOW VARIABLES LIKE 'expire_logs_days';
Binlog retention considerations
Databricks recommends a binlog retention period of 7 days. Setting a lower value might cause binlogs to be cleaned up before the ingestion gateway replays them, which would require a full refresh of all tables.
Monitor your server's disk space usage. Longer retention periods require more storage for binary logs.
Read replica support
The MySQL connector supports ingesting from read replicas on EC2. To set up a read replica:
- Configure binary logging on the primary server.
- Set up MySQL replication to create a read replica.
- Verify that the read replica has binary logging enabled.
- Use the read replica's connection string when creating the Databricks connection.
There might be a replication lag between the primary server and the replica, which can affect data freshness.
Network configuration
Ensure that your MySQL server is accessible from Databricks:
- Configure firewall rules to allow inbound traffic on port 3306 (or your custom MySQL port).
- If using cloud infrastructure, configure security groups or network ACLs.
- For on-premises databases, ensure proper network connectivity through VPN, Direct Connect, or ExpressRoute.
- Consider using SSL/TLS connections for security.
Example security group rule (EC2)
Add an inbound rule to your EC2 security group:
- Type: MySQL
- Protocol: TCP
- Port Range: 3306
- Source: Databricks IP ranges
Next steps
An admin can now either:
- Use Catalog Explorer to create a connection so non-admins can create pipelines
- Use the data ingestion UI to create a connection and a pipeline