Skip to main content

Configure MySQL on Amazon EC2 for ingestion

Preview

The MySQL connector is in Public Preview. Contact your Databricks account team to request access.

Learn how to configure MySQL running on Amazon EC2 for ingestion. You must enable binary logging and configure binlog retention to support change data capture.

Requirements

  • MySQL version 5.7.44 and later, or 8.0 and later installed on your server.
  • Permission to modify MySQL configuration files and restart the MySQL service.

Modify the MySQL configuration file

Binary logging is configured in the MySQL configuration file. The location of this file varies by operating system:

  • Linux (most distributions): /etc/my.cnf or /etc/mysql/my.cnf
  • Linux (Debian/Ubuntu): /etc/mysql/mysql.conf.d/mysqld.cnf

Required configuration parameters

Add or modify the following parameters in the [mysqld] section of your MySQL configuration file:

ini
[mysqld]
# Enable binary logging
log-bin=mysql-bin

# Set a unique server ID (required for replication)
server-id=1

# Set binlog format to ROW for change data capture
binlog_format=ROW

# Log all columns, not just changed columns
binlog_row_image=FULL

# For MySQL 8.0 and later: Set minimum binlog retention to one day (86400 seconds). Recommended: seven days (604800 seconds)
binlog_expire_logs_seconds=604800

# For MySQL 5.7: Use expire_logs_days instead. Minimum: one day. Recommended: seven days.
# expire_logs_days=7

Server ID considerations

The server-id parameter must be a unique integer between 1 and 4294967295. If you have multiple MySQL servers, each must have a different server ID.

For a single server, you can use server-id=1. For multiple servers, you might use a convention like:

  • Primary server: server-id=1
  • Replica 1: server-id=2
  • Replica 2: server-id=3

Restart the MySQL service

After modifying the configuration file, restart the MySQL service to apply the changes:

Bash
sudo systemctl restart mysql
# or
sudo systemctl restart mysqld

Verify the binlog configuration

After restarting MySQL, verify that binary logging is properly configured. Connect to MySQL and run:

SQL
-- Check if binary logging is enabled
SHOW VARIABLES LIKE 'log_bin';

-- Should return:
-- +---------------+-------+
-- | Variable_name | Value |
-- +---------------+-------+
-- | log_bin | ON |
-- +---------------+-------+

-- Check server ID
SHOW VARIABLES LIKE 'server_id';

-- Should return a non-zero value:
-- +---------------+-------+
-- | Variable_name | Value |
-- +---------------+-------+
-- | server_id | 1 |
-- +---------------+-------+

-- Check binlog format
SHOW VARIABLES LIKE 'binlog_format';

-- Should return:
-- +---------------+-------+
-- | Variable_name | Value |
-- +---------------+-------+
-- | binlog_format | ROW |
-- +---------------+-------+

-- Check binlog row image
SHOW VARIABLES LIKE 'binlog_row_image';

-- Should return:
-- +------------------+-------+
-- | Variable_name | Value |
-- +------------------+-------+
-- | binlog_row_image | FULL |
-- +------------------+-------+

-- For MySQL 8.0: Check retention in seconds
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- For MySQL 5.7: Check retention in days
SHOW VARIABLES LIKE 'expire_logs_days';

Binlog retention considerations

The minimum binlog retention period is one day (24 hours or 86400 seconds). Databricks recommends a binlog retention period of seven days (168 hours or 604800 seconds). If you set a lower value, binlogs might be deleted before the ingestion gateway processes them, requiring a full refresh of all tables.

A full refresh is also required if the gateway cannot continuously process binlogs. If the gateway is stopped for longer than binlog_expire_logs_seconds, binlogs might be deleted before ingestion resumes.

Monitor your server's disk space usage. Longer retention periods require more storage for binary logs.

Read replica support

The MySQL connector supports ingesting from read replicas on EC2. To set up a read replica:

  1. Configure binary logging on the primary server.
  2. Set up MySQL replication to create a read replica.
  3. Verify that the read replica has binary logging enabled.
  4. Use the read replica's connection string when creating the Databricks connection.
note

There might be a replication lag between the primary server and the replica, which can affect data freshness.

Network configuration

Ensure that your MySQL server is accessible from Databricks:

  • Configure firewall rules to allow inbound traffic on port 3306 (or your custom MySQL port).
  • If using cloud infrastructure, configure security groups or network ACLs.
  • For on-premises databases, ensure proper network connectivity through VPN, Direct Connect, or ExpressRoute.
  • Consider using SSL/TLS connections for security.

Example security group rule (EC2)

Add an inbound rule to your EC2 security group:

  • Type: MySQL
  • Protocol: TCP
  • Port Range: 3306
  • Source: Databricks IP ranges

Next steps

An admin can now either: