MySQL Binary Log Group Commit

1. What is binary log?

MySQL is a multi-engine architecture database, which means it supports more than one storage engine during transaction processing.

The Binary Log is a server-level log that records every modification issued by the client at the storage engine level. This log is an important facility in the MySQL ecosystem for propagating data modifications to replicas or other systems. Therefore, it is crucial to ensure the accuracy of the binary log in order to maintain consistency between replicas or other systems.

The criterion for determining the correctness of the binary log is that all records in the binary log must be consistent with the associated storage engines. Any modifications made in the binary logs must be committed in the corresponding storage engines.

2. How the binary log works and why it becomes a bottleneck?

First, let’s understand how and when the binary log is written. (The procedure described below applies to MySQL 5.6 and earlier versions.)

Every connection handling thread has a designated cache for binary log records. During transaction processing, before committing, every successfully executed statement issued by the client will generate binary log content and write it to this buffer.

After client issuing commit, innodb as the storage engine, steps is listed below:

  1. Prepare in innodb, write innodb prepare log and sync.
  2. Write binlog content and append commit mark in binary log and invoke fsync().
  3. Commit in innodb, write innodb commit log and sync.

During crash-safe operation, if a crash occurs between steps 2 and 3, any transaction that is prepared but not committed in InnoDB will be examined based on the binary log. If the transaction is marked as committed in the binary log, InnoDB will re-commit the transaction. Otherwise, it will rollback the transaction. Binary log just work as a Write-ahead log between server and storage engines.

Why does writing to the binlog become a performance bottleneck?

Every committed transaction needs to write to the binlog on disk (if durability is crucial) and the storage engine log (i.e., redo log of InnoDB). Both the binlog and redo log must maintain the same order, which is a requirement for InnoDB hot backup. To meet this requirement, a global mutex called prepare_commit_mutex exists. The prepare_commit_mutex is held during the prepare phase of the InnoDB transaction and released upon commit. Since this mutex is global, concurrent transactions committing will be serialized on this mutex, ensuring that there is no chance of misordering between the binlog and InnoDB redo log.

Another critical resource is the binlog file handler. Since there is only one thread that can hold this handler and write the binlog content generated from local transactions, and flush this content to disk using the handler, another mutex-like object should be used for this purpose. This mutex contention may be happend between non-transactional query and transaction-query.

The binary log operation mentioned above causes disk IO, which can be expensive. Specifically, each transaction may result in a disk sync operation if the sync_binlog parameter is set to 1.

3. Group Commit as the Solution

3.1 High-level principle of Group Commit

The main reason why binlog can become a bottleneck when MySQL server traffic is heavy is due to critical resource contention and redundant disk IO of binary log if it is enabled.

Usually, compacting multiple disk sync operations into one is useful. Therefore, the high-level principle can be to consolidate these operations.

The requirement for maintaining the identical order of binlog transactions and InnoDB transactions cannot be broken. However, instead of using mutex-like operating system facilities, using a queue can achieve the same consequence.

So, the high-level principle of the Group commit can be summarized as follows:

  1. Use a queue-like data structure to maintain the same order of the commit transactions between the binlog and the storage engine.
  2. Compact the disk I/O operations.

3.2 Key parts of binlog operation during transactions commit

When the client queries a commit statement, these tasks should be done sequentially:

  1. Fill the appropriate content into the binlog cache processed by the thread.
  2. Flush the localized binlog cache to the binlog handler.
  3. Invoke the sync() function of the binlog handler to write the binlog content to disk, ensuring durability.
  4. Perform commits per storage engine sequentially.

3.3 Pipelining the key stages using distinct queues

To consolidate disk operations among multiple committing transactions while preserving their order, three identical queues are suggested: the flush queue, sync queue, and commit queue. These queues represent the flush stage, sync stage, and commit stage of a committing transaction state, respectively.

When a THD issues a commit for a transaction, the thread will first join the flush queue. After completing the enqueue operation, the current THD will wait and be blocked until the entire commit is finished. If the enrolling THD is the first member of the flush queue, this thread will be assigned as the ’leader’ of the current group.

The leader thread will then perform the remaining work and notify the follower THDs until the storage engine commits.

The leader thread will flush the work for each THD in the flush queue one by one. Then, the leader thread will dequeue the flush queue and enqueue all pending THDs to the sync stage queue.

In the sync state queue, the leader thread will acquire the binlog handler mutex. After that, the leader thread will acquire the flush stage queue mutex again to prevent new THDs from coming in. Therefore, the next thread that successfully acquires the flush queue mutex will become the new leader.

After invoking the fsync() function of the binlog handler, all the storage engines should be committed in order. Consequently, the THDs will be moved from the sync stage queue to the commit stage queue and perform the engine commit one by one.

The sync_binlog server parameter controls the frequency of fsync() operations. In reality, this parameter functions as a counter. In MySQL 5.6 and previous versions, the fsync() function is invoked every sync_binlog number of transaction commits.

However, in the group commit implementation, sync_binlog indicates the number of leader threads that perform the sync operation before actually performing the fsync(). In other words, this value represents the group number rather than the transaction number.