How SQLite Uses pthread_mutex and fcntl to Coordinate Reads and Writes from Multiple Threads and Processes
linuxsqliteThis article looks at the behavior of pthread_mutex and fcntl, which the SQLite client library uses to serialize reads and writes from multiple threads and processes against the file-based database.
pthread_mutex
Threads within the same process share the address space, so a mutex placed on it can be used as a lock.
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#define THREADS 4
#define ITERS 1000000
static long counter = 0;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static int use_mutex = 0;
static void *worker(void *arg) {
for (int i = 0; i < ITERS; i++) {
if (use_mutex) pthread_mutex_lock(&mu);
counter++;
if (use_mutex) pthread_mutex_unlock(&mu);
}
return NULL;
}
int main(int argc, char **argv) {
if (argc > 1 && strcmp(argv[1], "lock") == 0) use_mutex = 1;
pthread_t th[THREADS];
for (int i = 0; i < THREADS; i++) pthread_create(&th[i], NULL, worker, NULL);
for (int i = 0; i < THREADS; i++) pthread_join(th[i], NULL);
printf("counter=%ld (expected=%ld)\n", counter, (long)THREADS * ITERS);
return 0;
}
With the lock taken, the threads’ increments do not conflict and the counter reaches the expected value.
$ gcc -O2 -pthread counter.c -o counter
$ ./counter
counter=2021586 (expected=4000000)
$ ./counter lock
counter=4000000 (expected=4000000)
fcntl
fcntl is a POSIX standard system call that performs various operations on a file descriptor. Passing F_RDLCK or F_WRLCK acquires a shared or exclusive lock on l_len bytes starting at offset l_start. These are advisory locks, so a process can ignore them and read or write the file anyway.
Create a table, start a writer in the background that takes a write lock with BEGIN IMMEDIATE, sleeps for 3 seconds, and then commits. While that runs, an INSERT from another process is rejected with database is locked.
$ sqlite3 test.db 'CREATE TABLE t(v TEXT);'
$ (
{
echo "BEGIN IMMEDIATE;"
echo "INSERT INTO t VALUES('x');"
echo ".system sleep 3"
echo "COMMIT;"
} | sqlite3 test.db
) &
$ sleep 1
$ sqlite3 test.db "INSERT INTO t VALUES('blocked');"
Error: stepping, database is locked (5)
Observing the writer with strace shows F_WRLCK being taken at the BEGIN IMMEDIATE timing.
$ strace -f -e trace=fcntl,fcntl64 sh -c '...' 2>&1 | grep F_WRLCK
fcntl(4, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741825, l_len=1}) = 0
fcntl(4, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(4, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
SQLite reserves 512 bytes starting at 1073741824 (0x40000000) to represent locks.
A Reader takes a temporary F_RDLCK on PENDING_BYTE (1073741824) and then F_RDLCK on the SHARED region (510 bytes starting at 1073741826) to enter the SHARED state. F_RDLCK can be held by multiple processes at the same time, so Readers can read in parallel. The 510-byte width is apparently a remnant of the Win95/98/ME era when read and write locks could not be distinguished.
A Writer, to avoid keeping Readers waiting for long, does not take an exclusive lock until just before COMMIT and acquires locks in stages. First it enters the SHARED state like a Reader, and then takes F_WRLCK on RESERVED_BYTE (1073741825) to enter the RESERVED state. Another Writer cannot take F_WRLCK on RESERVED_BYTE so its BEGIN IMMEDIATE fails, but Readers are unaffected. Changes from INSERT/UPDATE during the transaction are held in user memory, so Readers can keep reading the original data. At COMMIT time, it first takes F_WRLCK on PENDING_BYTE to enter the PENDING state, preventing new Readers from entering the SHARED state and waiting for existing Readers to leave it, then takes F_WRLCK on the SHARED region to acquire an exclusive lock and write the data.