Tips & Caveats

Running an application that uses SQLite and Litestream can require some small tweaks to optimize performance and usability. There are also some important caveats to know.

Busy timeout

SQLite is built to run as a multi-process embedded database, and Litestream acts as just another process. However, Litestream requires a brief write lock on the database whenever it checkpoints. By default, SQLite will return a SQLITE_BUSY error if your application tries to obtain a write lock at the same time.

To prevent this, your application should set the busy_timeout pragma when it initializes the database connection. This pragma tells SQLite to wait up to the given number of milliseconds for a lock before failing a query. Setting it to 5 seconds is recommended:

PRAGMA busy_timeout = 5000;
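As a sketch, the pragma can be applied right after opening a connection. This example uses Python's built-in sqlite3 driver; the "app.db" path is a placeholder:

```python
import sqlite3

# Open the database; "app.db" is a placeholder path.
conn = sqlite3.connect("app.db")

# Wait up to 5 seconds for a competing write lock (such as a
# Litestream checkpoint) before failing a query with SQLITE_BUSY.
conn.execute("PRAGMA busy_timeout = 5000;")
```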

WAL journal mode

Litestream only works with the SQLite WAL journaling mode. This is the recommended mode for most applications as it generally performs better and allows for concurrent read access. Litestream will automatically set the database mode to WAL if it has not already been enabled by the application.
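If you prefer that your application enable WAL itself rather than relying on Litestream to do it, a minimal sketch with Python's sqlite3 driver (again with a placeholder path) looks like:

```python
import sqlite3

conn = sqlite3.connect("app.db")  # "app.db" is a placeholder path

# journal_mode is stored in the database file itself, so this only
# needs to succeed once; SQLite reports the active mode back as a string.
mode = conn.execute("PRAGMA journal_mode = WAL;").fetchone()[0]
```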

Deleting SQLite databases

If you’re deleting and recreating a SQLite database from scratch, there are three files that must be deleted:

  • Database file
  • Shared memory file (-shm)
  • WAL file (-wal)

If you delete your database file but not your WAL file, SQLite will try to apply the old WAL pages to your new database. Litestream also tracks changes via the WAL, so a leftover WAL file can cause replication issues.

Additionally, Litestream currently does not track database deletions. If you remove your database and recreate it, you should delete the -litestream directory next to your database file and restart Litestream.
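Taken together, a full teardown can be sketched as a small helper. The function name and paths here are illustrative; the "-litestream" directory name follows the convention described above:

```python
import shutil
from pathlib import Path

def delete_sqlite_database(db_path: str) -> None:
    """Remove a SQLite database, its sidecar files, and Litestream's
    local state directory so the database can be recreated cleanly."""
    # Delete the database file plus its -wal and -shm sidecar files.
    for suffix in ("", "-wal", "-shm"):
        Path(db_path + suffix).unlink(missing_ok=True)
    # Litestream keeps local replication state next to the database.
    shutil.rmtree(db_path + "-litestream", ignore_errors=True)
```

Remember to restart Litestream afterwards so it starts fresh for the new database.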

Synchronous PRAGMA

SQLite must call fsync() to flush data to disk to ensure transactions are durable. In WAL journaling mode, these fsync() calls can be relaxed to improve performance, trading away some durability of recent writes without risking data corruption.

To do this, set the synchronous pragma to NORMAL:

PRAGMA synchronous = NORMAL;

In this mode, fsync() is only called when the WAL becomes full and pages must be checkpointed back to the main database file. This is safe because the WAL file is append-only.
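A sketch of applying this in application code, again with Python's sqlite3 driver and a placeholder path. Note that SQLite reports the synchronous level back as a number (0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA):

```python
import sqlite3

conn = sqlite3.connect("app.db")            # "app.db" is a placeholder path
conn.execute("PRAGMA journal_mode = WAL;")  # NORMAL is only safe in WAL mode
conn.execute("PRAGMA synchronous = NORMAL;")

# Read the setting back; 1 corresponds to NORMAL.
level = conn.execute("PRAGMA synchronous;").fetchone()[0]
```

Unlike journal_mode, the synchronous setting is per-connection, so each connection your application opens should set it.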

Data loss window

Litestream performs asynchronous replication which means that changes are replicated out-of-band from the transaction that wrote the changes. This is how many replication tools work including PostgreSQL’s log-shipping replication. Asynchronous replication is typically much faster than synchronous replication but it trades off durability of recent writes.

By default, Litestream replicates new changes to an S3 replica every second. If your server crashes catastrophically before recently written data has been replicated, the data written in that window is lost.

For more typical shutdown scenarios, when Litestream receives a signal to close, it will attempt to synchronize all outstanding WAL changes to the S3 replica before terminating.

Synchronous replication is on the Litestream roadmap but has not yet been implemented.

Increase snapshot frequency to improve restore performance

By default, the snapshot-interval on a replica is unset, so a new snapshot is only taken when the previous snapshot is removed because of retention. For example, with the default retention of 24h, a new snapshot will be taken once per day.

However, if you write data often, WAL files will build up over that period and increase your restore time. For frequent writes, it is recommended to either decrease your retention period or set the snapshot-interval to something lower, such as 1h.

For example, if your retention period is one day and your snapshot-interval is one hour then you will see a rolling set of 24 snapshots for your replica.
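For illustration, a hypothetical litestream.yml expressing that combination might look like the following; the database path and bucket name are placeholders, and the keys assume Litestream's replica-level retention and snapshot-interval settings:

```yaml
dbs:
  - path: /var/lib/app/app.db
    replicas:
      - url: s3://my-bucket/app
        retention: 24h          # keep a rolling day of history
        snapshot-interval: 1h   # take a new snapshot every hour
```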

Disable autocheckpoints for high write load servers

By default, SQLite allows any process to perform a checkpoint. A checkpoint is when pages that are written to the WAL are copied back to the main database file. Litestream works by controlling this checkpointing process and replicating the pages before they get copied back into the main database. Litestream prevents other processes from checkpointing by maintaining a read lock on the database in between its checkpoint requests.

However, under high load with many small write transactions (e.g. tens of thousands per second), the application’s SQLite instance can perform a checkpoint in between Litestream-initiated checkpoints and cause Litestream to miss a WAL file.

To prevent this, it is recommended to run your application with autocheckpointing disabled. To do this, run the following PRAGMA when you open your SQLite connection:

PRAGMA wal_autocheckpoint = 0;
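A minimal sketch of disabling autocheckpointing at connection time, using Python's sqlite3 driver with a placeholder path:

```python
import sqlite3

conn = sqlite3.connect("app.db")  # "app.db" is a placeholder path

# Setting wal_autocheckpoint to 0 disables SQLite's automatic
# checkpoints on this connection, leaving checkpointing to Litestream.
conn.execute("PRAGMA wal_autocheckpoint = 0;")
value = conn.execute("PRAGMA wal_autocheckpoint;").fetchone()[0]
```

This pragma is per-connection, so every connection your application opens should set it.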

Ephemeral hosts / multiple applications replicating to the same location can corrupt your replica

Multiple applications replicating to the same bucket can leave the replica in a state where you are unable to restore. It is your responsibility to ensure you do not have multiple applications replicating concurrently. This typically occurs if you run on an ephemeral/serverless platform such as Google Cloud Run, which runs a second instance in parallel during deployments. Those types of environments are not supported.

In the off-chance that it does happen, and you’re unable to restore, you may see an error along the lines of:

cannot find max wal index for restore: missing initial wal segment: generation=f6d6d1e96d38dafb index=00000093 offset=4152

In this case, manually copy your most recent snapshot from generations/<id>/snapshots/<snapshot>.lz4 and decompress it with lz4.

It’s a good idea to perform an integrity check on the database using sqlite3:

$ sqlite3 /path/to/db
sqlite> PRAGMA integrity_check;
ok

You can now use this snapshot file as your application’s database file and begin replicating it again with Litestream.