Disaster Recovery from Catastrophic Database Failures

Describe how disaster recovery from catastrophic failures is handled. Illustrate in detail

Catastrophes can be broken down into two subcategories: physical problems and catastrophes. Physical catastrophes can be air conditioner failures, power issues, fire, or hacking attacks. Catastrophic errors are usually caused by major disasters such as earthquakes or floods, which deem the data center no longer usable, as Elmasri & Navathe (2015) described.

When a physical problem or failure occurs with the database, such as a power outage, the database has restart capabilities built into it at start-up. The server power is restored, and the database management system is started. When starting, the database will go into recovery mode and replay or REDO the transactions in flight when the power outage occurred.

To understand how REDO works, let's use PostgreSQL as an example.

Postgresql stores its REDO in write-ahead logs (WAL) files as described by PostgreSQL (2017). A database gets initialized when it is first created. A directory is specified during initialization for a PostgreSQL database, and a series of sub-directories are created. One sub-directory is the xlog directory.

The WAL log files are stored in the xlog directory and are 16 megabytes files in size. Each file stores the inserts, updates, and deletes as they occur in sequential order from the database.

PostgreSQL (2017) points out that in the event of a crash, it is known the database will be able to recover the database using the WAL logs in the log directory. Any changes that have not been applied to the data pages can be redone from the WAL log. This roll-forward recovery is also known as REDO.

In a catastrophic failure, one needs to take backups regularly and be sure point and time recovery is enabled on the database management system. Point and time recovery allows one to recover from a database with a backup and the logs saved, which contain the insert updates and deletes.

Again, let's use PostgreSQL to illustrate how the backup and recovery method works. With PostgreSQL, one takes backups using pg_basebackup as described by PostgreSQL (2017). This backup is a physical backup of the database from a disk from a given moment in time. During this backup, all the WAL files are stored in the xlog directory. The backup writes in the WAL file when it starts and completes. If the database is running in archive mode, the WAL files get archived from the xlog directory to the archive location specified in the database configuration as described by PostgreSQL (2017).

The database can run in archive or non-archive mode. In archive mode the WAL files are written to an archive location.

The WAL files in the archive location, along with the physical backup, can be used to recover the database to a point and time. This type of backup and recovery scheme is expected and would allow one to recover from a catastrophic failure as long as the backup and archive logs are not stored on the disk of the database that had the catastrophic failure.

References PostgreSQL (2017). PostgreSQL 9.6.2 Documentation Write-Ahead Logging (WAL) [Reference Book]. Retrieved from (https://www.postgresql.org/docs/9.6/static/wal-intro.html)

Elmasri, R., Navathe, S. B. (2015-06-01). Fundamentals of Database Systems, 7th Edition [VitalSource Bookshelf version]. Retrieved from (https://bookshelf.vitalsource.com/books/9781323139899)

Posts in this series