Database Failure Types and Solutions

Types of Database Failures

Database failures can usually be categorized into transaction, system, or media failures. Transactions can fail in the middle of execution for many reasons, as described by Elmasri & Navathe (2015).

Most Common Failures:

The most common failures are computer failure, transaction or system error, local error or exception condition, and concurrency control enforcement, including deadlock.

Computer failure:

Computer failures, sometimes called system crashes by specialists, occur when a piece of hardware fails. High-end systems usually have redundant hardware components to keep failures from occurring, but they still happen. A software failure is another type of computer failure; it may be caused by a vendor defect or by a defect application programmers introduced in the code. The last example of a computer failure is a network failure. A network failure can occur when a firewall change is introduced, the Internet connection fails, the firmware is updated on a switch, or other networking issues arise.

Transaction or System Error

While a transaction is running, errors can cause a system error or failure. An example of such a failure at our company this week involved a boolean field. The PostgreSQL database expects a true or false value by default, but a defect populated the field with null. The null value caused a system error because the software could not operate on a null, only on a true or false value. Users can also cancel transactions or queries. PostgreSQL logs all canceled queries, so one can identify the time, user, and application running when the transaction was cancelled.
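As an illustration of the boolean defect above, here is a minimal Python sketch (the function name and messages are hypothetical, not from our actual codebase) of the kind of strict check that would have caught the null before it reached a column that only accepts true or false:

```python
def coerce_flag(value):
    """Reject anything that is not an explicit boolean.

    Mirrors a column declared BOOLEAN NOT NULL: a NULL (None in Python)
    slipping through application code should fail fast rather than be
    silently treated as false.
    """
    if value is None:
        raise ValueError("flag must be true or false, got NULL")
    if not isinstance(value, bool):
        raise TypeError(f"flag must be a boolean, got {type(value).__name__}")
    return value
```

Failing fast in the application layer like this surfaces the defect in testing instead of as a runtime system error in production.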

Local errors

A transaction, procedure, or function can be coded to detect local errors or exception conditions. If a condition in one of these is not met, the transaction can be set up to abort and raise an exception. Another way to handle exception conditions is to abort and write a message to the Windows event log, or to send an alert, email, or text message to let the programmers know an issue has occurred.

Another way local errors can be captured is to log them into a table, have a script monitor the error table for issues, and alert team members when problems occur. The ELK (Elasticsearch, Logstash, and Kibana) open-source stack has become very popular for monitoring local error logs and producing alerts, dashboards, and other useful information. Our company is migrating to Amazon Web Services (AWS) and is working to use ELK to monitor all local errors on the system and notify us of failures.
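The log-to-a-table-and-monitor approach can be sketched in a few lines of Python using the standard library's sqlite3 module as a stand-in for the real database (the table and function names here are illustrative assumptions, not our production schema):

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical error table; in production this would live in the main database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE error_log ("
    " id INTEGER PRIMARY KEY,"
    " logged_at TEXT,"
    " source TEXT,"
    " message TEXT)"
)

def log_error(source, message):
    """Called by transactions/procedures when a local error is detected."""
    conn.execute(
        "INSERT INTO error_log (logged_at, source, message) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), source, message),
    )
    conn.commit()

def check_for_errors(last_seen_id):
    """Polling monitor: return rows newer than the last one already seen.

    A real monitor script would run on a schedule and email/page the team
    for each row returned here.
    """
    return conn.execute(
        "SELECT id, source, message FROM error_log WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
```

A scheduled job would call `check_for_errors` with the highest id it has already processed, so each error is alerted on exactly once.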

Concurrency Control

As Elmasri & Navathe (2015) describe, the concurrency control method may abort a transaction because it violates serializability, or it may abort one or more transactions to resolve a state of deadlock among several transactions. Transactions aborted because of serializability violations or deadlocks are typically restarted automatically later.

Concurrency control enforcement can also occur on the database. One transaction may be running and lock a particular row, block, or table, which prevents a second transaction from proceeding. The database system may allow transactions to wait or may abort them. The system I currently manage sometimes has runaway tasks submitted that lock important rows in the database. One such long-running transaction caused fifty or more sessions to go into a wait state and caused the CPU to spike. Once the team is alerted, the offending transaction is killed, and the hung-up sessions complete in seconds.
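The blocking scenario above can be modeled with a toy lock table in Python (this is a simplified sketch of one-writer-per-row locking, not how PostgreSQL's lock manager is actually implemented):

```python
class LockManager:
    """Toy row-lock table: one holder per row, later requesters wait."""

    def __init__(self):
        self.holders = {}  # row_id -> transaction holding the lock
        self.waiters = {}  # row_id -> queue of waiting transactions

    def acquire(self, txn, row_id):
        """Grant the lock if free; otherwise queue the transaction."""
        if row_id not in self.holders:
            self.holders[row_id] = txn
            return "granted"
        self.waiters.setdefault(row_id, []).append(txn)
        return "waiting"

    def kill(self, txn):
        """Abort a runaway transaction; its locks pass to waiting sessions.

        Returns the transactions that were unblocked.
        """
        unblocked = []
        for row_id, holder in list(self.holders.items()):
            if holder == txn:
                queue = self.waiters.get(row_id, [])
                if queue:
                    nxt = queue.pop(0)
                    self.holders[row_id] = nxt
                    unblocked.append(nxt)
                else:
                    del self.holders[row_id]
        return unblocked
```

This mirrors what we see operationally: the moment the offending transaction is killed, every session queued behind its locks is granted access and completes.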

Least Common Failures:

The least common failures are disk failures, catastrophes, physical problems, and challenges caused by user error.

Disk failure

Disk failures typically occur on hardware without redundancy or RAID built into the server. When redundancy such as RAID 5 is in place, one can replace the bad drive and rebuild the RAID set once the disk is swapped. Most enterprise systems today use SAN or NAS redundant disks or RAID storage arrays, so these types of failures are rare. Some larger systems have spare disks sitting in a bay awaiting a failure, and they are brought online automatically when a disk fails.

Catastrophes

Catastrophes can be broken down into two subcategories: physical problems and natural disasters. Physical problems can include air conditioner failures, power issues, fire, or hacking attacks. Human error can also cause a disaster. One team member with whom I used to work accidentally ran ‘rm -rf *’ in the root directory of a Linux machine in the office.

The Linux rm command removes or deletes files; with the -rf flags it recursively and forcibly deletes everything beneath the current directory, so this command wiped the whole drive while the system was running.

Catastrophes can also be caused by Mother Nature.

These types of catastrophes should have disaster recovery plans. During the Christchurch, New Zealand earthquake, many companies downtown had their disaster recovery plans tested. If a building was red-stickered, no one was allowed into the building to retrieve anything from the data center; the company infrastructure had to be rebuilt from the backup tapes. We managed to get all of our systems going again after the earthquake.

What is meant by catastrophic failure?

Catastrophic failure typically means a server or a database system fails and cannot return online without recovery from the last good backup. The earthquake in New Zealand was a clear example of a catastrophic system failure: no systems were available to come back online, and access to the original hardware was off-limits to the company because the building was destroyed.

What Is Timestamp Ordering Protocol For Concurrency Control?

Several database concurrency control techniques are used to ensure that concurrently executing database transactions have the noninterference, or isolation, property, as described by Elmasri & Navathe (2015). Most of these techniques ensure the serializability of schedules. They use concurrency control protocols: collections of rules that guarantee the transactions execute in an equivalent serial order.

One class of concurrency control protocols uses timestamps. The timestamp is a unique identifier for each transaction, produced by the database system, and timestamp values are generated in the same order as the transaction start times. The idea of the timestamp ordering algorithm is to enforce a serial order on the transactions based on their timestamps. A schedule in which the transactions participate is then serializable, and the only equivalent serial schedule allowed has the transactions in order of their timestamp values.

This protocol is called timestamp ordering. It differs from two-phase locking, where a schedule is serializable by being equivalent to some serial schedule allowed by the locking protocol. In timestamp ordering, the schedule must be equivalent to the particular serial order corresponding to the order of the transaction timestamps. The algorithm permits interleaving of transaction operations, but it must ensure that for each pair of conflicting operations in the schedule, the order in which the item is accessed follows the timestamp order. To achieve this, the algorithm associates each database item with two timestamp values, a read timestamp and a write timestamp, as described by Elmasri & Navathe (2015).

How Strict Timestamp Ordering Differs From Basic Timestamp Ordering

In basic timestamp ordering, whenever a transaction tries to issue a read or write operation, the algorithm compares the transaction's timestamp with the read and write timestamps of the item to guarantee that the timestamp order of transaction execution is not violated. If the timestamp order would be violated, the transaction is aborted and resubmitted to the system as a brand new transaction with a new timestamp, as described by Elmasri & Navathe (2015).
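The basic timestamp ordering checks can be sketched in Python as follows. This is a minimal illustration of the rules from Elmasri & Navathe (2015); the class and method names are my own, and it tracks only the per-item read/write timestamps, not the data values themselves:

```python
class BasicTO:
    """Basic timestamp-ordering checks for reads and writes.

    read_ts[item]  = largest timestamp of any transaction that read item
    write_ts[item] = largest timestamp of any transaction that wrote item
    """

    def __init__(self):
        self.read_ts = {}
        self.write_ts = {}

    def read(self, ts, item):
        # Too late: a younger transaction has already written this item.
        if ts < self.write_ts.get(item, 0):
            return "abort"  # restart later as a new transaction, new timestamp
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return "ok"

    def write(self, ts, item):
        # Too late: a younger transaction has already read or written the item.
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return "abort"
        self.write_ts[item] = ts
        return "ok"
```

For example, if transaction 2 reads item X and transaction 1 (an older timestamp) then tries to write X, the write is rejected and transaction 1 must restart with a fresh timestamp.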

Strict timestamp ordering is a variant of basic timestamp ordering that ensures the schedules are both strict (for easy recoverability) and conflict serializable. In this variation, a transaction that issues a read or write operation such that its timestamp is greater than the write timestamp of the item has its operation delayed until the transaction that wrote the value has committed or aborted. To implement this algorithm, it is necessary to simulate the locking of an item that has been written by a transaction until that transaction is either committed or aborted, as described by Elmasri & Navathe (2015).
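The strict variant can be sketched by adding one more piece of bookkeeping to the basic checks: the timestamp of the uncommitted writer of each item, which acts as the simulated lock. Again, this is an illustrative Python sketch of the rules, not a production implementation:

```python
class StrictTO:
    """Strict timestamp ordering: operations on an item with an
    uncommitted writer are delayed until that writer commits or aborts."""

    def __init__(self):
        self.read_ts = {}
        self.write_ts = {}
        self.pending_writer = {}  # item -> timestamp of uncommitted writer

    def read(self, ts, item):
        if ts < self.write_ts.get(item, 0):
            return "abort"  # basic TO rule still applies
        if item in self.pending_writer and ts > self.pending_writer[item]:
            return "delay"  # simulated lock: wait for the writer to finish
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return "ok"

    def write(self, ts, item):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return "abort"
        if item in self.pending_writer and ts > self.pending_writer[item]:
            return "delay"
        self.write_ts[item] = ts
        self.pending_writer[item] = ts  # hold the simulated lock
        return "ok"

    def commit(self, ts):
        # Releasing the simulated lock lets delayed transactions retry.
        for item, writer in list(self.pending_writer.items()):
            if writer == ts:
                del self.pending_writer[item]
```

Compared with basic timestamp ordering, the only new outcome is "delay": a younger transaction's read of an item dirtied by an older, still-uncommitted writer waits until that writer commits, which is what makes the schedule strict.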

References

Elmasri, R., & Navathe, S. B. (2015). Fundamentals of Database Systems, 7th Edition [VitalSource Bookshelf version]. Retrieved from https://bookshelf.vitalsource.com/books/9781323139899
