Conflict Serializability

Alex Ou

E-mail: creamyfish@gmail.com

0. Background
1. The theory
1.1 Work in Adya's and Fekete's articles
1.2 Building on top of Adya's and Fekete's work
1.3 Type A, B, C of the Serializability Theorem and the proof
1.4 An issue with Condition 3'
1.5 Neighborhood of match change
1.6 Comparing three types of the Serializability Theorem
1.7 Generalized Serializability Theorems
2. Application to NDB Cluster
2.1 Commit logic of NDB Cluster
2.2 Completing the proof of the Serializability Theorem for NDB Cluster
2.3 Application of the Serializability Theorem to NDB Cluster
2.4 Applying the method to other Read-Committed implementations
2.5 An example: the TPCC Benchmark
2.6 Application of the Generalized Serializability Theorems to TPCC
3. The Ramification Theorem(s)
3.1 Issues with timestamp related fields
3.2 The process of ramification
3.3 Split of transactions to Ram(B)'
3.4 The Ramification Theorems
3.5 Joins, sub-queries and unions
3.6 Durability of consistency
3.7 Application to standalone systems
3.7.1 MySQL InnoDB
3.7.1.1 Relation between consistency, durability and durability of consistency
4. Type D of the Serializability Theorem
4.1 Isolation levels for a field-based database system
4.2 Demonstration of type C of the Serializability Theorem under the newly defined isolation levels
4.3 Type D of the Serializability Theorem
4.4 Relations between different types of the Serializability Theorem
4.5 Type H of the Serializability Theorem
5. Generic generalization of Serializable Snapshot Isolation to a distributed database system
5.1 Making the generalization generic
5.2 List of distributed clocks satisfying the Clock Condition
5.2.1 HLC
5.2.2 Lamport's logical clock
5.2.3 An increasing sequence
5.2.4 'Happened before'
5.3 Alternatives for a field-based Snapshot Isolation implementation
5.3.1 Alternative one
5.3.2 Alternative two
5.4 Primary algorithm for the generalization
5.5 A flag system
5.6 False Positives
5.7 Summary and discussion
6. Field level capable locking system and a hybrid system: TODO
7. History, issues and the possible future
7.1 Questions to be answered
7.1.1 Question one
7.1.2 Question two
7.1.3 Free and paid service
7.1.4 Question three
7.2 Hot spot problem
7.3 Security implications
References
Appendix A: The theoretical framework, etc.
A.1 The theoretical framework
Appendix B: The 'happened before' partial order
Appendix C: More info about conflict serializability in [Be 87]
Appendix D: Survey of consistency implementations in distributed database systems
D.1 Google's Spanner
D.2 OceanBase
D.3 CockroachDB
D.4 TiDB

One of the primary goals of this article is to cure the insomnia of some sophisticated database application developers.

0. Background

In 1992, the SQL standard ANSI/ISO SQL-92 was drafted; among its specifications were those of isolation levels. It was published in 1993.

In 1995, H. Berenson et al. published the paper A Critique of ANSI SQL Isolation Levels ([Be 95]), which pointed out that the isolation level specifications in ANSI/ISO SQL-92 were ambiguous. The definitions they suggested to avoid the ambiguity were, however, simply 'disguised versions of locking' and therefore disallowed optimistic and multi-version mechanisms.

In 2000, Adya et al. published the paper Generalized Isolation Level Definitions ([Ad 00]), suggesting new definitions of isolation levels intended to replace those in ANSI/ISO SQL-92. Among those definitions was a new definition of serializability.

In 2005, Fekete et al. published the paper Making Snapshot Isolation Serializable ([Fe 05]), which used Adya's definition of serializability to develop a way of achieving serializability on top of Snapshot Isolation.

In 2008, Cahill et al. published the paper Serializable Isolation for Snapshot Databases ([Ca 08]), introducing a practical way of applying the theory developed by Fekete et al. to an implementation of serializability on top of Snapshot Isolation.

In 2011, PostgreSQL adopted the method developed in the last two papers into their implementation of Snapshot Isolation (which they used to call serializability) and claimed they had achieved serializability again. They now call their database software 'the world's most advanced open source database'.

In 2014, the bitcoin exchange Flexcoin suffered a consistency-based attack in which 896 bitcoins were stolen because the isolation level used in its database was not serializable. This half-million-dollar heist ultimately forced the exchange to shut down. There are links about this incident in [Wa 17]. It clearly indicates that consistency has security implications for databases.

Example 0: In both MySQL InnoDB and NDB Cluster, with isolation level Read-Committed, create and populate a table t with the following statements:


	  
      create table t (id1 INT, id2 INT, id3 INT, value INT);   -- for NDB Cluster, append ENGINE=NDB before the semicolon

      insert into t values(1, 2, 4, 0);
	  
	

Now if we execute 'select * from t;' from the mysql prompt, we will see exactly one row in t:

id1   id2   id3   value
  1     2     4       0

	
       Prompt One: (T1)                                                         Prompt Two: (T2)

    start transaction;  

    update t set value=value+1 where id2=3;

                                                                                start transaction;

                                                                                update t set id2=3 where id1=1;

                                                                                commit;

    update t set id3=5 where id1=1;

    update t set value=value-1 where id2=3 and id3=5;

    commit;
	
    

Now if we execute 'select * from t;' again, we will see the following result in both InnoDB and NDB Cluster:

id1   id2   id3   value
  1     3     5      -1

If these two transactions were executed serially, 'value' would equal 0 in either order. So this history is NOT serializable. In fact, the InnoDB example is a counterexample to the Serializability Theorem in Adya's paper if we generalize it from tuple-level granularity to field-level granularity so that conflicts are interpreted at the field level. This immediately casts doubt on the correctness of PostgreSQL's implementation of serializability, since that is exactly what Fekete's paper does and Fekete's paper is what the implementation is based on. The NDB Cluster example, on the other hand, indicates we must address a similar problem if we are to develop a field-level serializability implementation for a distributed system like NDB Cluster. But to get to either of these, we must introduce the mathematical infrastructure first.
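The field-level reading of this anomaly can be reproduced in a few lines outside any database. The sketch below is my own toy simulation of the interleaving, not InnoDB or NDB code, under the assumption that each update sees the latest committed field values at the moment it runs:

```python
# Hypothetical field-level simulation of Example 0 under Read-Committed:
# the row starts as (id1, id2, id3, value) = (1, 2, 4, 0) and each
# statement evaluates its predicate against the latest committed values.

row = {"id1": 1, "id2": 2, "id3": 4, "value": 0}

# T1: update t set value=value+1 where id2=3;   -- no match, id2 is still 2
if row["id2"] == 3:
    row["value"] += 1

# T2: update t set id2=3 where id1=1; commit;
if row["id1"] == 1:
    row["id2"] = 3

# T1: update t set id3=5 where id1=1;
if row["id1"] == 1:
    row["id3"] = 5

# T1: update t set value=value-1 where id2=3 and id3=5;   -- now it matches
if row["id2"] == 3 and row["id3"] == 5:
    row["value"] -= 1

print(row)  # {'id1': 1, 'id2': 3, 'id3': 5, 'value': -1}
```

In either serial order the two updates of 'value' either both fire or neither fires, leaving it at 0; only the interleaving produces -1.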


	                                                                                            ( To be continued … )                   
	

1. The theory

This part is about consistency, the C as in ACID, and is based on two papers, namely [Ad 00] by Adya et al. and [Fe 05] by Fekete et al.

Remark: Although I think there are glitches in Adya's and Fekete's articles, both are cornerstones of the literature. I call Adya's paper the "road map to consistency" and always go there for hints and solutions when I am stuck. Fekete's paper further develops Adya's work and has given me a lot of inspiration.

The ultimate goal of consistency is to achieve serializability, so that no abnormal phenomena can be observed in any history of executed transactions. We study lower isolation levels (lower in the sense that the histories they allow are supersets of those serializability allows) toward that goal. The other reason for studying lower isolation levels, namely that we sometimes need to sacrifice consistency for performance, is not this article's purpose. We are after the Holy Grail: serializability.

1.1 Work in Adya's and Fekete's articles

In Adya's paper, except for some insertion statements, every SQL statement is expressed as a predicate read followed by a number of regular reads and writes. The predicate read, which corresponds to the WHERE clause of a SQL statement, is an interpretation of the process of identifying which data objects to access. Interpreting such a process as a read is an important conceptual advance, because it makes clear that a write can affect other SQL statements in two ways: by affecting the value of a data object that is to be read, as usual, or by affecting whether a data object is selected for access.

A transaction is like a program in a programming language such as C. It may have input parameters, return values (for example, in the form of output parameters of a stored procedure that encapsulates the transaction), regular programming-language constructs such as branching statements and loops, deterministic math calculations, non-deterministic math calculations (like calling the RAND() function in MySQL) and, finally, SQL statements. SQL statements are of two types: those that access the database and those that don't. For those that don't access the database, their behavior depends only on the environment, just like regular programming-language statements; for those that do, their behavior also depends on the database state. What is more, no matter which branches the execution follows, the SQL statements are executed in a sequential order. We can easily find the constructs we've just described in the following

Example 1: The following is the Order-Status transaction in TPCC. A variable preceded with a colon is either a parameter passed in or an internal variable.

	  
	    The Order-Status transaction(Read-Only):

        if the customer in the order is represented by a name{                         // 60% of the time

          select count(c_id) into :namecnt from customer 
             where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

          declare c_name cursor for 
             select c_balance, c_first, c_middle, c_id from customer
                where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
                order by c_first;

          open c_name;

          if (:namecnt%2) :namecnt++;
          for (n=0;n<:namecnt/2;n++) {
              fetch c_name
                 into :c_balance, :c_first, :c_middle, :c_id;
          }

          close c_name;
        } 
        else if the customer in the order is represented by an id{                    // 40% of the time

          select c_balance, c_first, c_middle, c_last from customer
             where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;
        }
    
        declare c_order cursor for
           select o_id, o_carrier_id, o_entry_d from orders 
              where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
              order by o_id desc;

        open c_order;

        fetch c_order 
           into : o_id, :o_carrier_id, :o_entry_d;

        close c_order;

        select ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_delivery_d from order_line
           where ol_d_id=:c_d_id and ol_w_id=:c_w_id and ol_o_id=:o_id;
      

Only the SQL statements that are useful for later conflict analysis are included here; those that aren't are skipped for simplicity. The last SQL statement, for example, is actually surrounded by a number of cursor-related SQL statements for defining and opening a cursor and fetching its rows.


	                                                                                                       ## 
	  

The next few paragraphs are essentially a copy-and-paste of Section 4.1 of Adya's paper, which is one of its authors' major contributions to the literature. If you are familiar with the literature, you may just skim through them.

A database consists of data objects that can be read or written by transactions. While Fekete's paper considers different granularities of objects, e.g., pages, tuples and fields, Adya's paper basically only deals with tuples. Our article will deal with tuples, fields and, most importantly, 'decision sets' as we'll define later.

A data object has one or more versions. However, transactions only interact with the database in terms of data objects; the database system maps each operation on a data object to a specific version of that object. A transaction may read versions created by committed, uncommitted, or even aborted transactions; constraints imposed by some isolation levels will prevent certain types of reads, for example, those created by aborted transactions.

When a transaction writes a data object x, it creates a new version of x. A transaction Ti, i>=0, can modify a data object multiple times; its first modification of data object x is denoted by xi:1, the second by xi:2, and so on; these are called internal versions. Version xi denotes the final modification of x performed by Ti before it commits or aborts. A transaction's last operation, commit or abort, indicates whether its execution was successful or not; there is at most one commit or abort operation for each transaction. The committed state reflects the modifications of committed transactions. When transaction Ti commits, each version xi created by Ti becomes a part of the committed state and we say that Ti installs xi; the system determines the ordering of xi relative to other committed versions of x. If Ti aborts, xi does not become part of the committed state.

Conceptually, the initial committed state comes into existence as a result of running a special initialization transaction, Tinit. Transaction Tinit creates all data objects that will ever exist in the database; at this point, each data object x has an initial version, called the unborn version. When an application transaction creates a data object x (e.g., by inserting a tuple in a relation), we model it as the creation of a visible version for x. Thus, a transaction that loads the database creates the initial visible versions of the objects being inserted. When a transaction Ti deletes a data object x (e.g., by deleting a tuple from some relation), we model it as the creation of a special dead version, i.e., in this case, xi is a dead version. Thus, data object versions can be of three kinds: unborn, visible, and dead.

If a data object x is deleted from the committed database state and inserted later, we consider the two incarnations of x to be distinct objects. When a transaction Ti performs an insert operation, the system selects a unique data object x that has never been selected for insertion before and Ti creates a visible version of x if it commits.

We assume data object versions exist forever in the committed state to simplify the handling of inserts and deletes, i.e., we simply treat inserts/deletes as write (update) operations. An implementation only needs to maintain visible versions of data objects, and a single-version implementation can maintain just one visible version at a time. Furthermore, application transactions in a real system access only visible versions.

This generic multi-version view of the database in Adya's paper is another, even more important, conceptual advance. Now every concurrency control mechanism can be viewed as a special case of this setup. For example, MySQL InnoDB's MVCC chooses to expose several visible versions of a data object to transactions, while NDB Cluster chooses to expose just one. So if we can prove a generic version of the Serializability Theorem based on this setup, we can apply it to every concurrency control mechanism that satisfies the conditions of the Theorem. This is exactly what we are going to do in this article: killing multiple birds with one stone.

With the concepts of transaction and database model in place, Adya's paper goes on to capture an execution of a database system as a history. A history H over a set of transactions consists of two parts: a partial order of events E that reflects the order of the operations (e.g., read, write, abort, commit) of those transactions, and a version order, <<, which is a total order on committed versions of each data object.

A write operation on data object x by transaction Ti is denoted by wi(xi) (or wi(xi:m)); if it is useful to indicate the value v being written into xi, we use the notation wi(xi, v). When a transaction Tj reads a version of x that was created by Ti, we denote this as rj(xi) (or rj(xi:m)). If it is useful to indicate the value v being read, we use the notation rj(xi, v). In a committed transaction, wi(xi) writes the committed version xi.

Now let me list the conditions for Adya's Serializability Theorem.

The partial order of events E of a history satisfies the following conditions:

Condition 1: The 'happened before' partial order preserves the order of all events within a transaction including the commit and abort events.
Condition 2: If an event rj(xi:m) exists in E, it is preceded by wi(xi:m) in E, i.e., a transaction Tj cannot read version xi:m of object x before it has been produced by Ti. Note that the version read by Tj is not necessarily the most recently installed version in the committed database state; also, Ti may be uncommitted when rj(xi:m) occurs.
Condition 3: If an event wi(xi:m) is followed by ri(xj) without an intervening event wi(xi:n) in E, xj must be xi:m. This condition ensures that if a transaction modifies object x and later reads x, it will observe its last update to x.
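To make Condition 3 concrete, here is a toy checker, my own encoding rather than anything from Adya's paper: it scans one transaction's events in execution order and flags a read of x that does not return the transaction's own latest write of x.

```python
# Toy check of Condition 3: within one transaction, a read of x that follows
# a write of x (with no intervening write of x) must return the last version
# the transaction itself produced.  Events are ("w", obj, version) or
# ("r", obj, version), listed in the transaction's execution order.

def satisfies_condition_3(events):
    last_write = {}                      # obj -> last version written so far
    for kind, obj, version in events:
        if kind == "w":
            last_write[obj] = version
        elif kind == "r" and obj in last_write:
            if version != last_write[obj]:
                return False             # read skipped the transaction's own write
    return True

ok  = [("w", "x", "x1:1"), ("r", "x", "x1:1"), ("w", "x", "x1:2"), ("r", "x", "x1:2")]
bad = [("w", "x", "x1:1"), ("w", "x", "x1:2"), ("r", "x", "x1:1")]
print(satisfies_condition_3(ok), satisfies_condition_3(bad))  # True False
```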

In a standalone database system like MySQL InnoDB, usually a variant of the temporal order is chosen to be the partial order.

Example 2: Let t be the real clock time of the reference frame the system is in, and C(t) the clock reading of a standalone database system. Assume the following condition is satisfied:


	           t < t'  =>  C(t) < C(t')          Monotonicity Condition (this implies C(t) < C(t') => t < t')
	  

Namely, the clock C always advances forward. Under this condition we can define a variant of temporal order as follows:

For event e in E, set of events of a history, let Begin(e) and End(e) represent the starting time C(t0) and ending time C(t1) of e. An edge (e , e') is defined to be a pair of events e, e' in E that satisfies the following condition: End(e) < Begin(e').

We define a relation R on E as follows: R = { (e , e') | there are e0, e1, …, en in E, n > 0, such that e0 = e, en = e', and each (ei , ei+1) is an edge, 0 <= i <= n-1 }. The sequence e0, e1, …, en in this definition is called a path between e and e'. In this case, we say that e 'happened before' e', or e' 'happened after' e, and R is called the 'happened before' relation. It is easy to prove that this relation is a strict partial order.
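Assuming Begin(e) <= End(e) for every event, the construction can be sketched in a few lines: build the edge relation from the clock readings, take its transitive closure, and observe that the result is irreflexive (hence a strict partial order) and leaves overlapping events unordered. The event names and timestamps below are made up for illustration.

```python
from itertools import product

# Events as name -> (begin, end) clock readings; edge (e, f) iff End(e) < Begin(f).
events = {"a": (0, 2), "b": (3, 5), "c": (6, 8), "d": (1, 7)}

edges = {(e, f) for e, f in product(events, events)
         if events[e][1] < events[f][0]}

# 'Happened before' is the transitive closure of the edge relation.  (For
# edges built from begin/end times the closure adds nothing new, since
# End(e) < Begin(f) <= End(f) < Begin(g) makes the edge relation transitive.)
hb = set(edges)
changed = True
while changed:
    changed = False
    for (e, f), (g, h) in product(list(hb), list(hb)):
        if f == g and (e, h) not in hb:
            hb.add((e, h))
            changed = True

print(("a", "b") in hb, ("a", "c") in hb)             # a happened before b and c
print(("a", "d") in hb, ("d", "a") in hb)             # a and d overlap: unordered
print(all((e, e) not in hb for e in events))          # irreflexive: strict order
```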


	                                                                                                       ## 
	  

Let's investigate the order of events inside a transaction in a standalone system and how the 'happened before' partial order preserves it. Although there are different execution paths in a transaction due to its loops and conditional branches, its execution follows exactly one such path. In this sequential execution of the transaction, SQL statements are executed one by one, and the order between events from different SQL statements apparently coincides with the 'happened before' partial order. Also, in a SQL statement like 'update t set col=col+1', the reading of col must happen before the writing, since this is just a 'read-modify-write' sequence. Again this coincides with the 'happened before' partial order. Some events in a transaction can't be ordered by the 'happened before' partial order, for example, a predicate read and the reads/writes that follow it. This is because the predicate read is a logical operation and its duration can overlap with the reads/writes that follow. But now we have shown that Condition 1 is satisfied by the 'happened before' partial order for standalone database systems like MySQL InnoDB and PostgreSQL.

Condition 2 says that if a read happens, a write event of the version read exists and precedes the read in the partial order. This is a reasonable requirement if we choose the write event carefully, so that a read is only possible after the write ends.

Condition 3 is just another reasonable requirement for a transaction processing system and can also be interpreted without the 'happened before' partial order.

Remark: The partial order we just defined uses a local clock satisfying the Monotonicity Condition, which makes it unsuitable for distributed systems like NDB Cluster. We will shortly introduce a new partial order for distributed systems, which coincides with the one just defined when the system consists of a single node.

Condition 4: The history must be complete. Namely, if E contains a read or write event that mentions a transaction Ti, E must contain a commit or abort event for Ti.

From now on, we'll only consider complete histories. Notice that Condition 4 allows us to consider only committed transactions, since atomicity guarantees that the aborted ones have NO effect on the system. So from now on we'll only consider systems that can provide atomicity.

For a standalone database system like MySQL InnoDB or PostgreSQL, the total order << we impose on the committed versions of a data object is just the temporal order of the write events. In a distributed system like NDB Cluster, versions of the same data object might migrate from one machine to another (for example, when NDB Cluster re-partitions after a new node is added). In this case, an unambiguous total order can still be defined if we bring the order of machines along the migration path into the picture.

Now let's take a closer look at the predicate read, a concept closely related to the process of identifying which data objects to access, which is a key component of Adya's paper. A predicate read corresponds to the predicate in the WHERE clause of a SQL statement, which decides which tuples the statement will return for access. In some cases, the predicate read is exactly the process of identifying which data objects to access. This is, for example, the case when we use a unique index to access the database, as in the following statement from Example 1, where we access a customer's info by id:


      select c_balance, c_first, c_middle, c_last from customer
         where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;                                                                                                
	

In other cases, the predicate read is only part of the process of identifying which data objects to access. This is, for example, the case when we use a non-unique index to access the database but only some of the returned tuples qualify. For instance, in Example 1, when we access a customer's info by last name, we first return all the customers with that last name and then pick out the mid-point one in a specific order via a cursor. This is demonstrated by the following statements:


      declare c_name cursor for 
         select c_balance, c_first, c_middle, c_id from customer
            where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
            order by c_first;

      open c_name;

      if (:namecnt%2) :namecnt++;
      for (n=0;n<:namecnt/2;n++) {
          fetch c_name
             into :c_balance, :c_first, :c_middle, :c_id;
      }

      close c_name;                                                                                    
	

In this case, the process of identifying which data objects to access consists of the predicate read specified in the WHERE clause plus the cursor iteration. For now, we will only consider statements referencing a single table. Statements with complex constructs like joins, sub-queries or unions will be handled later.

As the name suggests, we consider a predicate read to be a type of read. Hence a write can affect its result. A new pair of conflicts arises from this: Directly predicate-read-depends and Directly predicate-anti-depends. Now let's see how Adya's paper defines these two conflicts. Please pay close attention, because it is these definitions that lead to Example 0.

Definition 1: Version set of a predicate read.

When a transaction executes a read or write based on a predicate P, the system selects a version for each tuple in P’s relations. The set of selected versions is called the version set of this predicate read and is denoted by Vset(P).

Notice that the versions in Vset(P) can be either committed or internal ones. Predicate P is then evaluated against the version set Vset(P), and the tuples whose versions in the version set satisfy the predicate are chosen for later access. This evaluation process is the realization of a predicate read. A predicate read performed by transaction Ti is denoted by ri(P: Vset(P)), and let's call the subset of Vset(P) that satisfies the predicate the matching set. A write can change the matching set in two ways: it can change a non-matching version outside the set into a matching version inside the set (the IN operation), or change a matching version inside the set into a non-matching version outside the set (the OUT operation). Adya's paper defines this formally as follows:

Definition 2: Change the matches of a predicate read.

We say that a transaction Ti changes the matches of a predicate read rj(P: Vset(P)) of Tj if Ti installs xi, xh immediately precedes xi in the version order, and xh matches P whereas xi does not or vice-versa. In this case, we also say that xi changes the matches of the predicate read.

The definition above identifies Ti as a transaction whose write changes the matching set of rj(P: Vset(P)). Note that i can be equal to j.
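Definition 2 can be made concrete with a toy tuple-level model (my own encoding): walk the versions of one tuple in version order and report each version whose predicate evaluation differs from that of its immediate predecessor. Run on the version history of the row in Example 0 against the predicate of T1's last update, it also foreshadows the multi-field problem discussed later: by this definition only T1's own write to id3 'changes the matches', even though T2's write to id2 was equally necessary for the flip.

```python
# Toy model of Definition 2 at tuple granularity.  Versions of one tuple x
# are listed in version order; version x_i changes the matches of predicate P
# iff P(x_h) != P(x_i), where x_h immediately precedes x_i.

def match_changers(versions, predicate):
    """Return the version names (in version order) that change the matches of P."""
    changers = []
    for h, i in zip(versions, versions[1:]):
        if predicate(h[1]) != predicate(i[1]):
            changers.append(i[0])
    return changers

# Version history of the row in Example 0: Tinit installs (id2=2, id3=4),
# T2 sets id2=3, then T1 sets id3=5.  Predicate of T1's last update:
# id2=3 and id3=5.
versions = [("x_init", {"id2": 2, "id3": 4}),
            ("x_2",    {"id2": 3, "id3": 4}),
            ("x_1",    {"id2": 3, "id3": 5})]
P = lambda v: v["id2"] == 3 and v["id3"] == 5

print(match_changers(versions, P))  # ['x_1']: only T1's own write flips the match
```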

Definition 3: Directly predicate-read-depends(predicate WR conflict).

Transaction Tj directly predicate-read-depends on Ti if Tj performs an operation rj(P: Vset(P)), xk belongs to Vset(P), i = k or xi << xk, and xi changes the matches of rj(P: Vset(P)).

Definition 4: Directly predicate-anti-depends(predicate RW conflict).

We say that Tj directly predicate-anti-depends on Ti if Tj overwrites an operation ri(P: Vset(P)), i.e., Tj installs a later version of some object that changes the matches of a predicate read performed by Ti.

Notice that in Definitions 3 and 4, the write that conflicts with the predicate read doesn't have to install a version that is in Vset(P) or immediately after one in Vset(P).

Example 3: Let t be the table described in Example 0, with a unique row (1,2,4,0) in it, i.e.,

id1   id2   id3   value
  1     2     4       0

Execute the following two transactions alternately multiple times (the 'start transaction;' … 'commit;' boilerplate is skipped):


        update t set id3 = 5 where id1 = 1;

        update t set id3 = 4 where id1 = 1;
	  

After a while, we stop and execute the following transaction:


        select value from t where id2=2 and id3=5;
	  

This transaction directly predicate-read-depends on all the previous read-write transactions, because what they did was just move the row into and out of the matching set, according to Definition 3.
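The claim in Example 3 can be checked mechanically. The sketch below (again my own toy encoding, with a shortened run of three alternating updates standing in for the "multiple times") applies Definition 3: the select reads the latest version x_k, and it predicate-read-depends on every transaction whose installed version at or before x_k changes the matches of P.

```python
# Toy check of Definition 3 for Example 3: the select reads the latest
# committed version x_k; it predicate-read-depends on every transaction T_i
# with x_i at or before x_k in the version order whose write changed the
# matches of P: id2=2 and id3=5.

def predicate_read_depends(versions, k, predicate):
    """versions: [(txn, values)] in version order; k: index of the version read."""
    deps = []
    for h in range(1, k + 1):
        prev, cur = versions[h - 1][1], versions[h][1]
        if predicate(prev) != predicate(cur):     # x_h changes the matches
            deps.append(versions[h][0])
    return deps

# A shortened run of Example 3's alternating updates: id3 flips 4 -> 5 -> 4 -> 5.
versions = [("Tinit", {"id2": 2, "id3": 4}),
            ("T1",    {"id2": 2, "id3": 5}),
            ("T2",    {"id2": 2, "id3": 4}),
            ("T3",    {"id2": 2, "id3": 5})]
P = lambda v: v["id2"] == 2 and v["id3"] == 5

print(predicate_read_depends(versions, len(versions) - 1, P))
# ['T1', 'T2', 'T3']: the read depends on every one of the update transactions
```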


	                                                                                                       ## 
	  

Definitions 1-4 are the way Adya's paper captures how a write affects a predicate read, and they make sense because everything there is tuple-based. But Fekete's paper aims to generalize Adya's theory to other granularities, including field granularity, and it seems the authors thought that was just a process of copy and paste. Let's re-examine Definitions 1-4 and see if that is the case. In the next few paragraphs, we'll only pay attention to field granularity, since we hope it will help minimize conflicts.

Remark: From this point until the analysis of Example 0, everything said about Fekete's paper is my understanding of how they generalize Adya's work, since some aspects require explanation they didn't provide. My own work, inspired by Adya's and Fekete's, starts after the analysis of Example 0. Whether my understanding of Fekete's generalization is correct or not doesn't affect the correctness of my work, in which every definition is in math and every major theorem is proved. However, if my understanding of Fekete's generalization is correct, the correctness of PostgreSQL's serializability implementation is in doubt because of the existence of Example 0. Even if my guesses are wrong, they should still prove a version of the Serializability Theorem, since the one Adya's work relies on may not be appropriate for a modern workload, as we'll see.

For Definition 1, Vset(P) now just consists of field versions, and these field versions are generated by field writes. The matching set of ri(P: Vset(P)) is then the set of tuples whose corresponding field versions in Vset(P) satisfy the predicate P.

For Definition 2, it makes sense when the predicate read involves only one field. It doesn't when it involves more than one field, and this will become clear when we define what a 'decision set' is.

For Definitions 3 & 4, I quote Fekete's paper as follows:

"Definition 2.1 (Transactional Dependencies). Following the concepts of Adya et al. [2000], but with new terminology, we characterize several types of transactional dependency in an interleaved SI history.

We say that there is a Tm → Tn dependency (a predicate-write-read dependency) if Tm installs a data item version Xm so as to alter the set of items retrieved for a predicate read by Tn, and also the commit of Tm precedes the start of Tn, so that the result of the predicate read by Tn takes into account a version of data item X equal to or later than the one installed by Tm.

We say that there is a Tm → Tn dependency (a predicate-read-write dependency or predicate-anti-dependency) if Tn changes a data item X to version Xn so as to alter the set of items retrieved for a predicate read by Tm, and also the commit of Tn follows the start of Tm, so that the result of the predicate read by Tm takes into account a version of data item X prior to the one installed by Tn."

In contrast to Adya's paper, these definitions are stated purely in English instead of math, so I have to guess what they actually mean. From the context, they are just restatements of those in Adya's paper, only this time for Snapshot Isolation. So 'so as to alter' is just 'change the matches of a predicate read', in particular when a 'data item' is of field granularity. Again, if the predicate concerned is relevant to exactly one field, there is no ambiguity in this statement. But what if it is relevant to multiple fields, like the last update in T1 of Example 0? What exactly does it mean to say 'a field write alters the set of items retrieved for a predicate read'?

Let's continue examining other conditions for Adya's Serializability Theorem.

Definition 5: Directly item-read-depends(item WR conflict).

We say that Tj directly item-read-depends on Ti if Ti installs some object version xi and Tj reads xi.

Definition 6: Directly item-anti-depends(item RW conflict).

We say that Tj directly item-anti-depends on transaction Ti if Ti reads some object version xk and Tj installs x’s next version (after xk) in the version order.

Definition 7: Directly Write-Depends(WW conflict).

A transaction Tj directly write-depends on Ti if Ti installs a version xi and Tj installs x’s next version (after xi) in the version order.

These three definitions define the common conflict types in consistency theory. Notice that the data object versions in these three definitions are of tuple-level granularity in Adya's paper, but can be of any granularity, including field, tuple and page level, in Fekete's paper.

We sometimes also combine Definitions 3 & 5 and 4 & 6, calling them Directly Read-Depends (WR conflict) and Directly Anti-Depends (RW conflict) respectively, for convenience.

Definition 8: We define the direct serialization graph arising from a history H, denoted by DSG(H), as follows. Each node in the graph corresponds to a committed transaction and directed edges correspond to different types of direct conflict. There is a read/write/anti-dependency edge from transaction Ti to transaction Tj if Tj directly read/write/anti-depends on Ti.

Finally we are in a position to present the conditions relevant to conflicts that the serializability isolation level in Adya's paper has to satisfy. For a history H to be serializable, the following three conditions must be proscribed.

Condition 5: Aborted Reads.

A history H shows phenomenon 'Aborted Reads' if it contains an aborted transaction T1 and a committed transaction T2 such that T2 has read some object (maybe via a predicate read) modified by T1. This phenomenon can be represented using the following history fragments:


        w1(x1:i) ... r2(x1:i) ... (a1 and c2 in any order)

        w1(x1:i) ... r2(P: x1:i, ...) ... (a1 and c2 in any order)

Here a1 and c2 refer to the abortion of T1 and commit of T2 respectively.

Condition 6: Intermediate Reads.

A history H shows phenomenon 'Intermediate Reads' if it contains a committed transaction T2 that has read a version of object x (maybe via a predicate read) written by transaction T1 that was not T1’s final modification of x. The following history fragments represent this phenomenon:


        w1(x1:i) ... r2(x1:i) ... w1(x1:j) ... c2

        w1(x1:i) ... r2(P: x1:i, ...) ... w1(x1:j) ... c2

Proscribing Condition 6 means the framework is essentially based on a Read-Committed isolation level. This assumption, however, is already implied in the definitions of the different kinds of conflicts, since the versions involved are all committed ones. We already know that versions in Vset(P) could be either committed or internal ones, but reading of internal ones only happens when the corresponding write is in the same transaction and hence doesn't constitute a conflict.

Condition 7: Cycle of Conflicts.

A history H shows phenomenon 'Cycle of Conflicts' if DSG(H) contains a directed cycle with each edge being one of the five types of conflicts we defined earlier.
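Condition 7 is mechanically checkable once the DSG is built. The following is a minimal sketch (illustrative names, not code from Adya's paper): it represents DSG(H) as an adjacency list keyed by transaction id and detects a directed cycle with a depth-first search.

```python
# Sketch: represent DSG(H) as an adjacency list and test Condition 7
# (a directed cycle of conflict edges). Names are illustrative only.

def has_cycle(dsg):
    """dsg maps a committed transaction id to the set of transaction
    ids it has a conflict edge to (WR, WW or RW)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in dsg}

    def dfs(t):
        color[t] = GRAY
        for u in dsg.get(t, ()):
            if color.get(u, WHITE) == GRAY:      # back edge => cycle
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in list(dsg))

# The RW edge T1 -> T2 plus a conflict edge T2 -> T1, as in Example 0:
dsg = {"T1": {"T2"}, "T2": {"T1"}}
print(has_cycle(dsg))   # True: a cycle exists, so Condition 7 is exhibited
```

A history whose DSG passes this check (together with Conditions 1-6) is what the theorem claims to be serializable.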

Everything between Definition 5 and Condition 7 applies to data objects of field granularity and hence is ready for Fekete's generalization, barring the issue with predicate reads over multiple fields that we mentioned in Definition 2.

Example 0 (Continuation...): Adya's paper claims that a history H that satisfies Conditions 1-4 and proscribes Conditions 5-6 is serializable if and only if it proscribes Condition 7. (We haven't defined a 'serializable history' mathematically, but one consequence is that it leaves the database in the same state as a serial history does. So we are getting ahead of ourselves here, and we will fill in the blanks later.) Example 0 (the MySQL InnoDB part) has been shown to satisfy Conditions 1-3. It certainly is a complete history and satisfies Condition 4. It certainly proscribes Conditions 5-6 because the history is executed under the Read-Committed isolation level. Similar arguments apply to Fekete's generalization (at least in the way I understand it) for everything except Condition 7. So now it's time to analyze its DSG and see if there is a cycle inside.

Clearly the first update statement of T1 and the only update statement of T2 give rise to a predicate RW conflict at both tuple level granularity and field level granularity. And since the history is NOT serializable, there must be a conflict of some sort from T2 to T1, whether it is interpreted at the tuple level as in Adya's paper or at the field level as in Fekete's generalization.

In the tuple level case, whether the update in T2 gives rise to a predicate WR conflict with the predicate read in the second update 'update t set value=value-1 where id2=3 and id3=5;' is debatable. To make the unique row (1,2,4,0) in t match this predicate, id2 and id3 have to be updated to 3 and 5 respectively. The update of these two fields is split into two statements: 'update t set id2=3 where id1=1;' in T2 and 'update t set id3=5 where id1=1;' in T1. It seems the first update does play a role here and should receive some credit. But what if the second update had set id3 to 6? In that case the final tuple version is not a match, and it seems the first update no longer deserves that credit. This means the first update alone doesn't change the matches of the predicate read in the last statement of T1, and we should NOT consider it a predicate WR conflict. But a conflict from T2 to T1 still exists, and so does a conflict loop, since there is a WW conflict from T2's only update to T1's last two updates. So Example 0 is NOT a counter-example to the Serializability Theorem in Adya's paper in this case.

But in the field level case, this completely falls apart. (In major commercial systems like MySQL InnoDB, Oracle and PostgreSQL, data objects are usually accessed in tuples, and we haven't defined versions at the field level for these systems. Fortunately there is no ambiguity about this in Example 0, so let's get ahead of ourselves a little bit; we will fill in the blanks later.) The WW conflict from the tuple case no longer exists, since the update in T2 and the last two updates in T1 modify different fields. This indicates that if Fekete's generalization of Adya's Serializability Theorem, which interprets every conflict at the field level, were correct, there would have to be a conflict of some type other than WW from T2 to T1. But there is no apparent candidate: although this time the field versions used in the predicate read of the statement 'update t set value=value-1 where id2=3 and id3=5;' are committed ones, the issue of the possible predicate WR conflict remains. In other words, the only update in T2 doesn't change the matches of the predicate read by itself, since that would also require the update of id3 in T1, but the definition of a predicate-based conflict should not involve a third party. Hence inconsistency shows up without a conflict loop, and this is a counter-example to Fekete's generalization if I understand it correctly.

So there might be something wrong with Fekete's generalization of Adya's Serializability Theorem (at least the way I understand it). The question is whether the problem is intrinsic to Adya's Serializability Theorem or arises from Fekete's generalization. It turns out the problem mainly comes from Fekete's generalization (at least the way I understand it). We will have more discussion of this after we prove our own versions of the Serializability Theorem.

But Adya's Serializability Theorem has its own issues too. Adya's paper claims their version of the serializability theorem can be proved by theorems in the book [Be 87]. When I trace it back to this book, it seems to me the only relevant theorem is the serializability theorem in Chapter 2. But there is something peculiar in the proof of this theorem. First, it assumes, without proof, that the operations (item reads/writes) in a history are identical to those of its corresponding serial history. Usually we can't assume that for today's online transaction processing systems, since whether an operation exists depends on the state of the database when the statement is executed. For example, two executions of the same transaction could take different execution paths that depend on different database states. Hence such an assumption requires a proof. Second, there is no evidence that the authors of this book were aware of predicate reads. I think the theorem only applies to simple database applications like the following: tuples are accessed strictly by primary keys, and there is neither insertion nor deletion of tuples, nor any primary key change, while the application is executing. In Appendix C, we'll talk about these in more detail.

Let's summarize the basic facts we got so far.

  1. Although some concepts in Adya's paper apply to other granularities, the theory there is mainly designed for tuple level granularity. Fekete's paper claims they can prove a Serializability Theorem of field level granularity for Snapshot Isolation based on Adya's paper. They would have to generalize Adya's Serializability Theorem to field level granularity first and prove it; I can't find such a proof in Fekete's paper.
  2. Adya's Serializability Theorem is itself problematic, because it depends on a theorem that may not be suitable for modern database applications, where secondary key accesses are ubiquitous.

And when I try to generalize Adya's Serializability Theorem to such applications and prove it, Example 0 shows up. This immediately casts doubt on the correctness of PostgreSQL's implementation of the serializable isolation level, since it is based on Fekete's paper. I've performed some simple tests under PostgreSQL's serializable isolation level and found that writes on different fields of the same tuple conflict with each other, as in a regular Snapshot Isolation implementation. This implies PostgreSQL's implementation of the serializable isolation level doesn't exactly follow Fekete's paper, and it is not optimal, since two non-conflicting transactions may be forced to execute non-concurrently. So Example 0 is not allowed in PostgreSQL, since the WW conflict from T2 to T1 is present. But I think both Fekete's team and PostgreSQL should really clarify this, because of both facts above. Of course, in the worst-case scenario where PostgreSQL's implementation of the serializable isolation level turns out to be incorrect, the work I will present later is going to provide a fix.


	                                                                                            ( To be continued … )                   
	  

1.2 Building on top of Adya's and Fekete's work

Example 0 (the NDB Cluster part) also indicates that if we were to develop a field level Serializability Theorem for a system like NDB Cluster, which is a major goal of this article, we must also pay attention to problems similar to those we've just described. Fortunately, Adya's paper has set up a framework that makes everything easy. We've introduced their work along the way. What we need is to fill in a few blanks, and a suitable version of the serializability theorem will pop up. So let's get to it.

First let's explain why a serializability implementation based on data objects of field level granularity is important for 'pessimistic technology'. In 1993, right after SQL-92 was released, Alexander Thomasian published the paper [To 93], which shows thrashing behavior when a system with two phase locking is under loads that cause too many locks. From that point on, when people ask why 'pessimistic technology' like MySQL InnoDB doesn't use its two phase locking based serializable isolation level as the default isolation level, the answer is usually 'its performance degrades under loads with excessive locks' and the like. But if Thomasian's paper is examined more carefully, one can probably see that what really causes the thrashing behavior is 'excessive lock waits'.

If the real reason behind this thrashing behavior is 'excessive lock waits', there are two obvious approaches to reduce them: reduce the lock waiting time and reduce the number of conflicting locks. NDB Cluster has done the first one for us by moving all data from disk to memory. Consider the example of a locking read: for a disk based database system, the time this read lock is held depends on whether the data object is in main memory or on disk. When the read is a miss, any transaction waiting on this lock to be released just has to wait for the data object to be fetched from disk. NDB Cluster certainly minimizes the wait time due to disk accesses.

For minimizing the number of conflicting locks, consider the following

Example 4: Let t be the table in Example 0. Two transactions are started so that each of them executes a different one of the following two statements:


        select id2 from t where id1 = 1;

        update t set id3 = 4 where id1 = 1;                                                                                               
	  

If the serializability implementation is based on data objects of tuple level granularity, these are two conflicting statements, since they read and write the same tuple respectively. In 'pessimistic technology', where conflict resolution is preventive, we usually have to acquire row locks of the appropriate type before accessing the tuple in both transactions, in case they are executed concurrently. This causes a lock wait if they access the tuple simultaneously. But if the serializability implementation is based on data objects of field level granularity, these two statements don't conflict with each other and we don't have to acquire any lock for the first statement, hence there will be no lock wait between them. So using data objects of field level granularity in a serializability implementation does in general reduce the number of conflicting locks. That is why we prefer data objects of field level granularity in our serializability implementation.
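The difference between the two granularities can be sketched as a simple set test (a hypothetical helper, not NDB Cluster code): at tuple granularity the two statements above always conflict, while at field granularity they conflict only if their field sets intersect.

```python
# Sketch: at tuple granularity any read/write on the same tuple
# conflicts; at field granularity they conflict only when the field
# sets intersect. Field names follow Example 4.

def conflicts_at_field_level(read_fields, write_fields):
    """A field-granularity read/write conflict exists iff the read's
    field set and the write's field set share a field."""
    return bool(set(read_fields) & set(write_fields))

# 'select id2 from t where id1 = 1' reads {id1, id2};
# 'update t set id3 = 4 where id1 = 1' writes {id3}.
print(conflicts_at_field_level({"id1", "id2"}, {"id3"}))   # False: no lock wait needed
```

Note this only sketches the pairwise test; the article's point is that a field level lock manager can skip lock acquisition entirely for the read.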


	                                                                                            ##                   
	  

After we understand the importance of developing a serializability implementation that is based on field level granularity data objects, it is time to define a field version system and impose a total order on it. Today's commercial systems like NDB Cluster, MySQL InnoDB and PostgreSQL still commit their modifications at the tuple level, and derived committed field versions can be defined by splitting the committed tuple write into several committed field writes (only for the fields that are modified). A tuple version can then be viewed as an array of field versions. The following example demonstrates this idea.

Example 5: Assuming the same situation as in Example 0, with a tuple (1, 2, 4, 0) in table t. Consider the following sequence of updates executed in that order:


        update t set value = 1 where id1 = 1;

        update t set id2 = 3 where id1 = 1;

        update t set id3 = 5 where id1 = 1;

        update t set value = 2 where id1 = 1;          
	  

This sequence generates the following tuple versions: (1, 2, 4, 1), (1, 3, 4, 1), (1, 3, 5, 1), (1, 3, 5, 2). It generates a single derived field version for each of the fields id2 and id3, being (3) and (5) respectively, but two derived field versions for the value field, being (1) and (2) respectively.
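The derivation in Example 5 can be sketched as follows (a hypothetical helper, assuming the tuple layout (id1, id2, id3, value)): a derived field version is recorded exactly when the field's value changes between consecutive tuple versions.

```python
# Sketch: derive committed field versions from a sequence of tuple
# versions by recording a field version only when the field changes.
# The tuple layout (id1, id2, id3, value) follows Example 5.

def derived_field_versions(initial, tuple_versions, fields):
    prev = dict(zip(fields, initial))
    derived = {f: [] for f in fields}
    for tv in tuple_versions:
        cur = dict(zip(fields, tv))
        for f in fields:
            if cur[f] != prev[f]:          # a field write occurred
                derived[f].append(cur[f])
        prev = cur
    return derived

fields = ("id1", "id2", "id3", "value")
versions = [(1, 2, 4, 1), (1, 3, 4, 1), (1, 3, 5, 1), (1, 3, 5, 2)]
print(derived_field_versions((1, 2, 4, 0), versions, fields))
# id2 -> [3], id3 -> [5], value -> [1, 2], matching Example 5
```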


	                                                                                            ( To be continued … )                   
	  

Earlier we showed that we may impose a total order on the committed tuple versions even if the tuple migrates to another node (like the re-partitioning NDB Cluster performs when a new node is added). So we can define a total order on the committed field versions as follows: a committed field version is before another if the underlying tuple version is. It is easy to prove that this is a total order, and we call it the 'derived order' on committed field versions. Of course, if we want to build a system that is completely free of tuple versions, we need to assume that a total order on the field versions can be imposed. For instance, in a system with a field level capable locking system, we may use the order of locking events on a field to introduce such an order.

Besides the committed versions, we also need to define an order on the internal versions of a data object inside a transaction, in case it is modified more than once. We explained earlier, before Example 1, that the execution of a transaction is a sequential process, namely, on a statement by statement basis. So assuming each statement doesn't update a data object twice, the order of these versions is obvious, for both tuple and field level versions.

Now let's develop a partial order that is suitable for a distributed system like NDB Cluster and is also an extension of the one defined in Example 2.

Example 6: Let t be the real clock time of a reference frame where a distributed database system is situated; let Ni, 0 < i < m+1, be the nodes in the distributed system; and let Ci(t), 0 < i < m+1, be the clock reading on the corresponding node, satisfying the Monotonicity Condition, namely


        t < t' => Ci(t) < Ci(t'), for all 0 < i < m+1                                                                                             
	  

With all local clocks ticking forward, we further let E be the set of all events in all the nodes Ni, 0 < i < m+1. For event e in E which happens to be an event of node Ni, define Begin(e) and End(e) to represent the starting time Ci(t0) and ending time Ci(t1) of e as before. An edge (e , e') is defined to be a pair of events e, e' in E that satisfies either of the following conditions:

  1. Events e and e' are events of the same node and End(e) < Begin(e'). This is called a type I edge.
  2. Events e and e' happen in different nodes, with e representing the sending of a message m, Send(m), to the underlying network, and e' representing the receiving of the message m, Receive(m), from the underlying network. This is called a type II edge.

We define a relation R on E as follows: R = { (e, e') | there are e0, e1, …, en in E, n > 0, such that e0 = e, en = e', and each (ei, ei+1) is an edge of either type, 0 <= i <= n-1 }. The sequence e0, e1, …, en in this definition is called a path between e and e'. We say that e 'happened before' e', that e' 'happened after' e, and we call R the 'happened before' relation. Let's prove that it is a strict partial order.

Transitivity follows directly from the definition. For irreflexivity, we prove it by contradiction: assume (e, e) is in R for some e; in other words, there are events e1, …, en, n > 0, such that the sequence e, e1, …, en, e becomes a path between e and itself. Here n > 0 because (e, e) can be neither type I nor type II. For a type I edge (e', e'') in the sequence, Begin(e') < End(e') < Begin(e''), by the Monotonicity Condition and the definition respectively. This implies the real clock time represented by Begin(e') is smaller than that represented by Begin(e''), by the Monotonicity Condition. For a type II edge (e', e'') in the sequence, the real clock time represented by Begin(e'), which corresponds to the sending of the first bit of the message to the underlying network, is smaller than the real clock time represented by Begin(e''), which corresponds to the receiving of the first bit of the message from the underlying network. That is because it takes time for the first bit to be transmitted over the underlying network. So we also have Begin(e') < Begin(e'') in this case. This immediately gives rise to a contradiction, since the existence of the sequence e, e1, …, en, e implies the real clock time represented by Begin(e) is smaller than itself. Hence the conclusion.
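To make the definition concrete, here is a minimal sketch (illustrative event names and clock readings, not taken from NDB Cluster) that computes R as the transitive closure of type I and type II edges and checks the properties just proved:

```python
# Sketch: the 'happened before' relation R as the transitive closure
# of type I edges (same node, End(e) < Begin(e')) and type II edges
# (Send(m) -> Receive(m)). Event names are illustrative only.
from itertools import product

def happened_before(events, sends):
    """events: {name: (node, begin, end)} with local clock readings;
    sends: set of (send_event, receive_event) message pairs."""
    edges = set(sends)                                   # type II edges
    for (a, (na, _, ea)), (b, (nb, bb, _)) in product(events.items(), events.items()):
        if a != b and na == nb and ea < bb:              # type I edges
            edges.add((a, b))
    closure = set(edges)
    changed = True
    while changed:                                       # transitive closure
        changed = False
        for (x, y), (y2, z) in product(list(closure), list(closure)):
            if y == y2 and (x, z) not in closure:
                closure.add((x, z))
                changed = True
    return closure

events = {"e":  ("A", 1, 2),   # commit decision on node A
          "s":  ("A", 3, 4),   # Send(m)
          "r":  ("B", 1, 2),   # Receive(m)
          "e2": ("B", 3, 4)}   # local commit on node B
R = happened_before(events, {("s", "r")})
print(("e", "e2") in R)            # True: e 'happened before' e2
assert all(x != y for x, y in R)   # irreflexive, as proved above
```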

We use this partial order to capture the temporal order in a distributed system like NDB Cluster. Since this partial order R becomes the one defined in Example 2 if the distributed system happens to consist of one node, we will call it the 'happened before' partial order from now on. This implies that if we can prove a theorem based on the partial order we just defined, it applies to all three implementations in concern, assuming the other conditions are satisfied too.

We also use this partial order to capture the causal relation between events. One typical scenario: we say 'event e in node A causes event e' in node B' if a message m is sent by A after e happens, is received by B before e' happens, and contains information that might affect the outcome of e'. For example, in NDB Cluster, e might represent the event that a transaction coordinator (in node A) decides to commit, m might represent the message sent to an LQH in node B to start the local commit, and e' might represent the local commit on B. When the amount of information in m exceeds the packet size limit of the underlying network, we have to send more than one message. Of course it is always safe to start event e' after the reception of the last message. Some implementers of the system, however, might choose to start event e' before that to improve performance (for example, to pre-process the messages already received and set up some data structures for e' according to the information received so far, while delaying the core part of e' until the reception of the last message). This is all right as long as e' is started after the reception of the first message, since the relation e 'happened before' e' is still captured. From this point on, when the causal relation between e in node A and e' in another node B is obvious from the context (like the example about commit logic we've just shown), we always assume there is at least one message whose sending and receiving events, together with e and e', constitute the typical scenario described in this paragraph, so that the causal relation is captured by the partial order.


	                                                                                            ##                   
	  

Before we prove that Conditions 1, 2, 3 & 4 with the 'happened before' partial order are satisfied by MySQL InnoDB, NDB Cluster and PostgreSQL, let's alter Conditions 2 & 3 a bit so that they are more appropriate for our Serializability Theorem. Notice that Example 0 is not affected by this change, so it is OK that we didn't do this earlier.

Condition 2': If an event rj(xi:m) exists in E, it is preceded by wi(xi:m) in E, i.e., a transaction Tj cannot read version xi:m of object x before it has been produced by Ti; if an event rj(P: Vset(P)) exists in E, xi belongs to Vset(P) and xi:m is used for x's match, then wi(xi:m) exists and precedes the matching read. Notice that the version read by Tj is not necessarily the most recently installed version in the committed database state. Also, Ti may be uncommitted when rj(xi:m) or rj(P: Vset(P)) occurs, for example, if i equals j.

In Condition 2' we didn't claim wi(xi:m) 'happened before' rj(P: Vset(P)). A predicate read could be a lengthy process, for example a full table scan in some cases, and wi(xi:m) could overlap with rj(P: Vset(P)).

Condition 3': If an event wi(xi:m) is followed by ri(xj) without an intervening event wi(xi:n) in E, xj must be xi:m; if an event wi(xi:m) is followed by ri(P: Vset(P)) without an intervening event wi(xi:n) in E and x belongs to Vset(P), then the version of x chosen by Vset(P) is xi:m. This condition ensures that if a transaction modifies object x and later reads x in an item read or a predicate read, it will observe its last update to x.

Here the granularity of x may be field level or tuple level for now. We will also consider 'decision set' level granularity once we define it.

For MySQL InnoDB and PostgreSQL, the 'happened before' partial order is just the one defined in Example 2, and we've already proved it satisfies Condition 1 & Condition 2' in the item read case. The argument for the predicate read case of Condition 2' is similar. For NDB Cluster, the proof needs to be delayed until the next section, after we explain its commit logic. So we are good with Conditions 1 & 2'. Condition 3' is just another reasonable requirement for a transaction processing system and can be interpreted without the 'happened before' partial order. We will also require the transactions to be complete, hence Condition 4 is satisfied too.

We earlier explained that a total order can be imposed on the committed versions, not only in standalone systems like MySQL InnoDB and PostgreSQL, where the write events of the same data object happen in a serial order at the same node, but also in the case where NDB Cluster performs a re-partition when a new node is added. In general, we may impose a stronger condition on the write events of the same data object by requiring that they happen in a serial order that coincides with the 'happened before' partial order. Notice that not every pair of temporally isolated events is related by the 'happened before' partial order; consider, for example, events in two nodes that don't communicate with each other. This stronger condition implies the version order of the committed versions coincides with the 'happened before' partial order too, namely, one version is before another if and only if it 'happened before' the other. From the definition of the 'happened before' partial order we can easily see that this stronger condition again applies to all three implementations in concern, MySQL InnoDB, PostgreSQL and NDB Cluster, for both tuple versions and derived field versions.

Not only does the mathematical representation of the temporal order need to be redefined, but so does the definition of a serial history. A serial history in a standalone system like MySQL InnoDB is usually defined as a sequence of transactions executed 'one by one', which implies the first event of a transaction happens after the last event of the transaction immediately before it, if such a transaction exists. This doesn't make much sense in a distributed system like NDB Cluster, because the first event of a following transaction may not happen in the same node as the last event of the previous transaction, and this issue must be addressed in the new definition. Also, one consequence of a serial history is that a transaction sees the latest modifications made by previous transactions, but not the ones after it. We need to define a serial history in a distributed system so that this property holds.

Definition 9: Let t be the real clock time of the reference frame where the system is located and upon which the 'happened before' partial order is defined. Given an ordered sequence of transactions T1, T2, …, Tn, pick an API node as a coordinator and execute the transactions in the given order, so that the next transaction is only started after the previous one reports its state as committed to the coordinator. When a transaction is executed, it always reads the latest committed version of a data object that is available. Such an execution is called the serial history of T1, T2, …, Tn.

Remark: In NDB Cluster, what we have to do is open a mysql prompt, execute the ordered transactions one by one and only start a new one after the previous one commits. In this case, the MySQL node that the prompt connects to serves as the coordinator. So the coordinator here is very different from a transaction coordinator in NDB Cluster.

Notice that the requirement that a transaction in a serial execution always reads the latest available committed version of a data object is in general NOT appropriate for histories where transactions may overlap. We can easily find such examples in the Snapshot Isolation or Repeatable-Read isolation levels, where a read may return an older version of an object when multiple versions are stored in the database. And in NDB Cluster's Read-Committed isolation level, MySQL InnoDB's Read-Committed and Repeatable-Read isolation levels, and PostgreSQL's Snapshot Isolation, we can always arrange the transactions to be isolated enough that this requirement is satisfied.

From this definition, it is clear that the order of transactions coincides with the order of write events of the same objects in the 'happened before' partial order, hence coincides with the total order of the object versions.

We assume that when a transaction reports its state as committed back to the API coordinator, all of its modifications are available for reading. This assumption is reasonable, since distributed systems usually use two phase commit as their commit logic; in two phase commit a local component commits its modifications, and hence makes them ready for reading, before it sends the committed state back to the coordinator. For example, this assumption is satisfied by NDB Cluster, as one can see when we explain the commit logic of NDB Cluster in the next section.

With this assumption, it's easy to see that a later transaction in the execution sequence sees the latest modifications of the transactions before it: the time when the modifications of a previous transaction become available is before the time its commit message reaches the coordinator, which is before the time a later transaction starts, which is in turn before the time the reading happens; and since the transaction reads the latest available modification, it reads the latest modifications of the transactions before it.

On the other hand, an earlier transaction in the execution sequence sees no modification of a later transaction. Suppose the opposite were true and an earlier transaction could read a modification of a later transaction. The write operation 'happened before' the read operation in any reasonable implementation where Condition 2' is satisfied with the 'happened before' partial order. However, the read operation 'happened before' the commit of the earlier transaction, which 'happened before' the start of the later transaction, which again 'happened before' the write operation. This contradicts the fact that 'happened before' is a strict partial order.

Remark: Adya's Serializability Theorem assumes the partial order to be a generic one satisfying Conditions 1, 2, 3 & 4. It is very difficult to prove a generic version of the theorem like that. Of course, the fact that we can't prove it doesn't mean it is not provable. But we will only prove versions in which the partial order is 'happened before'.

Definition 9 also applies to standalone systems like MySQL InnoDB, where the coordinator and the only local component are co-located. In that case it becomes the regular serial history definition, where the transactions are executed one by one. This serial history concept can also easily be generalized to a set of SQL statement snippets, so that each entity in the set is just a subset of a transaction.

Next let's redefine the conflict types. Definitions 5, 6 & 7 remain the same, only this time field level data objects are included. In the Serializability Theorem we are going to present shortly, Conditions 5 & 6 will be proscribed. This makes the isolation level we start from practically Read-Committed, so the object versions in Definitions 5, 6 & 7 are all committed versions.

Finally we are in a position to redefine the predicate-based conflicts that might have caused the appearance of Example 0. Starting from this point, we will constrain the predicate in a SQL statement to be one that only involves one table; complex predicates involving more than one table, like joins, sub-queries and unions, will be handled later. We define the set of columns in the only table that a predicate P uses to decide whether a tuple is selected for later access to be the 'decision set'. We also define the set of columns a write (a field write or a tuple write) changes to be the 'write set' of that write operation.

Example 7: Consider the following two statements executed in two transactions against the table defined in Example 0:


        update t set id2 = 2, id3 = 3 where id1 = 1;

        select * from t where id1 = 1 and id2 = 2 and id3 = 4;                                                                                                      
	  

The 'decision sets' for these two statements are {id1} and {id1, id2, id3} respectively. The 'write set' for the first statement is {id2, id3} if the writing part is viewed as a tuple write, and {id2} for the first write and {id3} for the second write if the writing part is viewed as two field writes.
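As an illustration (column sets are written out by hand here; a real system would extract them from the parsed statement), the 'decision set' and 'write set' admit a simple necessary-condition test for whether a write can possibly change a predicate's matches:

```python
# Sketch: the 'decision set' of a predicate and the 'write set' of a
# write, for the two statements in Example 7. A hypothetical helper,
# not code from any system discussed in this article.

def may_change_matches(write_set, decision_set):
    """A write can change which tuples match a predicate only if it
    touches a column the predicate uses to decide matches. This is a
    necessary condition, not a sufficient one (cf. Example 0)."""
    return bool(write_set & decision_set)

update_decision_set = {"id1"}                    # where id1 = 1
update_write_set    = {"id2", "id3"}             # set id2 = 2, id3 = 3
select_decision_set = {"id1", "id2", "id3"}      # where id1 = 1 and id2 = 2 and id3 = 4

print(may_change_matches(update_write_set, select_decision_set))   # True
print(may_change_matches(update_write_set, update_decision_set))   # False
```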


	                                                                                            ##                   
	  

The Serializability Theorem that we are going to present next comes in three flavors, with varying conditions on data object granularity. We will also redefine Definitions 1-4 so that they are suitable for the various flavors.

Type A: All the conflicts are based on tuple versions. It is used to demonstrate that Adya's Serializability Theorem can actually be proved, without resorting to the results in [Be 87].

For this type, the first internal tuple version generated by a write operation needs more attention. Besides the fields that are actually written in the write operation, there might be fields that are not. Since we are generating a tuple version, we have to fill in a value for each of those. Where do these values come from?

There are three cases. In the first case, all fields in the tuple are modified and we need to do nothing. In the second case, a tuple read 'happened before' the write in the same statement, like in 'update...set col=col+1'; in this case, the values naturally come from the tuple read. In the third case, it is a blind write like 'update grades set grade="A"'. In this article, we assume the following:

The third case works the same as the second one except that it reads a tuple version implicitly before the write. In other words, there is no real blind write in our framework.

For MySQL InnoDB and NDB Cluster, the tuple is locked before it is updated and the locked tuple's value serves as the implicit read; for PostgreSQL, the write is always against a tuple version in the snapshot taken when the transaction starts and that tuple version serves the same purpose. This is so since no other transaction can write the tuple before the designated one does in any of the three cases. So the assumption works for the three implementations concerned in this article.
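The assumption can be sketched as follows: the first internal tuple version always starts from a (possibly implicit) prior read, so unwritten fields are never undefined. This is an illustrative Python sketch, not any engine's actual code:

```python
# Illustrative sketch: building the first internal tuple version of a write.
# Written fields come from the write itself; any unwritten field is filled
# from the tuple version read (explicitly or implicitly) before the write.

def first_internal_version(written, prior=None):
    if prior is None:          # case 1: every field of the tuple is written
        return dict(written)
    version = dict(prior)      # cases 2 and 3: start from the read version
    version.update(written)    # then apply the written fields on top
    return version

# 'update t set id2 = 2' against a tuple (id1=1, id2=9, id3=3):
assert first_internal_version({"id2": 2}, {"id1": 1, "id2": 9, "id3": 3}) \
       == {"id1": 1, "id2": 2, "id3": 3}
```

Under the assumption above, a blind write behaves exactly like case 2: the `prior` argument is the implicitly read tuple version, so there is no real blind write in this framework.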

For predicate-based conflicts, Definitions 1 & 2 remain the same as in Adya's paper. The digression starts with the added definition for the 'conflict with' concept. The item-based conflicts remain the same as in Adya's paper, of course.

Type B: Updates in transactions still generate tuple versions, both internal and committed ones. Field versions are derived from these tuple versions as usual. All the regular item-based conflicts are interpreted at field level, namely, based on the derived field versions.

Like the derived field versions, we also derive 'decision set' versions from tuple versions. Namely, a derived 'decision set' version is generated whenever fields in the 'decision set' are modified in a tuple write and these derived 'decision set' versions are ordered the same way as their associated tuple versions.

Example 5 (Continuation...): If (id2, id3) turns out to be a 'decision set' of a predicate read, the second and third updates generate two derived 'decision set' versions, while the first and the fourth updates don't. If each update statement represents a transaction, these two derived 'decision set' versions are committed derived 'decision set' versions; if on the other hand, all four updates are in a transaction, these two derived 'decision set' versions are internal derived 'decision set' versions.
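The derivation rule can be sketched as follows. Since Example 5 is not restated here, the write sets below are stand-ins for four updates of which only the second and third touch the 'decision set':

```python
# Illustrative sketch: a derived 'decision set' version is emitted only when
# a tuple write touches some column in the 'decision set'; derived versions
# inherit the order of their associated tuple versions.

def derived_ds_versions(tuple_write_sets, ds):
    """tuple_write_sets: ordered write sets, one per generated tuple version.
    Returns the (1-based) positions of updates that derive a ds version."""
    return [i for i, ws in enumerate(tuple_write_sets, 1) if ws & ds]

# Stand-in write sets: only the 2nd and 3rd updates modify ds = {id2, id3}.
ds = {"id2", "id3"}
writes = [{"id1"}, {"id2"}, {"id3"}, {"id4"}]
assert derived_ds_versions(writes, ds) == [2, 3]
```

Whether the two derived versions are committed or internal depends only on transaction boundaries, exactly as stated above.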


	                                                                                            ##                   
	  

For predicate reads, we use derived 'decision set' versions to capture the conflicts. For Definition 1, Vset(P) consists of a set of derived 'decision set' versions for the tuples to be matched. For Definition 2, we also use derived 'decision set' versions to replace the tuple versions in it. In other words, the versions xi, xh in the following restated definition are derived 'decision set' versions.

And we'll use derived 'decision set' versions to capture predicate-based conflicts as we'll describe in Definitions 3' & 4' shortly.

There is a new kind of conflict in this type of the Serializability Theorem though: the WR, RW and WW conflicts between reads and writes of the committed derived 'decision set' versions, defined just like those for tuple or field versions. It is easy to see that the serial order of write events on the 'decision set' coincides with the 'happened before' partial order, since that of the tuple write events does. After introducing this new set of conflicts, we include them among the item-based conflicts, as opposed to the predicate-based ones. In particular, Definitions 5, 6 & 7 for type B now include committed 'decision set' versions, not just committed field versions.

As in the tuple write case in type A, we may require an implicit 'decision set' version to be read before a 'decision set' version write if not all the fields in the 'decision set' are written. That is because, in a system that still generates tuple versions, a 'decision set' version write where not all the fields in the 'decision set' are written implies a tuple version write where not all the fields in it are written. Hence we may require that a tuple read precede such a tuple write as in the type A case, and the derived 'decision set' read will serve our purpose. As in the type A case, MySQL InnoDB, NDB Cluster and PostgreSQL all fulfill this requirement. This implicit 'decision set' version is called the initial internal 'decision set' version. And this read is the only case where a 'decision set' read is referred to as an item read; all the others are involved in a predicate read. Notice this implies that in MySQL InnoDB, NDB Cluster and PostgreSQL, an item RW conflict on a 'decision set' actually doesn't exist, since the next version is written by the accompanying write, not by another transaction. In MySQL InnoDB and NDB Cluster, that is because the read happens after the lock for the accompanying write is acquired, and this lock will block writes from other transactions until the accompanying write finishes; in PostgreSQL, the two writes on the 'decision set' are considered writes on the same tuple, so their associated transactions will be non-concurrent under Snapshot Isolation, and hence the conclusion.

For possible item reads following a predicate read, we will use the 'derived' field versions from the same tuple version associated with the matching 'decision set' version.

Remark: The recognition of conflicts between 'decision set' versions is a key abstraction for the development of type D of the Serializability Theorem, which is free of tuple versions. When we apply type B of the Serializability Theorem to NDB Cluster, we will see how these conflicts are buried under regular tuple lock conflicts.

This type of the Serializability Theorem is designed for the current generation of technology where data objects are mainly accessed by tuple. Notice that in the proof we will try to conclude that the field and 'decision set' versions, NOT the tuple versions, are read and written in the same way in a given history and its serial peer. An example will be given to demonstrate this after the proof.

Type C: This type of the Serializability Theorem still requires the tuple concept, which corresponds to the row concept in a table, but it doesn't require the concept of tuple versions. It does require the existence of field versions and, for each field, a total order generated by serial write operations on that field that is consistent with the 'happened before' partial order. All conflicts are interpreted at the field level, and this implies that the 'decision set' of a predicate read consists of only one field. It also requires a way to identify which field version to read after a predicate read if item reads follow.

Type C of the Serializability Theorem is designed for the future, for example, when a field level capable locking system is available. Notice that the item-based conflicts between reads and writes of 'decision set' versions are just those between field versions.

In type C, the requirement on how to specify which field versions to read after a predicate read is matched is implicit: any way works as long as it makes a serial history satisfy its key property, namely, that when a transaction is executed, it always reads the latest committed version of a data object that is available.

So Definitions 1 & 2 are syntactically identical for all three types of the theorem, with the item versions concerned being of different granularity. Before we present the Serializability Theorem, we need to redefine Definitions 3 & 4. Let's start with the following

Definition: Conflict with

For i not equal to j, a write in transaction Ti which establishes a version xi is said to conflict with a predicate read rj(P: Vset(P)) for predicate P in transaction Tj if the following two conditions are satisfied:

Condition A: The intersection of the 'write set' of the write operation that establishes xi and the 'decision set' of P is not empty.
Condition B: The write operation in Ti changes the match of predicate P, and its generated version xi is the closest, in the version order, to the version used in the predicate read, either before or after it.

Version xi in this definition represents a tuple, a 'decision set' or a field version respectively in types A, B & C of the Serializability Theorem.

Note that Adya's paper doesn't emphasize that the write in Condition B that changes the match must be the closest one, although in general there could be more than one write that does so, as the following example indicates.

Example 8: Let everything be the same as in Example 0. The following three transactions T1, T2, T3 are executed in the serial order: T1, T2, T3. And they contain only one single statement.


        T1: update t set id1=2 where id1 = 1

        T2: update t set id1=1 where id1 = 2

        T3: select * from t where id1 = 1                                                                                                  
	  

From T3's point of view, T1 moves the only tuple in t out of the matching set and T2 moves it back in, so they both change the match of the predicate read. But Condition B suggests that only T2's write counts.

Notice that in 'pessimistic technology', we still need to impose a write lock for T1's write. That is because it is a preventive approach and T1 might affect the match of the predicate read in T3 in general (for example, if T2 is removed from the history). But for 'optimistic technology', one might try to take advantage of this fact, since we now have one less conflict than Adya's paper would have suggested.
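Under the assumption that the committed versions of a tuple are numbered consecutively from 0 and that each version's match status against P is known, Conditions A & B can be sketched in Python (illustrative names only); the assertions replay Example 8:

```python
# Illustrative sketch of the 'conflict with' definition (Conditions A & B).

def changes_match(matches, i):
    """The write establishing version i changes the match of P if the
    tuple's match status differs between version i-1 and version i."""
    return matches[i] != matches[i - 1]

def conflicts_with(i, k, write_set, decision_set, matches):
    """Does the write establishing version i conflict with a predicate
    read that used version k? matches[j] is version j's match status."""
    if not (write_set & decision_set):     # Condition A
        return False
    if not changes_match(matches, i):      # Condition B: changes the match
        return False
    # Condition B: x_i must be the match-changing version closest to x_k,
    # before or after it, i.e. no match-changing write lies between them.
    between = range(i + 1, k + 1) if i < k else range(k + 1, i)
    return not any(changes_match(matches, j) for j in between)

# Example 8: version 0 is the original tuple (id1 = 1, matching "id1 = 1"),
# T1 writes version 1 (id1 = 2, no match), T2 writes version 2 (id1 = 1,
# matching again); T3's predicate read uses version k = 2.
matches = [True, False, True]
ds = {"id1"}
assert not conflicts_with(1, 2, {"id1"}, ds, matches)  # T1: T2's write is closer
assert conflicts_with(2, 2, {"id1"}, ds, matches)      # only T2's write counts
```

The two assertions capture exactly the point of Example 8: both writes change the match, but Condition B rules T1's write out because T2's match-changing write sits between it and the version the read used.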


	                                                                                            ##                   
	  

Now let's re-define Definitions 3 & 4 to suit our needs.

Definition 3': Directly predicate-read-depends.(predicate WR conflict).

Transaction Tj directly predicate-read-depends on Ti if Tj performs an operation rj(P: Vset(P)), xk belongs to Vset(P), i = k or xi << xk, and the write in Ti that establishes xi conflicts with rj(P: Vset(P)).

Definition 4': Directly predicate-anti-depends.(predicate RW conflict).

Transaction Tj directly predicate-anti-depends on Ti if Tj performs an operation rj(P: Vset(P)), xk belongs to Vset(P), xi >> xk, and the write in Ti that establishes xi conflicts with rj(P: Vset(P)). Also, Tj doesn't generate the version after xk itself.

Again, a version like xi in these two definitions represents a tuple, a 'decision set' or a field version in types A, B & C of the Serializability Theorem respectively. Notice there is an extra requirement in Definition 4' compared with Definition 4 in Adya's work: Tj doesn't generate the version after xk itself. If Tj does generate the version after xk itself, the dependence between Tj and Ti is captured by a sequence of WW conflicts on x.
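A hedged sketch of how Definitions 3' & 4' classify a conflicting write, given the positions of xi and xk in the version order (illustrative only; the 'conflict with' test of the previous definition is assumed to have passed already):

```python
# Illustrative classification of predicate-based conflicts per
# Definitions 3' & 4'. i, k are version numbers of x_i and x_k.

def predicate_dependency(i, k, tj_writes_next_version):
    """Assumes the write in Ti establishing x_i conflicts with Tj's
    predicate read, which was based on version x_k."""
    if i <= k:
        return "predicate WR"   # Definition 3': i = k or x_i << x_k
    if not tj_writes_next_version:
        return "predicate RW"   # Definition 4': x_i >> x_k
    return None                 # captured by WW conflicts on x instead

assert predicate_dependency(2, 2, False) == "predicate WR"
assert predicate_dependency(3, 2, False) == "predicate RW"
assert predicate_dependency(3, 2, True) is None
```

The `None` branch is exactly the extra requirement of Definition 4': when Tj itself generates the version after xk, the dependence is left to the WW conflicts on x.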

Remark 1: We claim that Tj in Definition 4' generates the version after xk when it also writes x. In Adya's framework, and also in this article's, a predicate read and the reads/writes that follow it are interpreted as different operations. At first glance this might seem to imply that another transaction could insert a write operation between them: for example, after the predicate read identified a tuple as matching, another transaction could update it before the original transaction performed a read. But in today's page-oriented implementations of database systems, this irrational behavior is impossible. For instance, in a lock based system like MySQL InnoDB, before a tuple is identified as matching, its containing page typically has to be locked with a semaphore; after it is identified, a lock on the tuple is requested. In general there is no room for another transaction to intervene in this process with a write. This holds even in the more complex case where the requested lock can't be acquired immediately and a lock wait becomes necessary: the semaphore has to be given up and re-acquired once the lock is obtained, with a check on whether the path structure to the page has changed in the meantime. In a lock free system like Snapshot Isolation, such behavior is impossible as well, since it would introduce conflicting updates in two concurrent transactions. So the claim is sound.

Remark 2: Such a modification is necessary for Definition 4' since Tj's write on x might change the match; it doesn't constitute a conflict, since a conflict is between two different transactions, but it does make xi lose the status of being closest to xk, violating Condition B. Notice that Adya's work doesn't require xi to be closest to xk and hence doesn't need this extra requirement. Such a modification is not necessary for Definition 3' since no committed version is generated before the predicate read in Tj.

We also need to re-visit Conditions 2' & 3' here, since this time, in the predicate case, a version of x doesn't just refer to a tuple or field version; it also refers to a 'decision set' version.

The fact that MySQL InnoDB and PostgreSQL satisfy Condition 2' in the 'decision set' case is clear, since the 'decision set' versions are derived from tuple versions and Condition 2' is satisfied in the case of tuple versions. The proof that NDB Cluster satisfies Condition 2' in the 'decision set' case will be delayed until after we explain its commit logic. And Condition 3' is again just a reasonable requirement for the 'decision set' case.

We also need to re-visit Definition 8 since it also involves 'decision set' versions.

Here we combine Definitions 3' & 5 and 4' & 6, and call them Directly Read-Depends (WR conflict) and Directly Anti-Depends (RW conflict) respectively, for convenience.

Definition 8 remains syntactically identical although predicate-based conflicts are semantically different for the three different types. And for type B, we also have the item-based conflicts between reads & writes of derived 'decision set' versions.

Now we are ready to define a serializable history.

Definition 10: Let H and H' be histories generated by different executions of the same set of transactions, H is equivalent to H' if

  1. H and H' give rise to the same set of database operations: corresponding predicate reads return the same set of tuples, corresponding item reads read the same version of data with same value, while corresponding writes write the same value to the same version of data, for both internal and committed versions. Notice that for type B, the item-based operations also refer to those for 'decision set' versions.
  2. DSG(H) = DSG(H')

H is serializable if it is equivalent to some serial history H'.

Remark: One can generalize this concept of equivalence to a subset of the statements in histories H and H'. Also, it is easy to prove that serial executions of the same set of transactions with different coordinators are equivalent to each other, so the definition of serial history given earlier is sound.

1.3 Type A, B, C of the Serializability Theorem and the proof

Serializability Theorem(type A, B & C): Let H be a history with partial order 'happened before' satisfying Conditions 1, 2', 3', 4 and proscribing Conditions 5 & 6, where the predicate-based conflicts are based on Definitions 3' & 4'. Then H proscribes Condition 7 iff it is serializable.

*****************************************************************************************************************************************               
	  

This theorem and its proof is the core of this project and the correctness of my work largely depends on it. You are invited to review it
for me to make sure it is absolutely flawless. The proof is lengthy (about 10 pages), but straightforward (a nested induction of two levels).
Anyone with a decent math education should be able to comprehend it. For geeks like us, it could entail a fun weekend. Your help is
highly appreciated.
	  

*****************************************************************************************************************************************               
	  

Proof of the Serializability Theorem for type A, B & C:

(If) If H is equivalent to a serial history Hs, DSG(H) is the same as DSG(Hs) by property 2 of Definition 10. The version order of a WW conflict coincides with the execution order of Hs from the discussion after the definition of a serial history, hence the conflict points downward in DSG(Hs). The event order of a WR conflict coincides with the 'happened before' partial order by Condition 2', hence also coincides with the execution order of Hs, and therefore the conflict also points downward in DSG(Hs). Also, a read in a later transaction always reads the latest update before it in the execution order of Hs, so a RW conflict always points downward in DSG(Hs). This implies every conflict in DSG(H) goes from Tm to Tn with m < n, so DSG(H) is acyclic and H proscribes Condition 7.

(Only if) We will prove all three versions of this theorem with the same technique, namely, nested induction. We shall proceed as long as the arguments apply to all three types of the theorem, and split the arguments into specific types when necessary.

That the direct serialization graph of history H, or DSG(H), is acyclic implies that we can apply a topological sort to the graph and order the resulting transaction sequence vertically, so that all the conflict arrows point downward. We denote the resulting downward transaction sequence by T1, T2, …, Tn, for n > 0. Now let's prove that H is equivalent to the serial execution Hs of T1, T2, …, Tn.
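The topological sort step can be sketched with Kahn's algorithm (an illustrative implementation; any topological sort of DSG(H) works):

```python
# Illustrative sketch: topologically sort the transactions of DSG(H) so
# that every conflict edge points downward (Kahn's algorithm).
from collections import defaultdict, deque

def topo_order(transactions, conflicts):
    """conflicts: list of edges (a, b) meaning a conflict from Ta to Tb."""
    indeg = {t: 0 for t in transactions}
    out = defaultdict(list)
    for a, b in conflicts:
        out[a].append(b)
        indeg[b] += 1
    queue = deque(t for t in transactions if indeg[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for b in out[t]:
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    if len(order) != len(transactions):
        raise ValueError("DSG has a cycle; no downward sequence exists")
    return order

# Edges T1->T2, T1->T3, T2->T3 yield the downward sequence T1, T2, T3.
assert topo_order(["T1", "T2", "T3"],
                  [("T1", "T3"), ("T2", "T3"), ("T1", "T2")]) == ["T1", "T2", "T3"]
```

When DSG(H) contains a cycle the sort fails, which is consistent with the theorem: such a history is exactly the non-serializable case.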

Start of outer layer of induction

We proceed with induction here. For convenience we use Ti' to denote Ti in Hs, for 0 < i < n+1. First let's prove that T1 in H and T1' in Hs satisfy property 1 in the definition of equivalent histories, namely, they give rise to the same set of database operations: corresponding predicate reads return the same set of tuples, corresponding item reads read the same version of data with the same value, while corresponding writes write the same value to the same version of data, for both internal and committed versions. From the discussion before Example 1, we know that a statement inside a transaction is either a regular programming language statement (like branching statements, loops, deterministic math calculations, non-deterministic math calculations) or a SQL statement. So actually, we'll prove a stronger result: besides the database operations, the regular programming language statements in T1 and T1' will have the same behavior, since the corresponding variables will have the same value. For now, we assume any programming language statement or SQL statement in discussion does not involve any non-deterministic math calculations; we will handle this exception later.

Start of first inner layer of induction

To show this, we start an inner layer of induction and consider the first statement of T1 and T1'. Since T1 and T1' are different executions of the same transaction and no branching statement has been executed yet, we are looking at the same statement, and the only thing that needs to be argued is whether it will produce the same result when executed. Consider the following situations for this first statement:

  1. A regular programming language statement or a SQL statement that doesn't access the database. Notice that we also include in this category the cursor statements involved in the process of identifying which data objects are to be accessed.
  2. A SQL statement without a predicate read, namely, a regular insertion (a more complex insertion statement like 'insert into t1...select * from t2' will be handled later like a JOIN since it involves more than one table).
  3. A SQL statement with a predicate read, which is usually interpreted as a predicate read followed by a bunch of read and/or write operations (or just a predicate read, like in the statement 'select COUNT(*) from t where …').

In the first case, since we assume for now the transaction doesn't contain any non-deterministic math calculation, all variables involved are just deterministic functions of known constants, like parameters passed into the procedure that contains the transaction. Hence they give rise to the same values in both T1 and T1', so both statements will behave the same. For example, a branching statement will pick the same branch in both executions.

In the second case,

(A) a similar argument as in the first case can be applied so that all the field values in the tuple inserted will be the same in T1 and T1'.

(B) a similar argument as in the first case can be applied so that all the field values in the tuple inserted will be the same in T1 and T1', namely, corresponding derived field versions and 'decision set' versions in T1 and T1' will have the same values.

(C) a similar argument as in the first case can be applied so that all the field values in the insertion will be the same in T1 and T1'.

The internal version written in all three cases is the first one generated for both T1 and T1'.

In the third case, let's first show that both predicate reads with predicate P return the same set of tuples in T1 and T1'. We will prove this by contradiction. From the discussion about serial history, we know that T1' doesn't see any modification by later transactions and hence the tuples returned by its predicate read are based on the original state of the database. Now if the tuples returned by T1's predicate read were different from those of the other transaction's, there must be a tuple tu and some transaction Ti, 1 < i < n+1, such that Ti's write conflicted with T1's predicate read. So there was a predicate WR conflict from Ti to T1 according to Definition 3'. This would contradict the fact that any conflict in the downward sequence T1, T2, …, Tn points downward, and hence both predicate reads return the same set of tuples. Notice the write from Ti in the type A case is a tuple write, in the type B case a write on the 'decision set', and in the type C case a field write.

From the argument above, we have also proved that for any committed version v used by the predicate read of T1, if v is not from the original state of the database, each write that generates a committed version before or equal to v either violates Condition A (in the type A case) or violates Condition B (in all three types). Hence the versions used in the predicate reads of T1 and T1' for each corresponding tuple are either of the same version or differ only by a sequence of non-conflicting writes.

So if the process of identifying which data objects are to be accessed involves a cursor, the discussion for the regular programming statement case gives us the conclusion that the data objects to be accessed related to the first statement will be the same in T1 and T1'.

For possible item reads that follow, the ones in T1' of course depend only on the original database state. If an item read in T1 retrieved a different version from that of the original database state, there must be a transaction Ti, 1 < i < n+1, such that there was an item WR conflict from Ti to T1, which would contradict the fact that any conflict in the downward sequence T1, T2, …, Tn points downward. So any item read in the first SQL statement of T1 also reads from the original database state, and the corresponding reads are of the same value and version. Notice that in the type B case, an item read could also be the implicit read of the initial internal 'decision set' version when it is updated for the first time, besides the usual field reads as in the type C case.

For possible writes that follow, the value written to each field will be a deterministic function of constants and field values (like in 'update … set col=col+1'), which are of the same values for T1 and T1', and therefore will be the same.

(A) For this type, the fields that are not modified in the first internal version in T1 are the same as those in T1', which again depend only on the original database state. This is clear because the tuple read before the write, if present, returns the same version with the same value, from the item read discussion. So the tuple versions generated are of the same value for both T1 and T1'. And these writes generate the same internal version in T1 and T1' since they are the first internal versions.

(B) For this type, a tuple version is still generated. And the internal field version of a written field will be, without any surprise, the first version generated in the transaction for both T1 and T1'. Notice that the tuple versions generated may not necessarily be the same in T1 and T1'; an example will be given after the proof. If fields in a 'decision set' are also written, an internal 'decision set' version is also generated and it is guaranteed to be the first one. So whether the generated internal 'decision set' versions are of the same value depends only on the initial internal 'decision set' version. But from the discussion of the item read case we can conclude that the initial internal 'decision set' version read in T1 and in T1' is the same, since the reading of the initial internal 'decision set' version is just an item read.

(C) For this type, there is no tuple version concept and the internal field version of a written field is the first version generated in the transaction for both T1 and T1'.

After proving the first statement in T1 and T1' give rise to the same behavior whether it is a SQL statement or a regular programming language statement, we may assume by induction that the first k, k > 0, statements in T1 and T1' behave the same: if it is a regular programming language statement, it gives rise to the same behavior in T1 and T1'; if it is a SQL statement, it gives rise to the same set of database operations, corresponding predicate reads return the same set of tuples, corresponding item reads read the same version of data with same value, while corresponding writes write the same value to the same version of data.

Now let's look at the (k+1)th statement in both T1 and T1'. First if the (k+1)th statement exists in one of these two transactions, it also exists in the other transaction and is the same statement. That is because from the induction assumption we know that previous execution of both transactions must have followed the same execution path and arrived at the same place in the transaction.

Next, if the (k+1)th statement is a regular programming language statement or a SQL statement that doesn't access the database, any variable it involves is a deterministic function of constants and of variables appearing in earlier statements, which are of the same value for both T1 and T1' by the induction assumption. So the (k+1)th statement in T1 and T1' will give rise to the same behavior in this case. A similar argument applies when the (k+1)th statement is a SQL statement without a predicate read, i.e., a regular insertion, and this insertion generates the same data object versions as before.

If the (k+1)th statement is a SQL statement with a predicate read, we first consider the predicate read in both T1 and T1'.

It follows directly from the induction assumption that the set of tuples altered by earlier statements in T1 equals that of T1', and hence the set of tuples NOT altered by earlier statements in T1 also equals that of T1'. For the set of tuples NOT altered by earlier statements, we may just apply a similar argument as in the discussion of the first statement case to conclude that the predicate read will identify the same subset of matching tuples for T1 and T1'. And again the versions used in the predicate reads of T1 and T1' for each corresponding tuple are either of the same version or differ only by a sequence of non-conflicting writes.

For the set of tuples altered by earlier statements, the version chosen by Vset(P) is the latest update within the transaction by Condition 3', which are of the same internal version and the same value in T1 and T1' by the induction assumption. Hence these altered tuples will also give rise to the same subset of matching tuples for T1 and T1'.

Hence the predicate read returns the same set of tuples in T1 and T1'.

If the process of identifying which data objects are to be accessed also involves a cursor, the discussion for the regular programming statement case gives us the conclusion that the data objects to be accessed related to the (k+1)th statement will be the same in T1 and T1'.

For possible item reads that follow, by induction assumption corresponding items in T1 and T1' are either both modified by earlier statements or neither of them is. If the item is not modified by earlier statements, a similar argument as for the first statement case can be applied to show that corresponding item reads are the same in T1 and T1'; if the item is modified by earlier statements, induction assumption suggests that corresponding item reads return the same version of data with the same value in both T1 and T1' by Condition 3'.

For possible item writes that follow, each field written will be the same since it is a deterministic function of constants and of other variables and field values (like in 'update … set col=col+1'), which are of the same values for both T1 and T1' by the induction assumption and the previous paragraph. If the data object written is not modified by earlier statements, a similar argument as in the first statement case can be applied to conclude that the same value will be written to the same version of the data object. If the data object written is modified by earlier statements,

(A) the tuple version generated will be of the same value in both T1 and T1' because the fields written will be the same as stated earlier and the fields not written will be the same by induction assumption; that the internal version for the tuple will be the same in T1 and T1' follows directly from induction assumption.

(B) the internal version for the generated fields will be the same in both T1 and T1' by the induction assumption. If the 'decision set' was updated earlier in T1 and T1', the version generated now will be of the same value in both T1 and T1' because the fields written will be the same as stated earlier and the fields not written will be the same by induction assumption; that the internal version for the 'decision set' will be the same in T1 and T1' follows directly from induction assumption too. If on the other hand it is not updated earlier, similar arguments as in the first statement case will give rise to the same conclusion. Again although a tuple version is generated, we can't conclude that they are of the same version and same value in T1 and T1' as before.

(C) the internal version for the generated fields will be the same in both T1 and T1' by the induction assumption.

Completion of first inner layer of induction

This completes the inner layer of induction and leads to the conclusion that T1 and T1' behave the same. It also implies the committed versions of data objects in T1 and T1' will be of the same value. The version of each committed data object in T1' will, of course, be the one after the original database state. And this is true for T1 too: if it were not, there must be a transaction Ti, 1 < i < n+1, such that a WW conflict arose between Ti and T1, and this would again contradict the fact that any conflict in the downward sequence T1, T2, …, Tn points downward. So T1 and T1' generate the same committed versions of data objects with the same values. Again, in the type B case, a data object could be a 'decision set'.

This finishes the proof of the initial condition for the outer layer of induction. And the induction assumption goes as follows: there exists m, 0 < m < n+1, such that subset {T1, T2, …, Tm} of history H and subset {T1', T2', …, Tm'} of history Hs satisfy property 1 of Definition 10. Namely, they give rise to the same set of database operations: corresponding predicate reads return the same set of tuples, corresponding item reads read the same version of data with the same value, while corresponding writes write the same value to the same version of data, for both internal and committed tuple, field and 'decision set' versions.

Now let's prove that T(m+1) and T(m+1)' give rise to the same behavior. Again, we'll prove a stronger result: besides the database operations, the regular programming language statements in T(m+1) and T(m+1)' will behave the same since the corresponding variables will have the same values.

Start of second inner layer of induction

To show this, we'll start another inner layer of induction. Let's consider the first statement in T(m+1) and T(m+1)'. Again if the first statement is a regular programming language statement, a SQL statement that doesn't access the database or a regular insertion, a similar argument as in the T1 and T1' case can be used to conclude that both T(m+1) and T(m+1)' will behave the same since no information has been retrieved from the underlying database yet.

If the first statement is a SQL statement with a predicate read for predicate P, let's first show that both predicate reads return the same set of tuples. We prove it by contradiction, assuming there was a tuple tu such that it was in the match set of one of these predicate reads, but not both.

Let's say the match decision for the predicate read of T(m+1) and T(m+1)' is based on committed version v and v' respectively for tu. Here v and v' are tuple, 'decision set' or field versions for type A, B & C respectively, so are others in the following discussion about predicate reads for predicate P. We also use notations like v >> v', v << v' or v == v' to represent the version number of v be greater than, smaller than or equal to that of v' in the rest of the proof.

If v' was from the original database state and v was not, there must be a transaction Ti such that Ti generated v'' and conflicted with the predicate read in T(m+1), with v'' <= v. If i < m+1, Ti' would generate a corresponding version in Hs with the same version number and same value by the outer layer induction assumption. Without any ambiguity we may call it v'' in Hs too, and the predicate read of T(m+1)' must be based on a version greater than or equal to v'' since Hs is serial, which contradicts the fact that it is based on the original database state. If i > m+1 on the other hand, there would be a predicate WR conflict from Ti to T(m+1), which contradicts the fact that any conflict in the downward sequence T1, T2, …, Tn points downward.

If both v and v' are from the original database state, a contradiction is immediately arrived at. Hence if v' was from the original database state, a contradiction always arises and the assumption that tu is in the match set of one of these predicate reads, but not both, can't be true. In this case, the predicate reads from T(m+1) and T(m+1)' will return the same set of tuples.

This argument can also be used to prove that versions used in the predicate reads of T(m+1) and T(m+1)' for each corresponding tuple either are of the same version or only differ by a sequence of non-conflicting writes.

On the other hand, if v' was not from the original database state, there must be an integer i, 0 < i < m+1, such that Ti' generated v'. By the outer layer induction assumption, Ti in H must also generate a corresponding version which was of the same version and same value as v'. So without any ambiguity we call it v' too in H.

In H, if v << v', there are two cases. In the first case, T(m+1) generated a committed version after v. Assuming T(m+1), …, Tk to be the sequence of transactions that generated versions between v and v' (exclusively), there must be a WW conflict between any two neighbors in the sequence T(m+1), …, Tk, Ti. But this certainly contradicts the fact that any conflict in the downward sequence T1, T2, …, Tn points downward, since i < m+1. In the second case, T(m+1) didn't generate a committed version after v. Then according to Definition 4', there must be v'' such that v << v'' <= v' and the write that generated v'' conflicted with the predicate read, since v and v' couldn't both be in the match set of P. Say the transaction that generated v'' in H was Tj; then j must be smaller than m+1 by the outer layer induction assumption, because the corresponding version of v'' in Hs must be generated by a transaction with subscript smaller than m+1. Therefore there was a predicate RW conflict from T(m+1) to Tj, which contradicts the fact that any conflict in the downward sequence T1, T2, …, Tn points downward.

In H, if v >> v', there must be v'' such that v >= v'' >> v' and the write that generated v'' changed the match of the predicate, since v and v' couldn't both be in the match set of P. Choose v'' to be one that was closest to v. Say the transaction that generated v'' was Tk. Integer k couldn't be smaller than m+1: if it were, by the outer layer induction assumption, a corresponding version with the same version number and same value as v'' must also be generated in Hs by a transaction Tk'. So without any ambiguity we call it v'' too in Hs. But then the predicate read in T(m+1)' must be based on a version greater than or equal to v'' since k < m+1 and Hs is serial, which contradicts the fact that it is based on v' << v''. Hence k must be greater than m+1. But then there was a predicate WR conflict from Tk to T(m+1), which contradicts the fact that any conflict in the downward sequence T1, T2, …, Tn points downward.

If v == v', the outer layer induction assumption implies they are of the same value. This contradicts the assumption that tu is in the match set of one of these predicate reads, but not both.

Hence if v' is not from the original database state, a contradiction always arises and the assumption that tu is in the match set of one of these predicate reads, but not both, can't be true. In this case, the predicate reads from T(m+1) and T(m+1)' will also return the same set of tuples.

In the case v' is not from the original database state, v can in general either be of the same version as v', or be a version before it, or be a version after it. If v >> v', a similar argument as in the case where v' is from the original database state shows that v and v' only differ by a sequence of non-conflicting writes. If v << v', there must be a version in H with the same version number and same value as v' by the outer layer induction assumption. Without any ambiguity we may call it v' too in H. We claim that any version v'' in H, v << v'' <= v', is from a write that doesn't conflict with the predicate read in T(m+1). Suppose that were not true and there was a transaction Tj such that Tj generated version v'', the write in Tj changed the match of predicate P, and v'' was the closest such version after v. Since v'' <= v', j must be smaller than m+1, and this would lead to a predicate RW conflict from T(m+1) to Tj, which contradicts the fact that any conflict in the downward sequence T1, T2, …, Tn points downward. Hence in this case the versions used in the predicate reads of T(m+1) and T(m+1)' for each corresponding tuple either are of the same version or only differ by a sequence of non-conflicting writes.

Therefore both predicate reads in T(m+1) and T(m+1)' always return the same set of tuples. And the versions used in the predicate reads of T(m+1) and T(m+1)' for each corresponding tuple either are of the same version or only differ by a sequence of non-conflicting writes.

If the process of identifying which data objects to access also involves a cursor, the discussion for the regular programming statement case gives us the conclusion that the data objects to be accessed by the first statement in T(m+1) and T(m+1)' will be the same.

For the possible item reads that follow, let's say the versions read by T(m+1) and T(m+1)' are v and v' respectively. A version with the same committed version number and same value as v' also exists in H: if v' is from the original database state this is apparent, and if v' is generated by a transaction Ti', i < m+1, Ti generates it too by the outer layer induction assumption. Without causing any ambiguity we call it v' too in H.

If v' << v in H, v must be generated by Tj, with j > m+1. For if it were not and j < m+1, Tj' would generate a version with the same version number and same value as v in Hs. Again without causing any ambiguity we call it v too in Hs; then T(m+1)' must have read a version greater than or equal to v >> v' in Hs, and a contradiction arises. Hence j > m+1 as claimed. But then there was an item WR conflict from Tj to T(m+1), which contradicts the fact that any conflict in the downward sequence T1, T2, …, Tn points downward.

On the other hand, if v' >> v in H, assume the sequence of committed versions between v and v' to be v = v0, v1, …, vk = v', for some k >= 1. Consider version v1: if it was generated by Tj with j < m+1, there must be an item RW conflict from T(m+1) to Tj, which would contradict the fact that any conflict in the downward sequence T1, T2, …, Tn points downward. So j must be greater than m+1. Consider v2 if it exists: it must be generated by Ts with s > m+1, since if it were not, there would be an item WW conflict from Tj to Ts, which again would contradict the fact that any conflict in the downward sequence T1, T2, …, Tn points downward. Repeating this argument, we would eventually come to the conclusion that vk = v' was generated by Ti with i > m+1, and again a contradiction is arrived at.

Hence v == v' and the corresponding item reads in T(m+1) and T(m+1)' read the same item version, and therefore the same value, by the outer layer induction assumption.

For possible item writes that follow, the value written to the corresponding field in T(m+1) and T(m+1)' will be a deterministic function of constants and field values (like in 'update … set col=col+1') which are of the same values, and therefore will be the same.

A similar argument as in the T1 and T1' case can be applied to conclude that the item writes generate data items that are of the same internal version and value in T(m+1) and T(m+1)'.

This completes the proof of the initial condition for the second inner layer of induction and we may assume by induction that the first k, k > 0, statements in T(m+1) and T(m+1)' behave the same: if a statement is a regular programming language statement, it gives rise to the same behavior in T(m+1) and T(m+1)'; if a statement is a SQL statement, it gives rise to the same set of database operations, corresponding predicate reads return the same set of tuples, corresponding item reads read the same version of data with the same value, while corresponding writes write the same value to the same version of data.

Now let's look at the (k+1)th statement in both T(m+1) and T(m+1)'. First if the (k+1)th statement exists in one of these two transactions, it also exists in the other transaction and is the same statement. That is because from the induction assumption we know that previous execution of both transactions must have followed the same execution path and arrived at the same place of the transaction.

Next, if the (k+1)th statement is a regular programming language statement or a SQL statement that doesn't access the database, any variable it involves is a deterministic function of constants and variables that showed up in earlier statements, which are of the same value for both T(m+1) and T(m+1)' by the induction assumption. So the (k+1)th statement in T(m+1) and T(m+1)' will give rise to the same behavior. A similar argument applies to the case when the (k+1)th statement is a SQL statement without a predicate read, i.e., a regular insertion, and this insertion generates the same data object versions as before.

If the (k+1)th statement is a SQL statement with a predicate read for predicate P, we first consider the predicate read in both T(m+1) and T(m+1)'.

It follows directly from the induction assumption of the second inner layer induction that the set of tuples altered by earlier statements in T(m+1) equals that of T(m+1)'; hence the set of tuples NOT altered by earlier statements in T(m+1) also equals that of T(m+1)'. For the set of tuples NOT altered by earlier statements, we may just apply a similar argument as in the discussion of the first statement case to conclude that the predicate will identify the same subset of matching tuples for T(m+1) and T(m+1)'.

For the set of tuples altered by earlier statements, the tuple, 'decision set' or field versions used by Vset(P) are the latest updates within the transaction by Condition 3', which are of the same internal version and same value in T(m+1) and T(m+1)' by induction assumption of the second inner layer induction, hence these altered tuples will also give rise to the same subset of matching tuples for T(m+1) and T(m+1)'.

Hence the predicate read returns the same set of tuples in T(m+1) and T(m+1)'.

And a similar argument as in the first statement case will lead to the conclusion that the tuple, 'decision set' or field versions used in the predicate reads of T(m+1) and T(m+1)' for each corresponding tuple either are of the same version or only differ by a sequence of non-conflicting writes.

So if the process of identifying which data objects to access involves a cursor, the discussion for the regular programming statement case gives us the conclusion that the data objects to be accessed by the (k+1)th statement in T(m+1) and T(m+1)' will be the same.

A similar argument as in the T1 and T1' case can be applied to conclude that possible item reads and writes that follow will be the same in T(m+1) and T(m+1)'.

Completion of second inner layer of induction

This completes the second inner layer of induction and establishes that T(m+1) and T(m+1)' behave the same. It also implies that the committed versions of data objects in T(m+1) and T(m+1)' have the same values. Let's say the version of the committed data object generated in T(m+1)' is the (p+1)th, p >= 0, after that of the original database state. So the first m transactions in Hs generate the first p versions. By the induction assumption of the outer layer induction, the first m transactions in H generate the first p versions too, no more and no less. This implies T(m+1) can only generate the (p+q)th version of the data object, with q >= 1. If q were greater than 1, some transaction Ti with i > m+1 must have generated the (p+q-1)th version since p+q-1 > p. This would lead to a WW conflict between Ti and T(m+1) and would again contradict the fact that any conflict in the downward sequence T1, T2, …, Tn points downward. So T(m+1) and T(m+1)' generate the same, namely the (p+1)th, committed versions of data objects with the same values. Again in the type B case, a data object could be a 'decision set'.

Completion of outer layer of induction

This completes the outer layer of induction and finishes the proof of property 1 of Definition 10.

For property 2 of Definition 10, it is easy to see from property 1 that the item-based conflicts are identical for H and Hs.

For the predicate-based conflicts, from property 1 we know that both histories generate the same tuple, 'decision set' and field version sequence for the relevant data object that is used to decide the match of a predicate read in three types of the Serializability Theorem respectively. In the proof, we've shown that the versions used in the corresponding predicate read in H and Hs only differ by at most a few writes that could NOT change the match. This implies the relative order between a write that could change the match and the predicate read remains the same in both histories. Hence it leads to identical predicate-based conflicts for H and Hs.

So DSG(H) = DSG(Hs) and we've finished the proof, with the following three exceptional cases left open:

  1. Transactions in H contain non-deterministic functions like random function RAND().
  2. Transactions in H contain timestamp functions like NOW(), which behaves like non-deterministic functions since every time it is invoked, it returns a different value.
  3. The statements in H involve more than one table: complex statements like joins, sub-queries and unions have been excluded.

We will handle exceptional case 1 right away and leave case 2 & 3 until later sections.

The problem with a non-deterministic function like RAND() is that when it is invoked in the corresponding transaction in the serial history Hs, it doesn't necessarily return the same value as in the original history H. This can sometimes lead to different behavior in the two histories, and then we can't decide whether the original history is serializable. However, the randomness of the function implies that there must be one execution of the serial history in which the values RAND() took in the original history are re-generated. And this readily implies serializability.

Someone may argue that RAND() is not really random but only pseudo-random. In that case, it only means that a specific implementation of RAND() doesn't necessarily re-generate the values of the original history when the serial history executes, but the real random function it simulates still does. That is good enough for us, since the limitations of RAND()'s implementation don't change the correctness of the argument in the last paragraph.

Alternatively, we may look at it like this: the original history is equivalent to a serial history in which corresponding calls to RAND() returned the same values. In other words, we view the outputs of RAND() as constants passed into the transactions.
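The 'outputs as constants' view can be made concrete: record what RAND() returned in the original history, then feed those recorded values back in when the serial history replays. The following is a toy sketch of this idea; the transaction body and helper names are hypothetical, not from any real system.

```python
import random

def make_recording_rand(log):
    """RAND() for the original history: draws fresh values and records them."""
    def rand():
        v = random.random()
        log.append(v)
        return v
    return rand

def make_replaying_rand(log):
    """RAND() for the serial replay: returns the recorded outputs as constants."""
    it = iter(log)
    def rand():
        return next(it)
    return rand

def transfer(db, rand):
    """A toy transaction whose write depends on RAND()."""
    bonus = 1 if rand() > 0.5 else 0
    db["value"] = db["value"] + 100 + bonus

# Original history: RAND() behaves randomly and its outputs are logged.
log = []
db1 = {"value": 0}
transfer(db1, make_recording_rand(log))

# Serial replay: the logged outputs are passed back in as constants, so the
# transaction behaves exactly as it did in the original history.
db2 = {"value": 0}
transfer(db2, make_replaying_rand(log))

assert db1 == db2  # identical behavior despite the non-determinism
```

The replay side never consults the random number generator at all, which is exactly the sense in which the RAND() outputs can be viewed as constants passed into the transactions.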


##

1.4 An issue with Condition 3'

Condition 3' basically says that inside a transaction, a read will see writes to the same item written by the transaction itself. This may not be true for some systems, particularly those implementing Snapshot Isolation, which read only from a snapshot identified by a transaction's starting timestamp. In that case, we must prove a Serializability Theorem in which a read does NOT see writes to the same item written by its own transaction. But that is a trivial task: in the proof of such a theorem, we only handle the case where this kind of write never happens and remove all the arguments dealing with the case when it does.
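The difference Condition 3' is about can be seen in a toy MVCC read path for a single item; this is a hypothetical sketch, not the code of any real engine. Under Condition 3' a read first consults the transaction's own write set, while a pure snapshot read consults only versions committed before the transaction's start timestamp.

```python
def read(item, txn, versions, see_own_writes):
    """versions: list of (commit_ts, value) for one item, ascending by ts.
    txn: dict with 'start_ts' and 'writes' (the transaction's write set).
    Returns the value the read observes."""
    if see_own_writes and item in txn["writes"]:
        return txn["writes"][item]  # Condition 3': read your own write
    # Snapshot read: latest version committed before the transaction started.
    visible = [v for ts, v in versions if ts < txn["start_ts"]]
    return visible[-1]

versions = [(5, "v0"), (8, "v1")]                # committed at ts 5 and 8
txn = {"start_ts": 10, "writes": {"x": "mine"}}  # txn has written item x

assert read("x", txn, versions, see_own_writes=True) == "mine"
assert read("x", txn, versions, see_own_writes=False) == "v1"
```

Dropping the own-write branch is exactly the simplification the paragraph above describes: the corresponding arguments simply disappear from the proof.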

1.5 Neighborhood of match change

In the process of proving that the predicate reads in H and Hs give rise to the same set of tuples in all three types of the Serializability Theorem, we saw that the predicate reads in H and Hs may not use the same version to make the match decision, but the difference between them can't contain a write that changes the match of the predicate. So we define the following:

Definition 11: For versions in a history that decide the match of a predicate P (tuple versions in the type A case, 'decision set' versions in the type B & C cases), the subset of versions representing writes that change the match of P are defined to be versions of match change for P. The set of all versions before the first version of match change, or a version of match change together with all the versions before the next version of match change, is defined to be a neighborhood of match change.

Then in H and Hs, corresponding predicate reads use versions from the same neighborhood of match change to make the decision for every tuple.
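Definition 11 amounts to a simple partition of the version sequence. A sketch, with the match-changing versions supplied as a set (the version names are illustrative):

```python
def neighborhoods(versions, changes_match):
    """Partition an ordered version sequence into neighborhoods of match
    change (Definition 11). changes_match: the versions of match change
    for predicate P; each group starts at one of them (or is the prefix
    before the first) and runs up to the next."""
    groups, current = [], []
    for v in versions:
        if v in changes_match and current:
            groups.append(current)
            current = []
        current.append(v)
    if current:
        groups.append(current)
    return groups

# v2 and v4 change the match of P; the other writes are irrelevant to P.
print(neighborhoods(["v0", "v1", "v2", "v3", "v4"], {"v2", "v4"}))
# [['v0', 'v1'], ['v2', 'v3'], ['v4']]
```

In these terms, the claim above is that the versions used by corresponding predicate reads in H and Hs always fall in the same group.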

1.6 Comparing three types of the Serializability Theorem

Example 9: Let t be the table defined in Example 0 and (1,2,4,0) be the only tuple in t. Consider transactions T1, T2, T3, each with one SQL statement, as follows (the 'start transaction' and 'commit' boiler-plate is skipped):


        T1: select value from t where id1 = 2;

        T2: update t set id2=3 where id1 = 1;

        T3: update t set id3=5 where id1 = 1;
	  

Execute this trio serially in two orders, T1, T2, T3 and T2, T3, T1; the histories generated are referred to as H and H' respectively.

In the case where conflicts are interpreted at tuple level granularity, there is only a WW conflict from T2 to T3, and H is equivalent to H' and is serializable (what a surprise!). The tuple version in H used by T1 to make the match decision for the predicate read is from the original database state, while that for H' is two versions after it. These two predicate reads both return the same set of tuples (the empty set), and the versions they used only differ by a sequence of writes that don't change the match, i.e., they lie in the same neighborhood of match change. This is exactly what we've proved for type A's predicate-based conflicts.

In the case where conflicts are interpreted at field level granularity, there is no conflict in the DSGs, and again H is equivalent to H' and is serializable. The field versions in H and H' used by T1 to make the match decision for the predicate read are both from the original database state. This is exactly what we've proved for type B/C's predicate-based conflicts: we don't need to pay attention to the irrelevant writes that don't intersect with the 'decision set', as we had to in the tuple level granularity case. Notice in the type B case, if we swap T2 and T3 in H', H is still equivalent to this new H'. But the tuple versions generated in this new H' are no longer the same as those of the old H'. This is an example we promised when we defined type B of the Serializability Theorem.

So we have successfully generalized the newly defined predicate-based conflicts from tuple level granularity to 'decision set'/field level granularity, and the latter does make things simpler and clearer. Of course, the other benefits of field level granularity we mentioned before, like reducing conflicts in a DSG, are also demonstrated in this example. This is a major difference between type A and type B/C.
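The tuple-versus-field contrast in Example 9 can be reproduced with a toy WW conflict detector; this is a hypothetical sketch where the write sets are supplied as plain data rather than derived from the SQL.

```python
def ww_conflicts(writes, granularity):
    """writes: list of (txn, tuple_id, field), in history order.
    Report WW conflict pairs between distinct transactions at the
    chosen granularity ('tuple' or 'field')."""
    conflicts = set()
    for i, (t1, row1, f1) in enumerate(writes):
        for t2, row2, f2 in writes[i + 1:]:
            if t1 == t2 or row1 != row2:
                continue  # same transaction or different tuples: no conflict
            if granularity == "tuple" or f1 == f2:
                conflicts.add((t1, t2))
    return conflicts

# Example 9: T2 writes field id2 and T3 writes field id3 of the same tuple.
writes = [("T2", 1, "id2"), ("T3", 1, "id3")]
print(ww_conflicts(writes, "tuple"))  # {('T2', 'T3')}: one tuple-level WW conflict
print(ww_conflicts(writes, "field"))  # set(): no conflict at field level
```

At tuple granularity the two updates collide because they touch the same row; at field granularity they touch disjoint fields and the DSG edge disappears, just as the example describes.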

For type B and type C, at first glance, they both get rid of the WW conflict in the DSG. But in 'pessimistic technology' like MySQL InnoDB or NDB Cluster, row locks are still acquired before the row is written. So although consistency based conflicts are ruled out, lock based conflicts are still there. A similar story applies to 'optimistic technology' like PostgreSQL, because in Snapshot Isolation write conflicts are still tuple based. So even if we did apply type B/C of the theorem to these cases, we wouldn't have any advantage over type A, not until someday we can incorporate a field based serialization mechanism like field level locking into our system.


##

Example 0 (Continuation...): If we apply type B of the Serializability Theorem, a field level generalization of Adya's Serializability Theorem, to Example 0, a WW conflict from T2 to T1 on the 'decision set' consisting of fields {id2, id3} is identified. Since there is a conflict loop in this history, its inconsistent behavior is not a surprise now. On the other hand, if we apply type A of the Serializability Theorem, a modified version of Adya's Serializability Theorem, to Example 0, the conflict from T2 to T1 is a tuple WW one as before.

Earlier we claimed that it's mostly Fekete's generalization (at least the way I understand it) that causes Example 0 to surface. Now we see why: it failed to recognize the concept of a 'decision set' and its associated conflicts. In Adya's Serializability Theorem, this is unnecessary since such a conflict will always be masked with a tuple WW one.

Earlier we also mentioned that Example 0 doesn't show up in PostgreSQL's serializability implementation since it views the update of id2 in T2 and the update of id3 in T1 as a WW conflict. This means the resolution of this specific conflict is at tuple level. So although Fekete's paper claims they have generalized Adya's work to field level and PostgreSQL depends on Fekete's paper, PostgreSQL at best implements a hybrid system where conflicts are resolved at both field and tuple levels. Can a hybrid version of the Serializability Theorem be proved? We'll give an affirmative demonstration later. But the question that triggers Example 0 remains unanswered: what exactly is the version of the Serializability Theorem PostgreSQL thinks their serializability implementation relies on? It is up to PostgreSQL to clarify this of course.


##

1.7 Generalized Serializability Theorems

Historically, there are different flavors of serializability other than the one in the Serializability Theorem we've just proved, namely, conflict serializability. One of them assumes that there are no conflict loops among the Read-Write transactions in the history. In this case, even if there is a loop in the history, there still exists a serial history of the Read-Write transactions that will leave the database as in the same state as the original history does.

Theorem (Generalized Serializability Theorem I): Assume the same conditions as in the Serializability Theorem, except that the conflict loop in Condition 7 is replaced with one among the Read-Write transactions in H; name this set of transactions C, as in consistency. Then C is free of such a conflict loop iff it is equivalent to a serial execution of its own.

Proof: In the proof of the Serializability Theorem, predicate reads and item reads dictate the execution of a transaction, in particular, the writes to the database. This certainly applies to the Read-Write transactions here. What is more, the fact that a Read-Only transaction doesn't write the database means that it can't affect the behavior of the Read-Write transactions. In other words, the Read-Write transactions will behave the same whether the Read-Only transactions exist or not. Hence the conflicts between the Read-Write transactions completely determine their effect on the database and a proof similar to that of the Serializability Theorem applies here.
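The hypothesis of the theorem lends itself to a mechanical check: build the conflict graph, keep only the edges between the transactions of C, and search for a cycle. A minimal sketch, with transaction names and conflict edges supplied as plain data (the names are illustrative):

```python
def has_conflict_loop(edges, nodes):
    """Detect a conflict loop (directed cycle) among `nodes`, considering
    only conflict edges whose endpoints both lie in `nodes`."""
    graph = {n: [] for n in nodes}
    for a, b in edges:
        if a in graph and b in graph:
            graph[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}
    def dfs(n):
        color[n] = GRAY                 # on the current DFS path
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True             # back edge: a conflict loop
        color[n] = BLACK                # fully explored
        return False
    return any(color[n] == WHITE and dfs(n) for n in nodes)

# A loop between a Read-Write transaction T1 and a Read-Only transaction T2:
edges = [("T2", "T1"), ("T1", "T2")]
print(has_conflict_loop(edges, {"T1", "T2"}))  # True: H itself has a loop
print(has_conflict_loop(edges, {"T1"}))        # False: C (Read-Write only) is loop-free
```

The second call is the situation the theorem addresses: once the Read-Only transactions are dropped, the remaining edge set among C may well be acyclic even though H as a whole is not.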


##

Remark: There is actually an issue with the statement 'the fact that a Read-Only transaction doesn't write the database means that it can't affect the behavior of the Read-Write transactions' in this proof. A call to the RAND() function in a Read-Only transaction can affect a following Read-Write transaction in a very subtle way: the value u returned by RAND() in a Read-Only transaction is from a sequence that simulates the random behavior of RAND(); so if a Read-Write transaction later calls RAND(), the returned value v has to be one after u in that sequence. Think about the situation where in one history a Read-Only transaction calls RAND() while in the other history the corresponding Read-Only transaction doesn't (since they don't have to behave the same now); a subsequent Read-Write transaction in both histories will call RAND() and the calls return different values. However, this can be handled with a similar argument as for the RAND() case in the Serializability Theorem: C is equivalent to a serial execution of itself in which corresponding calls to RAND() returned the same values; or alternatively, C can be viewed as an execution with the returned values from RAND() being constants. So it doesn't pose an actual problem.

Example 10: This example is from pg. 12 of chapter 1 in [Be 87]. It's been rephrased in the context of Example 0 for our purpose. Suppose the table in Example 0 consists of two rows: (1, 2, 4, 200) & (2, 4, 8, 200). These two rows represent two bank accounts of a customer where the 'value' field is the balance of the account and the 'id1' field is the account number. A scheduled transfer of $100 by the bank is performed by transaction T1 from account #1 to account #2. At about the same time, the customer who owns these two accounts wants to print out the total balance of them in transaction T2. This could lead to the following interleaved execution under Read-Committed isolation level, say, the one provided by NDB Cluster:


                    T1                                                                             T2

        start transaction;

        select value from t where id1 = 1;

                                                                                   start transaction;

                                                                                   select value from t where id1 = 1;

        update t set value=value-100 where id1 = 1;

        select value from t where id1 = 2;

        update t set value=value+100 where id1 = 2;

        commit;

                                                                                   select value from t where id1 = 2;

                                                                                   print the sum of the values from the last two                             
                                                                                   statements to the screen for the customer

                                                                                   commit;                                                                                  
	  

The printed sum for these two accounts will be $500 ($200 + $300). The consistency property that a transfer does not change the sum of the accounts is violated. There is actually an item RW conflict from T2 to T1 and an item WR conflict from T1 to T2, so a conflict loop is formed. Hence the inconsistency shown here is not a surprise. However, if we apply Generalized Serializability Theorem I to this example (the set of Read-Write transactions only consists of one element and certainly can't contain any conflict loop), we may conclude that such an inconsistency is not observable from the database state after both transactions commit. The sum is still $400 ($100 + $300) and the inconsistency only shows up in the Read-Only transaction.
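The arithmetic in Example 10 can be replayed with a deliberately simplified model: the committed state is a plain dict, and T1's updates are applied in place since under Read-Committed its commit precedes T2's second read.

```python
# Committed state: account balances keyed by id1, as in Example 10.
db = {1: 200, 2: 200}

# Interleaved execution under Read-Committed: every read sees the
# latest committed value.
t2_first = db[1]    # T2 reads account #1: 200
db[1] -= 100        # T1: transfer $100 out of account #1
db[2] += 100        # T1: transfer $100 into account #2; T1 commits
t2_second = db[2]   # T2 reads account #2 after T1's commit: 300

printed_sum = t2_first + t2_second
final_sum = db[1] + db[2]
print(printed_sum)  # 500: the inconsistency T2 observes and prints
print(final_sum)    # 400: the committed database state stays consistent
```

The lone Read-Write transaction T1 trivially contains no conflict loop, so by Generalized Serializability Theorem I the committed state must equal some serial execution of T1 alone, which is exactly what `final_sum` shows.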


##

What is demonstrated in this example is summarized in the following

Corollary: If C in Generalized Serializability Theorem I doesn't contain a conflict loop, H will leave the database in a consistent state if it started out consistent.

Proof: Since the Read-Only transactions in H don't update the database, H leaves the database in a state as C does and hence the conclusion.


##

Generalized Serializability Theorem I suggests that if we could devise a way to make sure the Read-Write transactions do not contain a conflict loop, we may only see inconsistencies through the Read-Only transactions. What if we only want to see inconsistency from some, but not all, of the Read-Only transactions? The following theorem provides an answer.

Theorem (Generalized Serializability Theorem II): Assume the same conditions as in Generalized Serializability Theorem I, only this time we consider a conflict loop in the set consisting of the Read-Write transactions and some of the Read-Only ones in H; name this set C, as in consistency. Then C is free of such a conflict loop iff it is equivalent to a serial execution of its own.

Proof: Similar to proof of Generalized Serializability Theorem I.


##

This is not a surprise either since this time we exclude a possibly smaller set of transactions that can't affect database writes than in Generalized Serializability Theorem I. And the following corollary is true for Generalized Serializability Theorem II:

Corollary: If C in Generalized Serializability Theorem II doesn't contain a conflict loop, H will leave the database in a consistent state if it started out consistent. What's more, inconsistency will not show up in the Read-Only transactions of C.

Proof: It follows from Generalized Serializability Theorem II.


##

We'll see how these theorems are applied to the TPCC benchmark example in the next section.

Now let's apply the Serializability Theorem to the three SQL implementations: PostgreSQL, MySQL InnoDB and NDB Cluster.

Corollary 1 (for the Serializability Theorem): Along the way, we've proved that PostgreSQL (in particular its Snapshot Isolation) satisfies all the conditions for the Serializability Theorem; hence we may apply it to PostgreSQL's Snapshot Isolation and achieve serializability.

In particular, this could provide a fix to PostgreSQL's serializability implementation if Fekete's generalization did cause a problem. On the other hand, Cahill's paper ([Ca 08]) doesn't make any theoretical change to [Fe 05], so as long as we fix Fekete's possible problem, PostgreSQL's serializability implementation will be sound again, if it is not already so.

Corollary 2 (for the Serializability Theorem): Along the way, we've proved that MySQL InnoDB (in particular its Read-Committed and Repeatable-Read isolation levels) satisfies all the conditions for the Serializability Theorem; hence we may apply it to MySQL InnoDB and achieve serializability.

MySQL InnoDB already comes with a serializability implementation, which is based on Two-Phase Locking, or 2PL. One can actually use the Serializability Theorem we've proved to show that 2PL is really serializable (a short proof is given in Appendix B). But even then we may still do better for some applications with our Serializability Theorem. That is because MySQL InnoDB's serializability implementation is of tuple-level granularity, and a field-level granularity implementation based on type B of the Serializability Theorem will have advantages over it for some workloads. However, since MySQL InnoDB is not the focus of this article, I will leave it to the interested audience; they should be able to get some hints about how to apply it from the next section, where we apply the Serializability Theorem to NDB Cluster.

Corollary 3 (for the Serializability Theorem): Except for a couple of conditions that require knowledge of the commit logic in NDB Cluster, we have proved that we may apply the Serializability Theorem to its Read-Committed isolation level and achieve serializability.

We will fill in the blanks for Corollary 3 in the next section.

Corollary 4 (for the Generalized Serializability Theorems): Corollaries 1, 2 & 3 apply to the Generalized Serializability Theorems too.

So we've proved three versions of the Serializability Theorem for three SQL implementations, PostgreSQL, MySQL InnoDB and NDB Cluster, as we promised, thanks to the powerful framework set up by Adya's paper. If this were just another research paper trying to prove a version of the Serializability Theorem, we would already be done, three times over. But what we are looking for is a practical solution to the serializability problem that is useful for many, if not most, of the sophisticated database application developers. To this end, we are just at square one, and a lot of work and testing is needed for that purpose. So in the next section, we will devise a way to apply these theorems to NDB Cluster so that histories executed within it will be serializable, and start the journey. Join me in the quest when you feel ready!

2. Application to NDB Cluster

In the process of applying the Serializability Theorem to NDB Cluster, I've consulted the PhD thesis of Mikael Ronstrom that NDB Cluster is developed upon, the manual (including the MySQL NDB Cluster API Developer's Guide) and part of the source code (the part about the commit logic), and asked individuals in Oracle about some of the implementation details, hoping I've got everything right. But it is still possible I've missed something in this process, since I am not a developer of NDB Cluster. So if you notice any mistake, please let me know; I truly hope all the efforts will eventually converge to a solution for the sophisticated database application developers.

Specifically I'd like to thank Frazer Clement from Oracle for his patient and detailed answers when I asked him about NDB Cluster's commit logic. What follows is partly based on those answers.

2.1 Commit logic of NDB Cluster

The commit mechanism in NDB Cluster is a variant of two-phase commit, a commit protocol commonly used in distributed transaction processing systems. In the two-phase commit protocol, data reads and writes in a transaction happen in different participating nodes across the system, and one node is chosen as the transaction coordinator (TC) to orchestrate this process. The transaction coordinator starts the two-phase commit protocol after all SQL statements are executed and the transaction is ready to commit. The protocol consists of two phases, as the name suggests. In the first phase (the voting phase), a message like 'are you ready to commit?' is sent out to each and every participant by the coordinator. When all replies are received by the coordinator, a 'commit' consensus is reached if all answers are affirmative; otherwise an 'abort' decision is made. This is the end of the first phase and the second phase starts right away. In the second phase, the coordinator sends out messages like 'commit now' or 'abort now' to all participants based on the decision made in the first phase. Upon receiving the message, participants perform the operations necessary for the commit (the so-called local commit) or abort request. After that, a message is sent back to the coordinator reporting this. Upon receiving all the reporting messages from the participants, the coordinator reports back to the system that the transaction is committed or aborted, and ends the execution of the two-phase commit protocol.
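The message flow just described can be sketched in a few lines. The Coordinator/Participant classes below are illustrative stand-ins for the protocol roles, not NDB Cluster code:

```python
# Minimal sketch of the two-phase commit protocol described above.
# Class and method names are illustrative only.

class Participant:
    def __init__(self, name, ready=True):
        self.name = name
        self.ready = ready      # will this node vote 'yes'?
        self.state = "active"

    def vote(self):
        # Phase 1: answer the coordinator's 'are you ready to commit?'
        return self.ready

    def finish(self, decision):
        # Phase 2: perform the local commit or abort and report back.
        self.state = decision
        return self.state

class Coordinator:
    def run_2pc(self, participants):
        # Phase 1 (voting): collect every participant's vote.
        decision = "commit" if all(p.vote() for p in participants) else "abort"
        # Phase 2: broadcast the decision and wait for all confirmations.
        for p in participants:
            p.finish(decision)
        return decision

nodes = [Participant("A"), Participant("B"), Participant("C", ready=False)]
print(Coordinator().run_2pc(nodes))  # prints 'abort': one 'no' vote vetoes the commit
```

A single negative vote in the first phase is enough to force a global abort, which is exactly the property the voting phase exists to guarantee.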

In NDB Cluster, the transaction coordinator role is assumed by a transaction coordinator block (also called TC) in the NDB kernel. A transaction consists of three phases: the prepare phase, the commit phase and the complete phase. The voting phase and part of the second phase of the two-phase commit, specifically the operations relevant to an abort decision, happen in the prepare phase. And the part of the second phase of the two-phase commit that is relevant to a commit decision is split into the commit phase and the complete phase in NDB Cluster. In other words, at the very end of the prepare phase, TC verifies whether the transaction is ready to commit. If the answer is 'no', it will abort the transaction and the latter two phases will never happen. If the answer is 'yes', the commit phase starts and is followed by the complete phase.

In the commit phase, TC sends out a commit request to each participant. Upon receiving the request, a participant starts its local commit process. If a tuple is updated in this participant, the write lock on the primary replica will be released. The new tuple version will be available both for non-locking reads as in the Read-Committed isolation level and for locking reads as in 'select … lock in share mode' or 'select … for update' right after the release of the lock on the primary replica. However, the write lock on the secondary replica is not released until the complete phase. After all the operations necessary for the local commit are performed, a confirmation message is sent back to TC to signify the end of the local commit. After all the confirmation messages are received by TC, it completes the commit phase by sending a commit acknowledgment (reporting its state as committed) to the API node that initiated this transaction.

Then the complete phase starts, with TC sending out a complete request to each participant. Upon receiving the request, a participant starts its local complete process. Among other operations, the write lock on the secondary replica will be released. After all the operations necessary for the local complete are performed, a confirmation message is sent back to TC to signify the end of the local complete. After all the confirmation messages are received by TC, the complete phase ends.
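The ordering facts we will rely on later (primary-replica lock release happens in the commit phase, before the commit acknowledgment; secondary-replica lock release happens in the complete phase) can be captured in a toy timeline. The event strings are illustrative, not NDB kernel events:

```python
# Simplified timeline of NDB Cluster's prepare/commit/complete phases,
# tracking only the lock-release ordering described in the text.

def run_transaction(log):
    log.append("prepare: all operations executed, TC verifies readiness")
    # Commit phase: the primary-replica write lock is released, making the
    # new version readable, and then TC acknowledges to the API node.
    log.append("commit: release write lock on primary replica")
    log.append("commit: TC reports 'committed' to the API node")
    # Complete phase: the secondary-replica write lock is released last.
    log.append("complete: release write lock on secondary replica")
    return log

events = run_transaction([])
assert events.index("commit: release write lock on primary replica") \
     < events.index("commit: TC reports 'committed' to the API node") \
     < events.index("complete: release write lock on secondary replica")
```

The assertion encodes the fact used repeatedly below: by the time the API node learns the transaction committed, the new version is already readable at the primary replica.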

2.2 Completing the proof of the Serializability Theorem for NDB Cluster

Now it is time to fill in the blanks we left out in the last section about the 'happened before' partial order and finish the proof of the Serializability Theorem.

First, the following was assumed in the discussion of a serial history in the last section: when a transaction reports its state back to the API node as committed, all of its modifications are available for reading. This is obvious from the commit logic of NDB Cluster.

Next, let's show that the partial order 'happened before' satisfies Conditions 1 & 2'.

In the Condition 1 case, for events that happen in TC and are ordered within the transaction, this order is preserved in the 'happened before' partial order by condition 1 of Example 6. For instance, the 'start of transaction' event 'happened before' the 'start of commit phase' event, which 'happened before' the 'end of commit phase' event, which in turn 'happened before' the 'start of complete phase' event, and so on. On the other hand, let e and e' be events in different SQL statements S and S' respectively, so that S is executed before S' in the sequential execution of the transaction. For example, e could be a write of a certain field and e' could be a read of another field. Events e and e' could happen in different nodes in general. But e 'happened before' the ending event of S since a message is sent back to TC notifying it of the write of the field; the ending event of S 'happened before' the starting event of S' because both events happen in TC; and the latter in turn 'happened before' event e' since a message is sent from TC to another node to request the read. That in a SQL statement like 'update t set col=col+1' the read of col 'happened before' the write of it can be argued similarly. This means NDB Cluster satisfies Condition 1.
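This message-chain argument can be illustrated with a Lamport-clock sketch, where a local event advances a node's clock and a message carries its timestamp to the receiver. The node names and the particular event sequence below are hypothetical:

```python
# Lamport-clock illustration of the chain e -> end(S) -> start(S') -> e'.
# A send always 'happened before' the matching receive, so timestamps
# strictly increase along any message chain.

clock = {}

def local_event(node):
    clock[node] = clock.get(node, 0) + 1
    return clock[node]

def send(src, dst):
    ts = local_event(src)                        # the send event
    clock[dst] = max(clock.get(dst, 0), ts) + 1  # the receive event
    return ts

e = local_event('node1')      # e: write of a field on node1
send('node1', 'TC')           # node1 notifies TC of the write
end_S = local_event('TC')     # ending event of statement S, in TC
start_S2 = local_event('TC')  # starting event of statement S', in TC
send('TC', 'node2')           # TC requests the read from node2
e2 = local_event('node2')     # e': read of another field on node2

# Timestamps increase along the 'happened before' chain.
assert e < end_S < start_S2 < e2
```

The strictly increasing timestamps mirror the claim in the text: even though e and e' occur on different nodes, they are ordered via the messages routed through TC.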

Remark: In a distributed system like NDB Cluster, the item reads/writes after a predicate read might overlap with each other in Condition 1, since they are usually executed by different threads on different nodes.

In the Condition 2' case, the fact that the creation of a data object version 'happened before' the reading of it is evident from the commit logic of NDB Cluster, even if a re-partition occurs, as in the case where a new node is added to the cluster.

We've already explained that Conditions 3' & 4 are both satisfied, while Conditions 5 & 6 are proscribed by NDB Cluster.

Remark: Condition 1 is actually not used in the proof of the Serializability Theorem, but it is heavily relied upon in the rest of this section. I keep it as a condition for the Serializability Theorem so that it is easy to compare with Adya's work. It might be moved to a place after the proof, like here.

So as long as we can devise a way to keep histories executed in NDB Cluster free of conflict loops, we may apply the Serializability Theorem to NDB Cluster and achieve serializability. At first glance, it might seem we may apply both type A and type B of the Serializability Theorem to NDB Cluster. But the following example suggests it might not be appropriate to apply type A to it.

Example 11: Consider the same situation as in Example 0, with id1 being the primary key and id2 being a secondary key. Execute the following sequence of statements in the specified order:


                     T1                                                                T2

        start transaction;

        update t set value=0 where id1 = 1;

                                                                                start transaction;

                                                                                select id1 from t where id2 = 3 lock in share mode;

	  

There will be no lock wait whatsoever between these two transactions. That is because the statement in T2 can be fulfilled within the secondary index on id2, so we only need to lock the corresponding tuple in that index, not the one in the primary index, while T1 only locks the tuple in the primary index. However, the two statements are considered to be conflicting by type A of the Serializability Theorem, since they operate on the same tuple and one of the operations is a write. This implies that if we applied type A of the Serializability Theorem here, we would need a new mechanism, other than the regular row locks provided by NDB Cluster, to prevent this conflict. This is certainly undesirable. Nevertheless, in type B of the Theorem, this pair of operations is not considered to be conflicting, since T1 is trying to update the value field while T2 doesn't access it. So for the rest of the discussion in this section, we will only apply type B of the Serializability Theorem unless otherwise specified.


	                                                                                                       ## 
	  

Remark: This example also applies to MySQL InnoDB, since it uses a similar locking mechanism when accessing data through indices. It doesn't apply to PostgreSQL's Snapshot Isolation though, since that doesn't use locks to control access.
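The difference between the two notions of conflict in Example 11 can be stated as two small predicates. The (kind, tuple, fields) encoding of operations below is an illustrative simplification:

```python
# Illustrative check of the two conflict notions in Example 11.
# An operation is (kind, tuple_id, fields), where kind is 'r' or 'w'.

def conflicts_type_A(op1, op2):
    # Type A: same tuple and at least one of the operations is a write.
    return op1[1] == op2[1] and 'w' in (op1[0], op2[0])

def conflicts_type_B(op1, op2):
    # Type B: same tuple, at least one write, AND the field sets overlap.
    return (op1[1] == op2[1] and 'w' in (op1[0], op2[0])
            and bool(op1[2] & op2[2]))

# T1 updates 'value' of tuple 1; T2 reads 'id1' via the id2 index.
t1_write = ('w', 1, {'value'})
t2_read  = ('r', 1, {'id1', 'id2'})

assert conflicts_type_A(t1_write, t2_read)      # type A: conflicting
assert not conflicts_type_B(t1_write, t2_read)  # type B: not conflicting
```

Under type B the pair from Example 11 is not a conflict, which matches the absence of any lock wait between the two transactions.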

2.3 Application of the Serializability Theorem to NDB Cluster

Before developing a systematic way to achieve serializability in NDB Cluster, we'd like to use an example to show that the Serializability Theorem is already useful.

Example 12: When the now-popular Hadoop cluster debuted, it offered a simple access pattern called 'Write once, Read many'. In this simple access pattern, a tuple is created, read many times via its primary key during its lifetime but never updated, and eventually it might be deleted. In each tuple there are only two fields: the primary key and the data field. In each transaction, only one tuple is accessed. This load pattern simulates that of a short blog application like Twitter and is very common in a NoSQL datastore.

To make things more interesting, in this example we will also allow updates to the tuple's data field after it's created, and each transaction can only update one tuple. In other words, a tuple is accessed via its primary key and the data field is updated in an update transaction. And a primary key is never updated. We are going to prove this load pattern is serializable under the Read-Committed isolation level of NDB Cluster. And if you recall that this is exactly what the Flexasynch benchmark simulates, we know that a Hadoop-like application is both consistent and fast when it is deployed on NDB Cluster.

Remark: Starting from 2015, ACID has been built into some Hadoop applications like HBase/Hive, Actian Vortex, etc. This example only represents the old Hadoop, before 2015. ACID compliance doesn't necessarily imply the highest level of consistency, though. So this example is still meaningful for today's Hadoop, beyond its significance for the purpose of this article.

For a tuple with a specific primary key, after it is deleted, a new incarnation of the tuple with the same primary key can be inserted into the table if the application allows primary key reuse. Since the primary key semantics dictates that at any time (time of the reference frame) only one tuple with a specific primary key exists in the NDB Cluster system, different incarnations of the tuple with the same primary key form a total order. We call this the Incarnation Order.

While we are on the topic, I want to emphasize a requirement of Adya's framework: different incarnations of a tuple with the same primary key are considered to be different tuples. This is a reasonable requirement, since two different incarnations of a tuple are just two different tuples which happen to use the same primary key. However, this requirement imposes extra challenges when we try to implement it in NDB Cluster, because the concept of an incarnation of a tuple with the same primary key doesn't exist in NDB Cluster. In particular, consider the following example in NDB Cluster, assuming a tuple with primary key k exists already:


                                 T1                                T2                                          T3 

        start transaction;

        select a tuple with primary key k;

                                                            start transaction;

                                                            delete the tuple in T1;

                                                            commit;

                                                                                                        start transaction;

                                                                                                        insert a tuple with the key k;

                                                                                                        commit;

        select a tuple with primary key k;

        commit;
	  

In this example, the second select in T1 returns a tuple with the same primary key as the first. If T1 is designed carelessly, it might consider these two to be the same tuple and make a mistake. Of course, one may design the application NOT to reuse primary keys in some cases. But if that is not possible, incarnation-related issues must be dealt with when we talk about conflicts. For example, the writing of version 3 certainly conflicts with the writing of version 4 in the same incarnation, but not with the writing of version 4 in the next incarnation. In a 'pessimistic technology' like NDB Cluster this is not a problem: as long as we fence off all the real conflicts, the history will be serializable due to the preventive approach we've taken, and we don't have to deal with the fake ones (unnecessary lock conflicts could result, though). In an optimistic approach, however, a fake conflict like this could in general lead to the abort of a transaction if the system doesn't recognize it. One way to recognize them is to attach a timestamp to each incarnation when it is created. For the previous example, one way to avoid the issue is NOT to read the same object twice in a transaction. Also notice that this history is not serializable, since there is a loop in it (there is a not-so-apparent predicate RW conflict between T2 and T3).
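The timestamp idea mentioned above can be sketched as follows. The (key, incarnation timestamp, version number) triple is an illustrative encoding, not an NDB Cluster structure:

```python
# Versions are identified by (primary_key, incarnation_ts, version_no),
# where incarnation_ts is the timestamp attached to the incarnation at
# creation. Two writes genuinely conflict only within one incarnation;
# a same-key write to a later incarnation is a 'fake' conflict.

def real_ww_conflict(v1, v2):
    key1, inc1, _ = v1
    key2, inc2, _ = v2
    # Same key AND same incarnation: a genuine WW conflict.
    return key1 == key2 and inc1 == inc2

w_v3      = ('k', 100, 3)  # version 3 of the incarnation created at ts 100
w_v4_same = ('k', 100, 4)  # version 4 of the same incarnation
w_v4_next = ('k', 200, 4)  # version 4 of the NEXT incarnation of key 'k'

assert real_ww_conflict(w_v3, w_v4_same)      # genuine conflict
assert not real_ww_conflict(w_v3, w_v4_next)  # fake conflict, filtered out
```

With such a tag, an optimistic scheduler could avoid aborting a transaction over a conflict that only exists because the primary key was reused.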

So as you might have guessed, we are going to prove that the load we are interested in only gives rise to serializable histories even if primary keys are reusable.

Proof: First, from the simple access pattern in this load, we observe that only statements with the same primary key can conflict with each other. So if a conflict loop existed in this load, all the reads, writes and predicate reads involved would be relevant to one specific primary key, since there is only a single SQL statement in each and every transaction. What's more, when the Incarnation Order is paired with the total order on the item versions of each incarnation, we have a total order on the item versions of all incarnations of the tuple with the same primary key. This total order is certainly consistent with the 'happened before' partial order. Since different incarnations with the same primary key are considered to be different tuples, operations on versions from different incarnations don't conflict with each other. Hence versions involved in a conflict must be of the same incarnation. Assuming a conflict loop existed in this load, we will prove the following

Property: For two consecutive writes w1 and w2 along the conflict loop, w1 'happened before' w2.

We also assume that w1 is in T1 and w2 is in T2. Here w1 and w2 could be the same operation. We break it down into two cases.

Case 1: w1 and w2 happen in two neighboring transactions in the conflict loop. In this case, there are three possibilities for the conflict between T1 and T2: a WW conflict, or, when a WW conflict is absent, a predicate RW conflict or a predicate WR conflict must be present. For a WW conflict, the fact that w1 'happened before' w2 is apparent. When a predicate RW conflict is present and a WW one is not, w2 is an insertion and w1 is a write from a previous incarnation, or w2 is a deletion and w1 is an update from the same incarnation. Either way, we have w1 'happened before' w2. When a predicate WR conflict is present and a WW one is not, if w1 is a deletion, then w2 is from an incarnation after w1's; on the other hand, if w1 is an insertion, w2 must be an update or the deletion from the same incarnation. Either way, we have w1 'happened before' w2 too.

Case 2: There is at least one transaction T between T1 and T2 in the asserted conflict loop. In fact, there must be exactly one transaction T between T1 and T2 in this case: since w1 and w2 are consecutive writes along the loop, any transaction strictly between T1 and T2 contains no write, and two adjacent read-only transactions cannot conflict with each other. So T may contain just a predicate read, or an item read may follow it. We name the conflict between T1 and T c1, and the conflict between T and T2 c2. Then there are three types for c1: 1. between a deletion and a predicate read; 2. between an insertion and a predicate read; 3. between an update (could be an insertion) and an item read. There are also three types for c2: 4. between a predicate read and an insertion; 5. between a predicate read and a deletion; 6. between an item read and an update (could be a deletion). Notice that sometimes two types of conflict may co-exist for either c1 or c2, but it doesn't change the correctness of the argument that follows. For all nine combinations of c1 and c2, one can verify that w1 'happened before' w2 always holds: if c1 is of type 1, since the deletion in T1 renders a dead version for the incarnation of c1, a conflict of type 4, 5 or 6 for c2 will have to be in a later incarnation; if c1 is of type 2, since the insertion turns an initial version into a visible one for the incarnation of c1, a conflict of type 4, 5 or 6 for c2 will have to be in a later incarnation; if c1 is of type 3, since the predicate read in T reads a visible version, a conflict of type 4, 5 or 6 for c2 will have to be in a later incarnation.

So the Property is proved. But it implies that if we walked along the conflict loop, we would get w1 'happened before' w1, whichever w1 we started with. This contradicts the fact that 'happened before' is a strict partial order. Hence an execution of the application doesn't contain a conflict loop, and it is serializable, since we've shown that all other conditions in type B of the Serializability Theorem are satisfied by NDB Cluster.


	                                                                                                       ## 
	  

For more complex applications in NDB Cluster, conflict loops are possible, and we will devise a systematic way to eliminate them. The following theorem suggests a possible direction.

Theorem 1: In NDB Cluster, if the DSG of a history H, DSG(H), contains only WR and WW conflicts, there are no conflict loops in H and it is serializable.

Before we prove Theorem 1, let's recap an important element of the NDB API in NDB Cluster. The NDB API accepts four different kinds of operations: primary key access, unique key access, full table scan, and range scan using an ordered index. So in the case where a SQL API node is used, all SQL statements are translated into the four kinds of operations we've just described. The following proof will be based on these operations, so that the Theorem applies whether you use the SQL API or the NDB API directly. This effort of finding a method that applies to both the SQL API and the NDB API will prevail throughout the rest of the section where applicable.

Proof: Assume any conflict considered is from T1 to T2. Let's first look at an item WR conflict. The read in T2 could be one of these provided by the SQL API of NDB Cluster:

  1. A read in the Read-Committed isolation level, as in a regular select, which corresponds to the LM_CommittedRead lock mode in the NDB API.
  2. A read under a shared lock, as in 'select … lock in share mode', which corresponds to the LM_Read lock mode in the NDB API.
  3. A read under an exclusive lock, as in 'select … for update', which corresponds to the LM_Exclusive lock mode in the NDB API.

In the current implementation of NDB Cluster, all three kinds of reads are routed to the primary replica, and the write in T1 is available to these reads right after the write lock is released on the primary replica, which happens in the commit phase of T1. In other words, the event of releasing the write lock on the primary replica 'happened before' the event of the item read, whichever kind of read it is.

So we get the following event chain in history H, based on the commit logic described at the beginning of this section:


        Event of starting the commit phase of T1 →

        Event of releasing the write lock on the primary replica for the write in T1 (since it happens in the commit phase) →

        Event of the item read in T2 →

        Event of starting the commit phase of T2 (since the start of the commit phase 'happened after' the end of the prepare phase, and the end of the prepare phase 'happened after' any operation in T2, including the item read).
	  

Here an arrow means that the two events connected by it are related in the 'happened before' partial order. We want to emphasize that this argument clearly applies to the 'decision set' case.

Similarly, for a WW conflict, there is the following event chain:


        Event of starting the commit phase of T1 →

        Event of releasing the write lock on the primary replica for the write in T1 →

        Event of acquiring the write lock on the primary replica for the write in T2 →

        Event of starting the commit phase of T2.
	  

The second arrow is sound because we required in the last section that the serial order of write events coincide with the 'happened before' partial order. Notice this argument is also correct for the 'decision set' case, since the locks are tuple-level ones.

Finally, let's take a look at a predicate WR conflict. The predicate read is the matching process of deciding which tuples to return for access. In a predicate read, each tuple is (theoretically) examined against the predicate to see if it will be selected. When the access relies on the primary key or a unique key, the identifying process is just the traversal along the relevant T-trees, which usually requires only a small amount of time; when the access relies on a scan, the identifying process could be a lengthy one. In general, the write in T1 did not necessarily 'happen before' a lengthy matching process in T2, but the release of the write lock on the written tuple 'happened before' the reading of a data object in that tuple, a 'decision set' version in the type B case, to evaluate the predicate for that specific tuple. This is true by Condition 2', the fact that the serial order of write events on the 'decision set' coincides with the 'happened before' partial order (since there could be multiple writes between the one that changes the match and the read), and the definition of a predicate-based conflict. So we have the following event chain:


        Event of starting the commit phase of T1 →

        Event of releasing the write lock on the primary replica of the write in T1 →

        Event of the read of a data object of the written tuple to evaluate the predicate in T2 →

        Event of starting the commit phase of T2.
	  

So all three types of conflict from T1 to T2 in H lead to:


        Event of starting the commit phase of T1 → Event of starting the commit phase of T2.
	  

If there were a conflict loop in H, there would have to be a transaction T such that:


        Event of starting the commit phase of T → Event of starting the commit phase of T.
	  

This immediately leads to a contradiction since the 'happened before' partial order is strict. So H is free of conflict loops and hence serializable by type B of the Serializability Theorem.
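The closing step of the proof can be mirrored in code: if every conflict edge respects the order of commit-phase start events, no conflict cycle can close. The timestamps and the DFS-based cycle check below are illustrative:

```python
# Sketch of Theorem 1's closing argument: every WR/WW conflict edge
# T1 -> T2 implies commit_start(T1) 'happened before' commit_start(T2),
# so a conflict cycle would require commit_start(T) < commit_start(T),
# which a strict order forbids.

def has_cycle(edges):
    # Plain depth-first search cycle detection on the conflict graph.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GRAY
        for v in graph.get(u, []):
            if color.get(v, WHITE) == GRAY:   # back edge: a cycle
                return True
            if color.get(v, WHITE) == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and dfs(u) for u in graph)

commit_start = {'T1': 10, 'T2': 20, 'T3': 30}   # illustrative timestamps
conflicts = [('T1', 'T2'), ('T2', 'T3')]        # only WR/WW edges

# Each edge goes forward in commit-phase order, so no cycle can close.
assert all(commit_start[a] < commit_start[b] for a, b in conflicts)
assert not has_cycle(conflicts)
assert has_cycle(conflicts + [('T3', 'T1')])    # a loop would break the order
```

The point of the sketch is only the implication: edges embedded in a strict order cannot form a loop, which is exactly how the contradiction above arises.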


	                                                                                                       ## 
	  

This proof hints that in a regular history H, if we could convert an RW conflict into a situation similar to those in Theorem 1, we would be able to achieve serializability.

For an item RW conflict between T1 and T2, all we have to do is place a shared read lock on the item read if one hasn't been placed yet. Then an argument similar to the WW conflict case will again lead to:


      Event of starting the commit phase of T1 → Event of starting the commit phase of T2.
	

In the NDB API, this means changing the lock mode from LM_CommittedRead to LM_Read; in the SQL API, changing a regular select to 'select ... lock in share mode' is sufficient.
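As a tiny sketch of this promotion rule (the READ_MODES table restates the SQL/NDB API pairs listed earlier in this section; promote() is a hypothetical helper, not an NDB API call):

```python
# Sketch of the item RW-conflict fix: promote an unlocked committed read
# to a shared-locked read. The statement/lock-mode pairs restate the list
# earlier in this section; promote() itself is illustrative.

READ_MODES = {
    'regular select':                'LM_CommittedRead',
    'select ... lock in share mode': 'LM_Read',
    'select ... for update':         'LM_Exclusive',
}

def promote(stmt):
    # An item RW conflict requires the read to hold at least a shared lock.
    if READ_MODES[stmt] == 'LM_CommittedRead':
        return 'select ... lock in share mode'
    return stmt  # already holds a shared or exclusive lock

assert READ_MODES[promote('regular select')] == 'LM_Read'
assert promote('select ... for update') == 'select ... for update'
```

Reads that already hold a shared or exclusive lock are left untouched; only the committed read needs upgrading.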

The predicate RW conflict case is, however, more complex. In general, an update of the 'decision set' could give rise to either an IN operation for one predicate or an OUT operation for another; an insertion, however, only leads to a possible IN operation.

Let's first look at the OUT operation case. It turns out OUT operations can also lead to predicate-based conflicts, although the common literature usually uses insertions, which are IN operations, for demonstration. The following example illustrates this.

Example 13: Consider the same situation as in Example 0, with id1 being the primary key and three rows in the table: (1,1,1,1), (3,3,3,3), (5,5,5,5). Execute the following sequence of statements in the specified order in NDB Cluster:


                            T1                                                                     T2

        start transaction;

        select COUNT(*) from t where id1 > 0 and id1 < 2;

                                                                                         start transaction;

                                                                                         delete from t where id1 = 1;

                                                                                         insert into t values(4,4,4,4);

                                                                                         commit;

        select COUNT(*) from t where id1 > 2 and id1 < 6;

        commit;

	  

Both transactions commit in this example. The COUNT() functions return 1 and 3, in that order, with the sum being 4. But a serial execution always returns a sum that equals 3. So there must be a conflict loop inside the DSG according to type B of the Serializability Theorem. Neither select in T1 is followed by item-based operations, so they are solely predicate reads. There is apparently a predicate RW conflict between the first select and the deletion in T2, and a predicate WR conflict between the insertion in T2 and the second select. And the deletion in the first conflict is an OUT operation.
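The arithmetic of Example 13 can be replayed on an in-memory model of the table; the dictionary-based sketch below is illustrative:

```python
# Replay of Example 13 on an in-memory table keyed by id1.
table = {1: (1, 1, 1, 1), 3: (3, 3, 3, 3), 5: (5, 5, 5, 5)}

count = lambda lo, hi: sum(1 for k in table if lo < k < hi)

# Interleaved execution, as in Example 13:
first = count(0, 2)        # T1's first select: sees id1 = 1 -> 1
del table[1]               # T2: delete (OUT operation for T1's first predicate)
table[4] = (4, 4, 4, 4)    # T2: insert (IN operation for T1's second predicate)
second = count(2, 6)       # T1's second select: sees 3, 4, 5 -> 3
assert first + second == 4  # a sum no serial order can produce

# Either serial order yields 3:
table = {1: (1, 1, 1, 1), 3: (3, 3, 3, 3), 5: (5, 5, 5, 5)}
assert count(0, 2) + count(2, 6) == 3   # T1 entirely before T2: 1 + 2
del table[1]; table[4] = (4, 4, 4, 4)
assert count(0, 2) + count(2, 6) == 3   # T2 entirely before T1: 0 + 3
```

The interleaved run observes the deletion through the first predicate but the insertion through the second, which is what the conflict loop in the DSG captures.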


	                                                                                                       ## 
	  

Handling OUT operations is relatively easy. Specifically, if the operations following the predicate read are reads in the Read-Committed isolation level, just change the lock mode from LM_CommittedRead to LM_Read in the NDB API, or change the statement to 'select ... lock in share mode' in the SQL API; if, on the other hand, these operations are writes, we need to do nothing, since the write locks will suffice.

We assume the conflict is from T1 to T2 as usual. According to type B of the Serializability Theorem, the write w in T2 that changes the match of the predicate is not necessarily the one that updates the version v used in the predicate read. Instead, a few writes that generate versions belonging to the same neighborhood of match change as v does may precede w. Let's name the sequence of transactions that generate these writes Ti through Tj. Since v is inside the match set of the predicate, we may just use the same argument as in the item RW conflict case to guarantee:


      Event of starting the commit phase of Ti → … 
    
      → Event of starting the commit phase of Tj
    
      → Event of starting the commit phase of T2.
	

Here the WW conflicts are between write events that generate the 'decision set' versions.

Eventually we have:


      Event of starting the commit phase of T1 → Event of starting the commit phase of T2.
	

Remark 1: In this approach, we use locks on the reads/writes following a predicate read to push away an OUT operation, so that the relevant transactions are more isolated. As explained in Remark 1 after Definition 4', the fact that another transaction can't inject such a write operation is the key to the argument's correctness.

Remark 2: At first glance, the method we've just developed may not seem to apply to a predicate read in a statement like 'select COUNT(*) from t where …' as in the previous example, since there are no reads/writes after the predicate read. It turns out that if we replace the first statement in T1 in Example 13 with 'select COUNT(*) from t where id1 > 0 and id1 < 2 lock in share mode;', T2 will hang at the deletion. That is because this select statement places long read locks on the matching tuples even if the only information accessed is the cardinality of the matching set. So the method applies to this case too.

Remark 3: There is a chance that after the placement of the lock, the predicate RW conflict vanishes. For example, the lock may separate the two transactions far enough that another statement in T1 performs an OUT operation before T2 gets a chance. We'll see an example of this kind in the TPCC benchmark later. The point is that once the lock is in place, whether the predicate RW conflict continues to exist or not, the issue is taken care of.

For an IN operation, things become harder because the pre-image of the update or insertion is not in the match set. MySQL InnoDB and PostgreSQL both handle this with the so-called 'gap lock', which is a special case of so-called 'granular locking'. An exposition of 'granular locking' can be found in the classic book [Gr 93].

However, the concept of a 'gap' becomes a global one in NDB Cluster, a distributed system. When a tuple is inserted into a table, it is hashed to a certain data node, so tuples in the same key locality may be processed by different data nodes. This parallelism strategy has been a major contribution to the awesome benchmark profile that NDB Cluster demonstrates. But it also implies that if we are looking at two neighboring tuples in the same data node with primary keys '4' and '7' respectively, we don't know what exactly the 'gap' is, since there may be a tuple with primary key '6' in some other node. To know exactly what it is, in principle we need to query every node to find out each local 'gap' and consolidate all of them into one. And while this process is happening, it'd be better if the 'gap' in each node is not altered. This sounds like it requires global synchronization, which is very expensive. The following theorem suggests we might be able to get around it.
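To make the difficulty concrete, here is a small Python sketch. The placement of keys onto nodes is contrived for illustration; NDB Cluster actually decides it by hashing the key. It shows that a 'gap' observed inside one data node need not be a global gap:

```python
# Contrived placement of primary keys onto data nodes. In NDB Cluster the
# node is chosen by hashing the key; the exact function doesn't matter here.
placement = {4: 0, 7: 0, 6: 1}  # key -> data node

def local_keys(node):
    """The keys one data node stores, in order -- all it can see locally."""
    return sorted(k for k, n in placement.items() if n == node)

# Node 0 stores keys 4 and 7, so locally they look like neighbors and the
# interval (4, 7) looks like a 'gap' ...
assert local_keys(0) == [4, 7]

# ... but key 6 lives on node 1, so (4, 7) is not a global gap. Finding the
# true gap requires consulting every node.
all_keys = sorted(k for node in (0, 1) for k in local_keys(node))
assert all_keys == [4, 6, 7]
```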

Theorem 2: Let T1 and T2 be transactions in history H. Further, let s1 be a statement in T1 and s2 be a statement in T2 such that s2 could give rise to an IN operation conflicting with s1's predicate read. A lock l1 is placed right before s1 in T1 and another lock l2 is placed at the end of T2, such that l1 conflicts with l2. For example, l1 could be a shared read lock on a tuple and l2 could be a write lock on the same tuple. Assuming a predicate-based conflict between s1 and s2 (hence between T1 and T2) exists, it is a predicate RW conflict if and only if l1 'happened before' l2, assuming that appropriate locks have been placed in the predicate read of s1 as in the OUT operation case.

Proof: (If) We prove it by contradiction. Let's assume that l1 → l2, but the conflict between s2 and s1 was a predicate WR conflict. Then the following event chain can be derived:


        T1 performing the predicate read →

        Release of l1 in T1 (Condition 1) →

        Acquisition of l2 in T2 →

        T2 releasing the write lock of the tuple in the primary replica (Condition 1) →

        Predicate read of the updated/inserted tuple in T1 (predicate WR conflict).
	  

Since in this chain the first event is a set of events that contains the last one, this immediately contradicts the fact that 'happened before' is strict, and the 'If' part of the Theorem is proved.

(Only If) Again we prove it by contradiction. Notice that l1 and l2 can't be overlapping events since they represent conflicting locks. So we may assume the conflict between s1 and s2 to be a predicate RW conflict while l2 → l1.

The write in T2 could be either an insert or an update, and either case readily implies the following event chain:


        Acquisition of write lock for the updated/inserted tuple in T2 → 

        Release of l2 in T2 (Condition 1) →

        Acquisition of l1 in T1 →

        T1 performing the predicate read (Condition 1).
	  

Since we assumed that appropriate locks had been placed in the predicate read of s1 as in the OUT operation case, the NDB Cluster implementation acquires short locks in the last event when identifying which tuples could be a match, even in the case where no item reads/writes follow the predicate read. The lock conflict between the first and the last event in this chain then implies that the conflict between T2 and T1 has to be a predicate WR one. This lock conflict from the NDB Cluster implementation is demonstrated in the following Example 14.

This contradicts our assumption of a predicate RW conflict, so l1 'happened before' l2.


	                                                                                                       ## 
	  

Remark: The placement of the conflicting lock pair might get rid of the predicate RW conflict as in the OUT operation case too. In that case, the issue is resolved automatically and the potential predicate-based conflict is replaced with an item-based one on the 'decision set'.

Example 14: Consider the same situation as in Example 13. Execute the following sequence of statements in the specified order:


                     T1                                                           T2

                                                                           start transaction;

                                                                           insert into t values (2,2,2,2);

        start transaction;

        select * from t where id1 > 0 and id1 < 4 lock in share mode;
	  

Transaction T1 hangs because of a lock conflict at tuple (2,2,2,2). Also notice that if we execute the statement 'update t set id1 = 2 where id1 = 5' in T2, we will observe the same behavior. If we change the select statement in T1 to 'select COUNT(*) from t where id1 > 0 and id1 < 4 lock in share mode;', we will observe the same behavior too. On the other hand, if we remove 'lock in share mode' in all the cases we just mentioned, the lock wait will disappear. This is good enough for the proof of the 'Only If' part of Theorem 2.


	                                                                                                       ## 
	  

The 'only if' part of Theorem 2 suggests that for each IN operation induced predicate RW conflict, as long as we place conflicting locks l1 and l2 in the specified places, we have l1 'happened before' l2. This further implies the following relation in the 'happened before' partial order by an argument as in Theorem 1:


      Event of starting the commit phase of T1 → Event of starting the commit phase of T2.                                                               
	

For a Read-Only transaction, the event of starting its commit phase may be chosen to be one that 'happened after' the execution of its SQL statements (hence 'happened after' any lock acquiring event) and 'happened before' any lock releasing event, since two-phase locking is employed by NDB Cluster.

So given a history H in NDB Cluster, for both item and predicate RW conflict between T1 and T2 in H, we've devised a way so that the following condition is satisfied:

Condition *: Event of starting the commit phase of T1 → Event of starting the commit phase of T2.

For WR and WW conflicts in H, Condition * is also satisfied as in Theorem 1. This implies that if there were a conflict loop in H, it would lead to the conclusion that 'happened before' was not strict. This is impossible, so H is serializable by the Serializability Theorem. So we have proved the following:

Theorem 3: As long as we place prevention measures as described for all possible RW conflicts in an application executed under NDB Cluster's Read-Committed isolation level, the generated history will be serializable.
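The strictness argument behind Theorem 3 can be sketched in a few lines of Python. This is a toy model, not NDB Cluster code: commit-phase start events are represented by plain integers standing in for the 'happened before' order, and the DSG is a list of conflict edges.

```python
def violates_condition_star(edges, commit_start):
    """True if some conflict edge T1 -> T2 fails Condition *."""
    return any(commit_start[a] >= commit_start[b] for a, b in edges)

def has_conflict_loop(edges):
    """Tiny DFS cycle check over the DSG (a directed multigraph)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    visiting, done = set(), set()
    def dfs(v):
        if v in done:
            return False
        if v in visiting:
            return True
        visiting.add(v)
        found = any(dfs(w) for w in graph.get(v, []))
        visiting.discard(v)
        done.add(v)
        return found
    return any(dfs(v) for v in list(graph))

edges = [("T1", "T2"), ("T2", "T3")]
starts = {"T1": 1, "T2": 2, "T3": 3}
assert not violates_condition_star(edges, starts)
assert not has_conflict_loop(edges)

# Closing the loop with T3 -> T1 would demand starts["T3"] < starts["T1"],
# contradicting T1 < T2 < T3 -- exactly the strictness argument above.
looped = edges + [("T3", "T1")]
assert violates_condition_star(looped, starts)
assert has_conflict_loop(looped)
```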

One optimization goal of this article is to minimize locking time. The position of l2 is of course optimal, but how about l1? The following example shows that the position of l1 is optimal as well: it cannot be placed any later.

Example 15: Assuming everything as in Example 13. Execute the following sequence of statements in the specified order:


	                       T1                                                            T2

        start transaction;                                                    

        select * from t where id1 = 2 lock in share mode;

                                                                                start transaction;

                                                                                insert into t values(2,2,2,2);

                                                                                select * from t where id1 = 3 for update;

                                                                                commit;

        select * from t where id1 = 3 lock in share mode;

        commit;
	  

The read and write locks on the tuple with id1 = 3 serve as l1 and l2 respectively in this example. A predicate RW conflict is present: it is between the first statement of T1 and that of T2, but l1 'happened after' l2.


	                                                                                                       ## 
	  

So the location of l1 can't be moved downward, but can it be moved upward? The answer is yes, for both l1 and l2. Actually the following Theorem 2' can be proved.

Theorem 2': The conclusion of Theorem 2 still holds if we move the position of either/both locks upward in the containing transaction.

Proof: Similar to the proof of Theorem 2.


	                                                                                                       ## 
	  

2.4 Applying the method to other Read-Committed implementations

In this development of a systematic method of applying type B of the Serializability Theorem to NDB Cluster, the choice of the time point of commit is crucial. For the arguments to go through, it must be chosen to be after all SQL or NDB API statements are executed, but before any long lock is released. If a Read-Committed implementation also uses a variant of two-phase locking, but without acquiring locks for every read like NDB Cluster does, the starting point of the second phase naturally serves this purpose. This implies we may also apply the method to MySQL InnoDB's Read-Committed and Repeatable-Read isolation levels to achieve serializability.

A lock-free system like PostgreSQL's Snapshot Isolation has no locks, so the previous argument can't be applied. In this case, we associate each transaction T with its timestamp of commit t (let's say even a Read-Only transaction commits, contrary to some Snapshot Isolation implementations). Then a WR or WW conflict between T1 and T2 implies t1 < t2, since the two transactions can't be concurrent. For a RW conflict, materialization or promotion can be used to push two concurrent transactions apart so that t1 < t2 remains true. This way a conflict loop cannot form, since it would otherwise imply a loop of commit timestamps, which contradicts the strictness of the temporal order. Notice that for this argument to go through, the Monotonicity Condition must hold, namely, clocks can't tick backward.
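A minimal Python model of this argument follows. The transaction lifespans are illustrative numbers, not PostgreSQL internals:

```python
def concurrent(t1, t2):
    """Two SI transactions are concurrent iff their lifespans overlap."""
    return t1["begin"] < t2["commit"] and t2["begin"] < t1["commit"]

# A WR or WW conflict from T1 to T2 means T2 began after T1 committed,
# so the two are not concurrent and t1 < t2 follows directly:
T1 = {"begin": 1, "commit": 4}
T2 = {"begin": 5, "commit": 8}
assert not concurrent(T1, T2)
assert T1["commit"] < T2["commit"]

# A RW conflict is the one case where the two may be concurrent:
T3 = {"begin": 2, "commit": 6}
assert concurrent(T1, T3)
# Materialization or promotion turns the RW conflict into a WW conflict on
# some common item, which prevents both from committing as-is and thereby
# restores t1 < t2 for the surviving pair.
```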

2.5 An example: the TPCC Benchmark

Next we will apply Theorem 3 to the application in the TPCC benchmark to achieve serializability. TPCC is examined in detail in [Fe 05], where the transactions and the conflicts between them are represented as a DSG. Although we call the DSG a graph, it is really a multi-graph, since there may be more than one conflict between any two transactions. [Fe 05] identifies and eliminates the so-called 'dangerous structure', two consecutive RW conflicts in the DSG, to achieve serializability by aborting a participating conflicting transaction.

In the Read-Committed isolation implemented by NDB Cluster, the story is very different. In general we need to place prevention measures for the RW conflicts in the DSG one by one, because in a pessimistic system like NDB Cluster we use locks instead of transaction aborts to prevent conflicts, as we've just described. It turns out, however, that we don't need to impose prevention measures for every one of those conflicts, as we will see in the following

Example 16 (TPCC): In this example the TPCC benchmark, revision 5.11.0, is examined. Inside a transaction, variables that store selected fields and parameters passed in are prefixed with a colon, like :w_id. Since the purpose is to analyze conflicts between the five transactions, I skip most of the computations for these variables; mostly only the SQL statements and branching statements are retained. For cursor statements, I only retain those related to identifying which data objects are to be accessed. The computations are all deterministic functions of variables and parameters. Pseudo-random numbers are not present. Datetime values returned from the underlying OS, however, do show up in the application. I also rewrite the join statements as semantically equivalent statements; more about join statements will be discussed later. Inside the TPCC benchmark specification, there is a description of the five transactions and there is also a definition of them in a SQL dialect. Sometimes there are discrepancies between the two. Whenever this happens, I defer to the description unless specified otherwise. Here follow the five transactions.


        The New-Order transaction(Read-Write):

        select w_tax from warehouse where w_id=:w_id;

        select d_tax from district where d_id=:d_id and d_w_id=:w_id;

        select d_next_o_id from district where d_id=:d_id and d_w_id=:w_id;

        update district set d_next_o_id=:d_next_o_id+1 where d_id=:d_id and d_w_id=:w_id;

        select c_discount, c_last, c_credit from customer 
           where c_w_id=:w_id and c_d_id=:c_d_id and c_id=:c_id;

        insert into orders values(:o_id, :d_id, :w_id, :c_id, :datetime, NULL, :o_ol_cnt, :o_all_local);

        insert into new_order values(:o_id, :d_id, :w_id);

        And for each item we're going to order, we execute the following block of statements in a for loop:

        {    
            select i_price, i_name, i_data from item where i_id=:ol_i_id;

            select s_quantity, s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04,
               s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10
               from stock where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_quantity=:s_quantity where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_ytd=s_ytd+:ol_quantity
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_order_cnt=s_order_cnt+1
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            if (:ol_supply_w_id !=:w_id) {
               update stock set s_remote_cnt=s_remote_cnt+1
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;
            }

            insert into order_line values(:o_id, :d_id, :w_id, :ol_number, :ol_i_id,
               :ol_supply_w_id, NULL, :ol_quantity, :ol_amount, :ol_dist_info);
        }

        The Payment transaction(Read-Write):

        select w_street_1, w_street_2, w_city, w_state, w_zip, w_name 
           from warehouse where w_id=:w_id;

        update warehouse set w_ytd=w_ytd+:h_amount where w_id=:w_id;

        select d_street_1, d_street_2, d_city, d_state, d_zip, d_name 
           from district where d_w_id=:w_id and d_id=:d_id;

        update district set d_ytd=d_ytd+:h_amount where d_w_id=:w_id and d_id=:d_id;

        if the customer making the payment is represented by a name{          //60% chances

            select count(c_id) into :namecnt from customer 
               where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

            declare c_byname cursor for 
                select c_first, c_middle, c_id, c_street_1, c_street_2, c_city, c_state, 
                   c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
                   from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
                   order by c_first;

            open c_byname;

            if(:namecnt%2) :namecnt++;
            for (n=0;n < :namecnt/2;n++) {
                fetch c_byname
                   into :c_first, :c_middle, :c_id, :c_street_1, :c_street_2, :c_city, :c_state, 
                   :c_zip, :c_phone, :c_credit, :c_credit_lim, :c_discount, :c_balance, :c_since
            }

            close c_byname;
        }
        else if the customer making the payment is represented by an id{      //40% chances

            select c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, 
               c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
               from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;
        }

        update customer set c_balance=c_balance-:h_amount
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_ytd_payment=c_ytd_payment+:h_amount
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_payment_cnt=c_payment_cnt+1
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        if (:c_credit="BC"){

            select c_data from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

            update customer set c_data=:c_data
               where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;
        }    

        insert into history values(:c_d_id, :c_w_id, :c_id, :d_id, :w_id, :datetime, :h_amount, :h_data);

        The Order-Status transaction(Read-Only):

        if the customer in the order is represented by a name{                         //60% chances

            select count(c_id) into :namecnt from customer 
               where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

            declare c_name cursor for 
               select c_balance, c_first, c_middle, c_id from customer
                  where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
                  order by c_first;

            open c_name;

            if (:namecnt%2) :namecnt++;
            for (n=0;n < :namecnt/2;n++) {
                fetch c_name
                   into :c_balance, :c_first, :c_middle, :c_id;
            }

            close c_name;
        } 
        else if the customer in the order is represented by an id{                    //40% chances

            select c_balance, c_first, c_middle, c_last from customer
               where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;
        }
    
        declare c_order cursor for
           select o_id, o_carrier_id, o_entry_d from orders 
              where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
              order by o_id desc;

        open c_order;

        fetch c_order 
           into :o_id, :o_carrier_id, :o_entry_d;

        close c_order;

        select ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_delivery_d from order_line
           where ol_d_id=:c_d_id and ol_w_id=:c_w_id and ol_o_id=:o_id;

        The Delivery transaction(Read-Write):
		
        declare c_no cursor for
           select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
              order by no_o_id asc;

        open c_no;
 
        fetch c_no into :no_o_id;

        close c_no;

        if the former cursor returns a non-empty set{

            delete from new_order where no_d_id=:d_id and no_w_id=:w_id and no_o_id=:no_o_id;

            //In TPCC's delivery transaction, the previous two statements are expressed as an
            //updatable cursor. Since NDB Cluster doesn't support updatable cursors, I've changed
            //it to the previous two statements, which are semantically equivalent.

            select o_c_id from orders where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;

            update orders set o_carrier_id=:o_carrier_id
               where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;

            update order_line set ol_delivery_d=:datetime
               where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;

            select sum(ol_amount) from order_line 
               where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;

            update customer set c_balance=c_balance+:ol_total
               where c_id=:c_id and c_d_id=:d_id and c_w_id=:w_id;

            update customer set c_delivery_cnt=c_delivery_cnt+1
               where c_id=:c_id and c_d_id=:d_id and c_w_id=:w_id;
        }

        The Stock-Level transaction(Read-Only):

        select d_next_o_id from district where d_w_id=:w_id and d_id=:d_id; 

        select distinct(ol_i_id) from order_line 
           where ol_w_id=:w_id and ol_d_id=:d_id and ol_o_id<:o_id and ol_o_id>=:o_id-20;

        for each ol_i_id obtained in the last statement, assuming it is stored in an array cell :ol_i_id[i] {

            select count(s_i_id) from stock 
               where s_i_id=:ol_i_id[i] and s_w_id=:w_id and s_quantity<:threshold;
        }                                                                                                 
	  

The following tables summarize the reads, writes and predicate reads in these five transactions as in [Fe 05], with a few entries missing there supplemented here:

District Table PK:(D_W_ID, D_ID)
Field        | PR              | R          | W
D_ID         | SLEV, NEWO, PAY |            |
D_W_ID       | SLEV, NEWO, PAY |            |
D_NAME       |                 | PAY        |
D_STREET_1   |                 | PAY        |
D_STREET_2   |                 | PAY        |
D_CITY       |                 | PAY        |
D_STATE      |                 | PAY        |
D_ZIP        |                 | PAY        |
D_TAX        |                 | NEWO       |
D_YTD        |                 | PAY        | PAY
D_NEXT_O_ID  |                 | SLEV, NEWO | NEWO

	  
	  
Customer Table PK:(C_W_ID, C_D_ID, C_ID)
Field          | PR                      | R                 | W
C_ID           | OSTAT, NEWO, PAY, DLVY2 | OSTAT, PAY        |
C_D_ID         | OSTAT, NEWO, PAY, DLVY2 |                   |
C_W_ID         | OSTAT, NEWO, PAY, DLVY2 |                   |
C_FIRST        |                         | OSTAT, PAY        |
C_MIDDLE       |                         | OSTAT, PAY        |
C_LAST         | OSTAT, PAY              | OSTAT, PAY, NEWO  |
C_STREET_1     |                         | PAY               |
C_STREET_2     |                         | PAY               |
C_CITY         |                         | PAY               |
C_STATE        |                         | PAY               |
C_ZIP          |                         | PAY               |
C_PHONE        |                         | PAY               |
C_SINCE        |                         | PAY               |
C_CREDIT       |                         | PAY, NEWO         |
C_CREDIT_LIM   |                         | PAY               |
C_DISCOUNT     |                         | PAY, NEWO         |
C_BALANCE      |                         | PAY, OSTAT, DLVY2 | PAY, DLVY2
C_YTD_PAYMENT  |                         | PAY               | PAY
C_PAYMENT_CNT  |                         | PAY               | PAY
C_DELIVERY_CNT |                         | DLVY2             | DLVY2
C_DATA         |                         | PAY               | PAY

	  
	  
New-Order Table PK:(NO_W_ID, NO_D_ID, NO_O_ID)
Field    | PR           | R     | W
NO_O_ID  | DLVY2        | DLVY2 | NEWO(I), DLVY2(D)
NO_D_ID  | DLVY1, DLVY2 |       | NEWO(I), DLVY2(D)
NO_W_ID  | DLVY1, DLVY2 |       | NEWO(I), DLVY2(D)

	  
	  
Orders Table PK:(O_W_ID, O_D_ID, O_ID)
Field         | PR           | R     | W
O_ID          | DLVY2        | OSTAT | NEWO(I)
O_D_ID        | DLVY2, OSTAT |       | NEWO(I)
O_W_ID        | DLVY2, OSTAT |       | NEWO(I)
O_C_ID        | OSTAT        | DLVY2 | NEWO(I)
O_ENTRY_D     |              | OSTAT | NEWO(I)
O_CARRIER_ID  |              | OSTAT | NEWO(I), DLVY2
O_OL_CNT      |              |       | NEWO(I)
O_ALL_LOCAL   |              |       | NEWO(I)

	  
	  
Order-Line Table PK:(OL_W_ID, OL_D_ID, OL_O_ID, OL_NUMBER)
Field           | PR                 | R            | W
OL_O_ID         | DLVY2, SLEV, OSTAT |              | NEWO(I)
OL_D_ID         | DLVY2, SLEV, OSTAT |              | NEWO(I)
OL_W_ID         | DLVY2, SLEV, OSTAT |              | NEWO(I)
OL_NUMBER       |                    |              | NEWO(I)
OL_I_ID         |                    | SLEV, OSTAT  | NEWO(I)
OL_SUPPLY_W_ID  |                    | OSTAT        | NEWO(I)
OL_DELIVERY_D   |                    | OSTAT        | NEWO(I), DLVY2
OL_QUANTITY     |                    | OSTAT        | NEWO(I)
OL_AMOUNT       |                    | OSTAT, DLVY2 | NEWO(I)
OL_DIST_INFO    |                    |              | NEWO(I)

	  
	  
Warehouse Table PK:(W_ID)
Field       | PR        | R    | W
W_ID        | NEWO, PAY |      |
W_NAME      |           | PAY  |
W_STREET_1  |           | PAY  |
W_STREET_2  |           | PAY  |
W_CITY      |           | PAY  |
W_STATE     |           | PAY  |
W_ZIP       |           | PAY  |
W_TAX       |           | NEWO |
W_YTD       |           | PAY  | PAY

	  
	  
Item Table PK:(I_ID)
Field    | PR   | R    | W
I_ID     | NEWO |      |
I_IM_ID  |      |      |
I_NAME   |      | NEWO |
I_PRICE  |      | NEWO |
I_DATA   |      | NEWO |

	  
	  
Stock Table PK:(S_W_ID, S_I_ID)
Field         | PR         | R    | W
S_I_ID        | NEWO, SLEV |      |
S_W_ID        | NEWO, SLEV |      |
S_QUANTITY    | SLEV       | NEWO | NEWO
S_DIST_01     |            | NEWO |
S_DIST_02     |            | NEWO |
S_DIST_03     |            | NEWO |
S_DIST_04     |            | NEWO |
S_DIST_05     |            | NEWO |
S_DIST_06     |            | NEWO |
S_DIST_07     |            | NEWO |
S_DIST_08     |            | NEWO |
S_DIST_09     |            | NEWO |
S_DIST_10     |            | NEWO |
S_YTD         |            | NEWO | NEWO
S_ORDER_CNT   |            | NEWO | NEWO
S_REMOTE_CNT  |            | NEWO | NEWO
S_DATA        |            | NEWO |

	  
	  
History Table
Field     | PR | R | W
H_C_ID    |    |   | PAY(I)
H_C_D_ID  |    |   | PAY(I)
H_C_W_ID  |    |   | PAY(I)
H_D_ID    |    |   | PAY(I)
H_W_ID    |    |   | PAY(I)
H_DATE    |    |   | PAY(I)
H_AMOUNT  |    |   | PAY(I)
H_DATA    |    |   | PAY(I)

Theorem 3 tells us that, in the Read-Committed isolation level of NDB Cluster, only RW conflicts matter. From the previous tables, we may analyze them one by one. We use initials to represent the transactions in this process. So s means the stock-level transaction and d1 means the first execution path of the delivery transaction and so on. Among all the RW conflicts, those involving predicate reads are the most complicated and interesting ones. We will analyze them thoroughly and will start with one between the order-status and new-order transactions.

o->n:

Fields the conflict is incident upon | Comments
O_W_ID, O_D_ID, O_C_ID               | predicate-based
O_ID                                 | item-based, not a conflict
O_ENTRY_D                            | item-based, not a conflict
O_CARRIER_ID                         | item-based, not a conflict
OL_W_ID, OL_D_ID, OL_O_ID            | predicate-based, not a conflict
OL_I_ID                              | item-based, not a conflict
OL_SUPPLY_W_ID                       | item-based, not a conflict
OL_DELIVERY_D                        | item-based, not a conflict
OL_QUANTITY                          | item-based, not a conflict
OL_AMOUNT                            | item-based, not a conflict

Let's take a closer look at the first possible conflict. For the statement that causes this conflict, the description of the order-status transaction in the TPCC specification says: the row in the ORDER table with matching O_W_ID (equals C_W_ID), O_D_ID (equals C_D_ID), O_C_ID (equals C_ID), and with the largest existing O_ID, is selected. The transaction provided by TPCC implements this with the following statement in a cursor:


        select o_id, o_carrier_id, o_entry_d from orders 
           where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
           order by o_id desc;                                                                                            
	  

Once this cursor is opened, the first tuple (the one with the largest O_ID) is fetched. So according to Theorem 2, all we have to do is to place a read lock before this statement and a corresponding write lock at the end of the new-order transaction, such that when the RW conflict is observed, the two transactions are still reasonably isolated.

We create a table named t_o_n which consists of three fields: O_W_ID, O_D_ID and O_C_ID. For each customer in the database, a tuple is inserted into t_o_n when he/she becomes a member of the system and deleted from t_o_n when he/she is no longer a customer. The following statement:


        select * from t_o_n 
           where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
           lock in share mode;                                                                                                     
	  

is placed in the order-status transaction before the target statement:


        select o_id, o_carrier_id, o_entry_d from orders 
           where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
           order by o_id desc;                                                                                                      
	  

And the following statement is placed at the end of the new-order transaction:


        select * from t_o_n where o_d_id=:d_id and o_w_id=:w_id and o_c_id=:c_id for update;
	  

This is for the following target statement:


        insert into orders values(:o_id, :d_id, :w_id, :c_id, :datetime, NULL, :o_ol_cnt, :o_all_local);
	  

For the target insertion to cause a conflict, it must share the same O_D_ID, O_W_ID and O_C_ID values with the target select. And the two locks placed conflict with each other if and only if the two statements share the same O_D_ID, O_W_ID and O_C_ID values. Hence Theorem 2 applies. Notice we can't go one step down in terms of granularity here, since 'the largest existing O_ID' is not a precise value. But one lock per customer seems fine-grained enough.
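The correspondence Theorem 2 relies on here can be stated as a tiny check. This is a sketch, and the concrete key values are hypothetical:

```python
# Both the lock pair on t_o_n and the statement pair on orders are keyed by
# the same triple (O_W_ID, O_D_ID, O_C_ID), so the locks conflict exactly
# when the statements can.
def locks_conflict(read_lock_key, write_lock_key):
    return read_lock_key == write_lock_key      # same tuple of t_o_n

def statements_may_conflict(select_key, insert_key):
    return select_key == insert_key             # same warehouse/district/customer

same, other = (1, 5, 42), (1, 5, 43)
assert locks_conflict(same, same) and statements_may_conflict(same, same)
assert not locks_conflict(same, other) and not statements_may_conflict(same, other)
```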

Remark 1: In Theorem 2, we only require l1 to conflict with l2. But if we use the update lock l2 as suggested, l2 also conflicts with another write lock on the same tuple in t_o_n. This is an undesirable side-effect and we don't have a very good way to get around it in this ad-hoc second-tier solution. We will address it when we develop a third-tier solution for types C and D of the Serializability Theorem.

Remark 2: We could also create table t_o_n with just O_W_ID and O_D_ID as columns and Theorem 2 would still apply. But if we were to hold such a monolithic read lock whenever an order-status transaction executes and the corresponding write lock whenever a new-order transaction executes, it could generate a lot more lock conflicts and the flight of performance would immediately crash into geek mountain. It is our responsibility to choose the finest lock granularity that can do the job. Whenever the 'decision set' demonstrates a hierarchical structure like this (a warehouse, a customer in it and a feature of the customer) and the higher levels come with exact values, it is always easy: we just choose all the levels above the last one. All the predicate-based conflicts in this example are like this, since TPCC is well-structured. If we cannot pick all the higher levels, we might need to choose only some of them, and this will certainly lead to more lock conflicts.

The second, third and fourth possible conflicts are not real since an item can't be read before it is inserted.

For the fifth possible conflict, the predicate read reads the order lines with OL_O_ID equal to the O_ID we've read in the previous select. The order lines inserted by the new-order transaction have a larger O_ID than those already in the database, since O_ID is an increasing function. Hence the insertion doesn't change the match of the predicate read and it is not a conflict.
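The reasoning above can be checked mechanically on toy data (the O_ID values are made up):

```python
# The predicate read matches order lines whose OL_O_ID equals the O_ID
# fetched from the cursor; a concurrent new-order insert always uses a
# strictly larger O_ID, so the match set cannot change.
def match(order_lines, o_id_read):
    return [ol for ol in order_lines if ol["ol_o_id"] == o_id_read]

existing = [{"ol_o_id": 10}, {"ol_o_id": 11}]
o_id_read = 11                       # the largest existing O_ID at read time

before = match(existing, o_id_read)
after = match(existing + [{"ol_o_id": 12}], o_id_read)  # insert with next O_ID
assert before == after               # no IN operation, hence no conflict
```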

The sixth, seventh, eighth, ninth and tenth possible conflicts are not real since an item can't be read before it is inserted.

s->n:

Fields the conflict is incident upon | Comments
D_NEXT_O_ID                          | item-based
OL_I_ID                              | item-based, not a conflict
OL_W_ID, OL_D_ID, OL_O_ID            | predicate-based, not a conflict
S_W_ID, S_I_ID, S_QUANTITY           | predicate-based

The first conflict occurs at the field D_NEXT_O_ID and it is an item-based conflict. So all we have to do is to turn the first select statement into a locking read:


        select d_next_o_id from district where d_w_id=:w_id and d_id=:d_id lock in share mode;
	  

In the new-order transaction, the write lock accompanying the write on D_NEXT_O_ID will do the job automatically.

The second possible conflict is not real since an item can't be read before it is inserted.

The third possible conflict is a predicate-based one incident on three fields: OL_W_ID, OL_D_ID, OL_O_ID. It is not a conflict either since the insertion in the new-order transaction has a bigger OL_O_ID than the predicate read covers and hence it cannot be an IN operation.

The fourth possible conflict is a predicate-based one incident on three fields: S_W_ID, S_I_ID, S_QUANTITY. The write for this conflict is an update, so it could be either an IN or an OUT operation. For an IN operation, we use granular locking to block it out, by creating a table t_s_n with fields S_W_ID and S_I_ID. A tuple is inserted into it whenever an item is inserted into the Stock table and deleted whenever the corresponding item is removed from the Stock table. And the following statement:


        select * from t_s_n
           where s_i_id=:ol_i_id[i] and s_w_id=:w_id
           lock in share mode;                                                                                                   
	  

is placed before the target statement:


        select count(s_i_id) from stock 
           where s_i_id=:ol_i_id[i] and s_w_id=:w_id and s_quantity < :threshold;
	  

And we place the following statement at the end of the new-order transaction:


        select * from t_s_n where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id for update;
	  

This is for the following target statement:


        update stock set s_quantity=:s_quantity where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;
	  

If the target update were to cause a conflict, it would have to share the same S_W_ID and S_I_ID values as the target select; and the two locks placed conflict with each other if and only if both statements share the same S_W_ID and S_I_ID values. Hence Theorem 2 applies here.
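The compatibility rule behind this granular-locking scheme can be sketched as follows. This is a hypothetical illustration in Python, not NDB Cluster code: two lock requests on t_s_n conflict exactly when they name the same (S_W_ID, S_I_ID) pair and at least one of them is exclusive.

```python
# Minimal sketch of the granular-locking rule used with t_s_n: a shared
# lock (the stock-level select) and an exclusive lock (the new-order
# path) conflict iff they cover the same (s_w_id, s_i_id) key.
# Key values below are illustrative only.

def locks_conflict(req_a, req_b):
    """Each request is (key, mode) with mode 'S' (share) or 'X' (exclusive)."""
    (key_a, mode_a), (key_b, mode_b) = req_a, req_b
    if key_a != key_b:                     # different (s_w_id, s_i_id): no conflict
        return False
    return mode_a == 'X' or mode_b == 'X'  # S/S on the same key is compatible

# The stock-level transaction's share lock ...
stock_level = ((1, 42), 'S')    # s_w_id=1, s_i_id=42
# ... conflicts with the new-order transaction's exclusive lock on the
# same key, but not with a lock on a different item:
new_order  = ((1, 42), 'X')
other_item = ((1, 43), 'X')

assert locks_conflict(stock_level, new_order)       # same key: blocked
assert not locks_conflict(stock_level, other_item)  # different key: concurrent
```

This is exactly the "if and only if" condition Theorem 2 needs: the placed locks collide precisely when the target statements would conflict.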

For an OUT operation, we need to place a read lock on the statement that performs the predicate read, which becomes:


        select count(s_i_id) from stock 
           where s_i_id=:ol_i_id[i] and s_w_id=:w_id and s_quantity < :threshold
           lock in share mode;                                                                                                   
	  

o->p:

Fields the conflict is incident upon | Comments
C_BALANCE                            | item-based, two branches

The item read of C_BALANCE takes place in two branches, so we need to place a share lock in both select statements.

p->p:

Fields the conflict is incident upon | Comments
W_YTD                                | item-based, not a conflict
D_YTD                                | item-based, not a conflict
C_BALANCE                            | item-based
C_YTD_PAYMENT                        | item-based, not a conflict
C_PAYMENT_CNT                        | item-based, not a conflict
C_DATA                               | item-based

The first possible conflict occurs at the field W_YTD. But the read of W_YTD accompanies a write of it and happens after the write lock is acquired. This means that if there were a RW conflict on W_YTD, there would also be a WW conflict on the same field. But then the conflicting write would not write the next version of W_YTD that was read; the accompanying write would instead. This contradicts the definition of an item RW conflict. So this is not a real conflict.

This argument applies to the second possible conflict too.

For the third possible conflict, there are two item reads of C_BALANCE and they must be protected with read locks.

The fourth and fifth possible conflicts are not real for the same reason as the first and the second one.

For the sixth possible conflict, C_DATA must be protected with a read lock even though an update of it follows.

n->n:

Fields the conflict is incident upon | Comments
D_NEXT_O_ID                          | item-based, not a conflict
S_QUANTITY                           | item-based
S_YTD                                | item-based, not a conflict
S_ORDER_CNT                          | item-based, not a conflict
S_REMOTE_CNT                         | item-based, not a conflict

The second possible conflict (S_QUANTITY) must be protected with a read lock even though an update of it follows. The rest are not real because of the accompanying writes, as in the p->p case.

d1->n:

In [Fe 05], Fekete named the two execution paths of the delivery transaction logical transactions d1 and d2 respectively, where the former represents the case in which the predicate read of the first statement in the delivery transaction returns an empty set, while the latter represents the opposite.

The delivery transaction related conflicts are the most complex part of the conflict dependence graph in the TPCC benchmark, so we'll analyze them in detail. The description of the delivery business transaction says that for a given W_ID, one order in each of ten districts is delivered, although one may use one or up to ten database transactions to realize it. We've chosen the latter: if more than one order were delivered in one database transaction, we would have to hold locks longer than with a single order, and minimizing the duration of a lock is another important aspect of minimizing lock waits, besides minimizing dependence. In other words we assume that, given a W_ID and a D_ID, at most one order is delivered in the database transaction. Of course, in this approach we have to use a loop to serve every D_ID in the warehouse for a given W_ID.

The description also says 'a database transaction is started unless a database transaction is already active from being started as part of the delivery of a previous order (i.e., more than one order is delivered within the same database transaction)'. It seems to me this may imply that at any time there is only one delivery database transaction for each W_ID and D_ID pair in the system. If that were true, there couldn't be any conflicts between two delivery database transactions, since for such conflicts to exist, the W_ID and D_ID must match in the first place. This is apparent from the definition of the delivery transaction. But [Fe 05] still seems to assume there are conflicts between different delivery database transactions, and we will follow [Fe 05] for completeness. You should realize, however, that there might be a chance to simplify the conflict analysis we are performing here.

In d1, the first statement


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc;
	  

returns an empty set. If a new-order transaction then inserts a new order into the new-order table with matching W_ID and D_ID, that constitutes a predicate RW conflict. So we have the following

Fields the conflict is incident upon | Comments
NO_W_ID, NO_D_ID                     | predicate-based

To use granular locking as a prevention measure, we create a table named t_d1_n with two columns: NO_W_ID and NO_D_ID. One row is inserted into this table whenever a district is added, and the row is deleted whenever the corresponding district is removed. The following statement:


        select * from t_d1_n where no_d_id=:d_id and no_w_id=:w_id lock in share mode;                                                                                                
	  

is placed in the delivery transaction before the target statement:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc;
	  

And the following statement is placed at the end of the new-order transaction:


        select * from t_d1_n where no_d_id=:d_id and no_w_id=:w_id for update;
	  

This is for the following target statement:


        insert into new_order values(:o_ld, :d_id, :w_id);
	  

If the target insert were to cause a conflict, it would have to share the same NO_W_ID and NO_D_ID values as the target select; and the two locks placed conflict with each other if and only if both statements share the same NO_W_ID and NO_D_ID values. Hence Theorem 2 applies again.

Remark: The granularity of the lock dictates that in every district (NO_D_ID) of a warehouse (NO_W_ID), we have to serialize the delivery and new-order transactions. Although the delivery transaction's execution can be deferred, say, to after 2 a.m. when new order placements are sparse, this could still pose a problem if the business is global. In that case, careful scheduling of the delivery transactions, so that they are separated as much as possible from the new-order transactions in the same district, would be desirable.

d1->d2:

Fields the conflict is incident upon | Comments
NO_W_ID, NO_D_ID                     | predicate-based, not a conflict

In this case a delete in d2 is the write operation under consideration. This is not a conflict: the fact that d1's statement returns an empty set implies that every tuple with matching NO_W_ID and NO_D_ID in the new-order table is either in the initial or the dead state, so there is no chance a delete becomes the first operation to change the match.

d2->n:

Fields the conflict is incident upon | Comments
NO_W_ID, NO_D_ID                     | predicate-based, masked by locks for d1->n
NO_O_ID                              | item-based, not a conflict
NO_W_ID, NO_D_ID, NO_O_ID            | predicate-based, not a conflict
O_W_ID, O_D_ID, O_ID                 | predicate-based, not a conflict
O_C_ID                               | item-based, not a conflict
O_W_ID, O_D_ID, O_ID                 | predicate-based, not a conflict
OL_W_ID, OL_D_ID, OL_O_ID            | predicate-based, not a conflict
OL_W_ID, OL_D_ID, OL_O_ID            | predicate-based, not a conflict
OL_AMOUNT                            | item-based, not a conflict

The first possible conflict is based on the predicate read of the first statement, as in the d1->n case. Nothing needs to be done since the locks for d1->n already keep this pair sufficiently separated. That is because, although we distinguish d1 and d2 logically, the prevention measure is placed in the same physical delivery transaction.

The second possible conflict is not real since an item can't be read before it is inserted.

The third possible conflict is based on the predicate read of the second statement of d2. It is not a conflict because a tuple to be inserted into the new-order table always has a bigger NO_O_ID than those already in there.

The fourth possible conflict is based on the predicate read of the third statement of d2. It is not a conflict because a tuple to be inserted into the order table always has a bigger O_ID than those already in there.

The fifth possible conflict is not real since an item can't be read before it is inserted.

The sixth possible conflict is based on the predicate read of the fourth statement of d2. It is not a conflict for the same reason as the fourth one.

The seventh possible conflict is based on the predicate read of the fifth statement of d2. It is not a conflict because a tuple to be inserted into the order-line table always has a bigger OL_O_ID than those already in there.

The eighth possible conflict is based on the predicate read of the sixth statement of d2. It is not a conflict for the same reason as the seventh one.

The ninth possible conflict is not real since an item can't be read before it is inserted.

d2->d2:

Fields the conflict is incident upon | Comments
NO_W_ID, NO_D_ID                     | predicate-based, OUT operation
NO_O_ID                              | item-based, vanishes with the locks for the first conflict
NO_W_ID, NO_D_ID, NO_O_ID            | predicate-based, vanishes with the locks for the first conflict
C_BALANCE                            | item-based, not a conflict
C_DELIVERY_CNT                       | item-based, not a conflict

The logical transaction d2 is the bulk of the delivery transaction, in which the oldest row in the new-order table is deleted and the corresponding order is delivered. So let's take a closer look at it. The deletion is realized with the following two SQL statements:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc;

        delete from new_order where no_d_id=:d_id and no_w_id=:w_id and no_o_id=:no_o_id;                                                                                                    
	  

The first one represents a cursor that identifies the oldest row in the new-order table and the second one deletes it. The key issue here is the definition of d2. We've defined d1 to be the case where the first SQL statement returns an empty set. If we defined d2 to be the case where that SQL statement returns a non-empty set, the following abnormal situation would arise: the execution of a d2 returns the oldest row in the new-order table. But before it gets a chance to delete it (say, the thread that executes this transaction is context-switched out after executing the first SQL statement), a second d2 finds the same oldest row and completes the deletion first. Then the first d2 deletes nothing when it is its turn to finish what was left. In this case, the second SQL statement in the first d2 is actually executed but nothing really happens, and the statements following it in the transaction don't actually change the database either. By definition, this execution is classified as a d2, but logically it is really a d1. So it makes sense to make sure this abnormal situation cannot happen.
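The abnormal interleaving can be made concrete with a small simulation (plain Python, purely illustrative, with the new-order table modeled as a set of NO_O_ID values): both d2's read the same oldest NO_O_ID before either deletes, so the second delete removes nothing.

```python
# Illustrative simulation of the abnormal interleaving described above.
new_order = {3001, 3002, 3003}   # NO_O_ID values for one (W_ID, D_ID)

# Both d2's run their first statement (the select) before either deletes.
d2_first  = min(new_order)   # first d2 identifies the oldest row: 3001
d2_second = min(new_order)   # second d2 sees the very same row

# The second d2 finishes first and deletes the row.
new_order.discard(d2_second)

# When the first d2 resumes, its delete matches nothing:
deleted_by_first = d2_first in new_order   # False: the row is already gone
new_order.discard(d2_first)                # this delete is a no-op

assert not deleted_by_first
assert new_order == {3002, 3003}           # only one row was ever removed
```

The first d2 here is the execution that is classified as a d2 by definition but is logically a d1.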

The first conflict is between predicate read of the first statement and a delete operation in a second delivery transaction. By the discussion about OUT operations following Theorem 1, we know that all we have to do is place a share lock for the first statement as follows:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc
           lock in share mode;                                                                                                   
	  

Also, after this lock is in place, the abnormal situation we've described will not show up, since the second d2's deletion needs to wait until the first d2 gives up this read lock, and that happens only after all the SQL statements in the first d2, including the deletion, have been executed.

Remark 1: The abnormal situation arises because we implement the process of finding and deleting the oldest row with two SQL statements, and a deletion from another transaction can intervene between them. In a system where the updatable cursor is implemented in a way that this process cannot be interrupted, the abnormal situation wouldn't show up at all. In that case, in a locking system, if we may assume the cursor acquires a short (held until the cursor is closed) write lock on each row scanned and a long write lock on the matching oldest row, the long write lock of the first d2 will block the short write lock of the second d2 if these two transactions are concurrent. This renders the updatable cursor of the first d2 complete entirely before that of the second d2. On the other hand, in a possibly lock-free system implementing Snapshot Isolation, whether it supports updatable cursors or not, deleting the same row is classified as a write conflict and renders the two transactions isolated so that they are not concurrent. This is why this predicate RW conflict doesn't exist in Fekete's analysis of the TPCC application. The same reasoning applies to the second and the third possible conflicts for a Snapshot Isolation implementation.

Remark 2: After this lock is in place, the original predicate RW conflict vanishes. So this is an example we promised in Remark 3 of the discussion of OUT operations. The disappearance of this predicate RW conflict is consistent with the semantics of the d2 transaction: the process of identifying and deleting the oldest row in one d2 is NOT interrupted by another. It is consistent with the semantics of the application too: after all prevention measures are set, the application becomes serializable, and this predicate RW conflict shouldn't show up in a serial execution of two d2's, since in such an execution the first d2 changes the match of the predicate read itself. This is also an example of an application behaving differently under different isolation levels: under one with a conflict, under the other without.

There is an issue with this approach though. Since we use a read lock to fence off the OUT operation, there is a chance that both d2's acquire the read lock on the oldest row in the new-order table and wait for each other to give it up so that the deletion can proceed. This creates a deadlock and one of the transactions must be rolled back. The standard solution in the industry uses a type of read lock called an 'update lock' to resolve this kind of situation. An 'update lock' is an intention lock that signifies the intention to modify the locked item afterwards; it is compatible with regular read locks, but not with 'update locks' or write locks. In particular, two transactions requesting an 'update lock' on the same item will not both be granted it. In NDB Cluster, this is a luxury since the 'update lock' is not implemented. So if we don't want to roll back transactions to resolve the deadlock, we can use a write lock instead:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc
           for update;                                                                                                    
	  

This deadlock situation is not new. In the previous analysis, the item-based conflicts for the C_BALANCE and C_DATA fields in the p->p case and the S_QUANTITY field in the n->n case all demonstrate this phenomenon. So we may also replace the added read lock in those prevention measures with an update lock (or, in NDB Cluster, a write lock) if we don't want the deadlock resolution mechanism to handle it. Also, after these locks are in place, the conflicts disappear since such a field is updated first in its own transaction.
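The lock-mode compatibility just described can be summarized in a small matrix. This is an illustrative Python sketch, not real lock-manager code; 'U' stands for the 'update lock', which NDB Cluster does not implement.

```python
# Illustrative lock-mode compatibility matrix: S = share, U = update,
# X = exclusive/write.  True means the two modes can be held together
# on the same item by different transactions.
COMPATIBLE = {
    ('S', 'S'): True,  ('S', 'U'): True,  ('S', 'X'): False,
    ('U', 'S'): True,  ('U', 'U'): False, ('U', 'X'): False,
    ('X', 'S'): False, ('X', 'U'): False, ('X', 'X'): False,
}

# Two d2's taking 'S' on the same oldest row can both succeed, which is
# exactly what sets up the deadlock when each later wants 'X' to delete:
assert COMPATIBLE[('S', 'S')]

# With update locks, the second d2 blocks immediately on its first read,
# so neither transaction gets stuck halfway holding a lock the other needs:
assert not COMPATIBLE[('U', 'U')]
```

Using 'X' from the start, as in the `for update` variant above, is the coarser substitute available when 'U' is missing: it also serializes the two d2's, at the cost of blocking plain readers as well.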

This scenario is similar to the case where a possible item RW conflict is not real because of the accompanying write, as in the p->p case; the difference is that we have to have the read lock (or update lock, or write lock) in place, since another transaction can intervene between the process of identifying which row to update and the update itself.

The second possible conflict is from the item read of NO_O_ID in the first statement, while the third possible one is from the predicate read of the second statement. For them to be feasible at all, the NO_O_ID of the row deleted in the second d2 must be equal to the oldest NO_O_ID value in the new-order table. But from the previous discussion, such a deletion needs to wait until the first d2 gives up its locks, and by then the row has already been deleted by the first d2. We certainly can't delete a row twice. Hence both conflicts vanish because of the locks for the first conflict.

The rest are not real because of the accompanying write as in the p->p case.

d2->p:

Fields the conflict is incident upon | Comments
C_BALANCE                            | item-based, not a conflict

It is not real because of the accompanying write as in the p->p case.

p->d2:

Fields the conflict is incident upon | Comments
C_BALANCE                            | item-based

There are two item reads of C_BALANCE. They are protected with the read locks from the p->p case, so nothing needs to be done here. Also, after these locks are in place, the conflicts disappear since such a field is updated first in its own transaction.

o->d2:

Fields the conflict is incident upon | Comments
C_BALANCE                            | item-based
O_CARRIER_ID                         | item-based
OL_DELIVERY_D                        | item-based

The first possible conflict is item-based and occurs in two places. However, both are already protected with the read locks from the o->p case, so nothing needs to be done.

For the second and third possible conflicts, read locks need to be placed at the corresponding SQL statements.

The previous detailed analysis gives rise to the following set of altered transactions, with the prevention measures marked by comments such as //n->n:


        The New-Order transaction(Read-Write):

        select w_tax from warehouse where w_id=:w_id;

        select d_tax from district where d_id=:d_id and d_w_id=:w_id;

        select d_next_o_id from district where d_id=:d_id and d_w_id=:w_id;

        update district set d_next_o_id=:d_next_o_id+1 where d_id=:d_id and d_w_id=:w_id;

        select c_discount, c_last, c_credit from customer 
           where c_w_id=:w_id and c_d_id=: c_d_id and c_id=:c_id;

        insert into orders values(:o_id, :d_id, :w_id, :c_id, :datetime, NULL, :o_ol_cnt, :o_all_local);

        insert into new_order values(:o_ld, :d_id, :w_id);

        And for each item we're going to order, we execute the following block of statements in a for loop:

        {    
            select i_price, i_name, i_data from item where i_id=:ol_i_id;

            //n->n
            select s_quantity, s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04,
               s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10
               from stock where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id
               lock in share mode;

            update stock set s_quantity=:s_quantity where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_ytd=s_ytd+:ol_quantity
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_order_cnt=s_order_cnt+1
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            if (:ol_supply_w_id !=:w_id) {
               update stock set s_remote_cnt=s_remote_cnt+1
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;
            }

            insert into order_line values(:o_id, :d_id, :w_id, :ol_number, :ol_i_id,
               :ol_supply_w_id, NULL, :ol_quantity, :ol_amount, :ol_dist_info);
        }
		
        //s->n
        select * from t_s_n where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id for update;

        //o->n
        select * from t_o_n where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id for update;

        //d1->n
        select * from t_d1_n where no_d_id=:d_id and no_w_id=:w_id for update;

        The Payment transaction(Read-Write):

        select w_street_1, w_street_2, w_city, w_state, w_zip, w_name 
           from warehouse where w_id=:w_id;

        update warehouse set w_ytd=w_ytd+:h_amount where w_id=:w_id;

        select d_street_1, d_street_2, d_city, d_state, d_zip, d_name 
           from district where d_w_id=:w_id and d_id=:d_id;

        update district set d_ytd=d_ytd+:h_amount where d_w_id=:w_id and d_id=:d_id;

        if the customer making the payment is represented by a name{          //60% chances

            select count(c_id) into :namecnt from customer 
               where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

            //p->p, p->d2
            declare c_byname cursor for
               select c_first, c_middle, c_id, c_street_1, c_street_2, c_city, c_state, 
                  c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
                  from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
                  order by c_first
                  lock in share mode;

            open c_byname;

            if(:namecnt%2) :namecnt++;
            for (n=0;n < :namecnt/2;n++) {
                fetch c_byname
                   into :c_first, :c_middle, :c_id, :c_street_1, :c_street_2, :c_city, :c_state, 
                   :c_zip, :c_phone, :c_credit, :c_credit_lim, :c_discount, :c_balance, :c_since
            }

            close c_byname;
        }
        else if the customer making the payment is represented by an id{      //40% chances
            
            //p->p, p->d2
            select c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, 
               c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
               from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
               lock in share mode;
        }

        update customer set c_balance=c_balance-:h_amount
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_ytd_payment=c_ytd_payment+:h_amount
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_payment_cnt=c_payment_cnt+1
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        if (:c_credit="BC"){

            //p->p
            select c_data from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
               lock in share mode;

            update customer set c_data=:c_data
               where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;
        }    

        insert into history values(:c_d_id, :c_w_id, :c_id, :d_id, :w_id, :datetime, :h_amount, :h_data);

        The Order-Status transaction(Read-Only):

        if the customer in the order is represented by a name{                         //60% chances

            select count(c_id) into :namecnt from customer 
               where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

            //o->p and o->d2
            declare c_name cursor for 
               select c_balance, c_first, c_middle, c_id from customer
                  where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
                  order by c_first
                  lock in share mode; 

            open c_name;

            if (:namecnt%2) :namecnt++;
            for (n=0;n < :namecnt/2;n++) {
                fetch c_name
                   into :c_balance, :c_first, :c_middle, :c_id;
            }

            close c_name;
        } 
        else if the customer in the order is represented by an id{                    //40% chances

            //o->p and o->d2
            select c_balance, c_first, c_middle, c_last from customer
               where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
               lock in share mode;
        }
		
        //o->n
        select * from t_o_n 
           where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
           lock in share mode;
		   
        //o->d2
        declare c_order cursor for
           select o_id, o_carrier_id, o_entry_d from orders 
              where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id
              order by o_id desc
              lock in share mode;

        open c_order;

        fetch c_order 
           into :o_id, :o_carrier_id, :o_entry_d;

        close c_order;
		
        //o->d2
        select ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_delivery_d from order_line
           where ol_d_id=:c_d_id and ol_w_id=:c_w_id and ol_o_id=:o_id
           lock in share mode;

        The Delivery transaction(Read-Write):
		
        //d1->n
        select * from t_d1_n where no_d_id=:d_id and no_w_id=:w_id 
           lock in share mode;
		   
        //The following read lock can be removed if only one delivery transaction with
        //matching NO_W_ID and NO_D_ID is allowed at any given time
        //d2->d2
        declare c_no cursor for
           select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
              order by no_o_id asc
              lock in share mode;

        open c_no;
 
        fetch c_no into :no_o_id;

        close c_no;

        if the former cursor returns a non-empty set{

            delete from new_order where no_d_id=:d_id and no_w_id=:w_id and no_o_id=:no_o_id;

            //In TPCC's delivery transaction, the previous two statements are expressed as an updatable
            //cursor. Since NDB Cluster doesn't support updatable cursors, I've changed it to the
            //previous two statements which are semantically equivalent.

            select o_c_id from orders where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;

            update orders set o_carrier_id=:o_carrier_id
               where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;

            update order_line set ol_delivery_d=:datetime
               where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;

            select sum(ol_amount) from order_line 
               where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;

            update customer set c_balance=c_balance+:ol_total
               where c_id=:c_id and c_d_id=:d_id and c_w_id=:w_id;

            update customer set c_delivery_cnt=c_delivery_cnt+1
               where c_id=:c_id and c_d_id=:d_id and c_w_id=:w_id;
        }

        The Stock-Level transaction(Read-Only):

        //s->n
        select d_next_o_id from district where d_w_id=:w_id and d_id=:d_id 
           lock in share mode;

        select distinct(ol_i_id) from order_line 
           where ol_w_id=:w_id and ol_d_id=:d_id and ol_o_id<:o_id and ol_o_id>=:o_id-20;

        for each ol_i_id obtained in the last statement, assuming it is stored in an array cell :ol_i_id[i] {
		
            //s->n
            select * from t_s_n
               where s_i_id=:ol_i_id[i] and s_w_id=:w_id
               lock in share mode;

            select count(s_i_id) from stock 
               where s_i_id=:ol_i_id[i] and s_w_id=:w_id and s_quantity<:threshold;
        }                                                                                                 
	  


Remark: A helper program to generate the tables in this example is available here

There is one issue, though, with the way we prevent the predicate RW conflicts in NDB Cluster: we introduce new tables and possible accesses to them. Does this cause extra conflicts and require more prevention measures? It does introduce new conflicts; for example, inserting and deleting a row in such a table gives rise to a WW conflict. But it turns out it doesn't require more prevention measures. We will give a proof after we introduce the Ramification Theorem in the next section.

2.6 Application of the Generalized Serializability Theorems to TPCC

So this is what we get if we apply type B of the Serializability Theorem to the TPCC benchmark. How about the Generalized Serializability Theorems? The Generalized Serializability Theorems say that inconsistencies don't show up in subset C; they only show up in its complement, a subset of the Read-Only transactions. Let's name this complement I, as in inconsistency. In a real world application, inconsistencies in these Read-Only transactions might be displayed on some sort of monitor and are in general harmless since they won't affect the underlying database. But there are exceptions. Consider the following

Example 17: A stock market application uses the Generalized Serializability Theorems instead of the Serializability Theorem to boost performance. A Read-Only transaction dynamically retrieves info from the market and prints it to a monitor screen, helping investors make their decisions. It happens that this Read-Only transaction belongs to subset I, so the data it prints on the monitor can be inconsistent. An investor reads the inconsistent info and makes a purchase in the market, leading to a monetary loss. Or even worse: the application represents operations on a nuclear reactor and the data printed on the monitor screen summarizes the reactor's status; an inconsistency might prompt an operator to perform an operation that leads to a catastrophe.

If you wonder whether there is a way to explain this kind of undesirable behavior, here is how: think of the monitor screen as data fields in the database and the printing as writes to them. Then think of the individual (for example, an investor or an operator) reading those fields as a transaction and his/her operation as a write to the database. This way, the original Read-Only transaction becomes a Read-Write transaction and must be accounted for in subset C instead of subset I when we try to apply the Generalized Serializability Theorems to this extended database application.



So let's now apply the Generalized Serializability Theorems to the TPCC benchmark and see what happens. From the analysis in Example 16, we know that we have prevention measures between the following transaction pairs to keep a conflict loop from showing up: o->p, o->d2, o->n, d1->n, d2->d2, s->n, p->p, n->n. Notice that although d1 is logically a Read-Only transaction, we can't readily remove the prevention measures for d1->n since they also serve d2->n. So if we apply Generalized Serializability Theorem I, only four pairs remain significant: d1->n, d2->d2, p->p, n->n. In a real world application, if we could arrange the delivery transactions to run in a period of time when new-order transactions are not allowed (for example, for a region, scheduling delivery processing at 4 a.m. local time while suspending incoming orders could be one way to leverage this), we may remove the prevention measures for d1->n (and d2->n too), since Condition * is then guaranteed to be satisfied for this RW conflict. If we further constrain the application at the second tier to make sure that no two delivery transactions with matching W_ID and D_ID may be active simultaneously, we can get rid of the prevention measures for d2->d2 too. This means we've just proved the following

Theorem 4': With the constraints stated in the last paragraph and the prevention measures for conflicts p->p and n->n as described in Example 16, the TPCC benchmark leaves an NDB Cluster database in a consistent state if it started consistent.

Remark: In this theorem we've demonstrated not only using Generalized Serializability Theorems, but also rewriting the application, to reduce consistency related overhead while preserving consistency. We call it Theorem 4' instead of Theorem 4 since we still need to take care of the timestamp columns and the statements in the theorem are not exactly correct yet. But it is important to get the idea about what we are trying to achieve here. We'll present Theorem 4 after we cover the Ramification Theorem. So we are getting ahead of ourselves again.

This is not as cool as Fekete's result that the TPCC benchmark is serializable under Snapshot Isolation, but it is still valuable since only minimal prevention measures have to be placed in the payment and new-order transactions to achieve the desired consistency. The significance of this result actually extends further: with the minimal prevention measures in place, it will probably perform well in a dbt2-like benchmark; and if it does, it suggests not only that we've solved the consistency problem for TPCC under NDB Cluster's Read-Committed isolation level, but also that this systematic method of using the Generalized Serializability Theorems will perform reasonably well for many applications under that isolation level, since TPCC itself is a rather complex one. We will try to collect more evidence to support this conclusion down the road.

Somebody might argue that it would not be reasonable to allow possible inconsistencies in the order-status transaction since they would lead to an unpleasant customer experience. On the other hand, the situation is more tolerable for the stock-level transaction since a manager can go down to the warehouse to check whenever the data shows a dangerously low stock for an important product. If these requirements hold, we need to include back the prevention measures for the order-status transaction, namely those for o->p, o->d2 and o->n. Of course, this implies Generalized Serializability Theorem II is being applied.

Remark: This method of using the Generalized Serializability Theorems can certainly be applied to MySQL InnoDB's Read-Committed and Repeatable-Read isolation levels. In Snapshot Isolation like PostgreSQL's, a conflict loop is only possible if it contains at least two consecutive RW conflicts. If we could move a Read-Only transaction T from subset C to subset I(or design the application to be so), we would potentially reduce the RW conflicts in subset C since all those starting from T are removed. This way we might be able to conclude that the rest of subset C contains no loop and is therefore serializable, a conclusion we couldn't have drawn with the original subset C. In this approach, we essentially sacrifice the consistency of Read-Only transactions for performance, and this can be a good trade-off since the database will remain consistent.
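The pruning step in this remark can be sketched in ordinary code. The sketch below is plain Python with invented transaction names, not PostgreSQL or NDB Cluster code: it models subset C as a directed conflict graph, removes the RW edges starting from a Read-Only transaction that we move to subset I, and re-checks for a conflict loop.

```python
# A minimal sketch (assumed names throughout): subset C as a directed
# conflict graph; moving a Read-Only transaction out removes its RW edges.

def has_cycle(edges):
    """Detect a directed cycle (conflict loop) with iterative DFS."""
    graph = {}
    for src, dst, _kind in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GRAY:
                return True            # back edge: a conflict loop exists
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(graph[nxt])))
    return False

# A loop t_ro -RW-> t1 -WW-> t2 -WR-> t_ro makes subset C non-serializable...
edges = [("t_ro", "t1", "RW"), ("t1", "t2", "WW"), ("t2", "t_ro", "WR")]
assert has_cycle(edges)

# ...but after moving the Read-Only transaction t_ro to subset I, all RW
# edges starting from it are gone and the remaining subgraph is loop-free.
pruned = [e for e in edges if not (e[0] == "t_ro" and e[2] == "RW")]
assert not has_cycle(pruned)
```

The trade-off described above shows up directly: t_ro itself is no longer guaranteed a consistent view, but the transactions left in subset C are.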

3. The Ramification Theorem(s)

Now we are one step away from applying the Serializability Theorems and the Generalized Serializability Theorems to the TPCC benchmark application: these Theorems don't allow timestamp fields, but there are five fields in the application that are of, or contain, the datetime data type: c_since, c_data, h_date, o_entry_d, ol_delivery_d. Before introducing the Ramification Theorem to deal with this issue, let's look at an example.

3.1 Issues with timestamp related fields

Example 18: Let t1, t2 be two tables in NDB Cluster. Execute the following pseudo-code in the order presented


                  T1                                                        T2

                                                                        start transaction;

        start transaction;

        select a row r1 in t1;

                                                                        update r1 in t1;

                                                                        insert a row r2 into t2 such that a field f
                                                                        of r2 is the value returned from NOW(); 

                                                                        commit;  

        insert a row r3 into t2 such that a field f
        of r3 is the value returned from NOW(); 

        commit;                                                                                          
	  

The only conflict in this history is the item-based RW conflict on r1 from T1 to T2. Hence the history should be serializable if the Serializability Theorem allowed fields of datetime data type, and the order of the serial execution should be T1, T2. However, this contradicts the fact that r2's timestamp field holds a time earlier than r3's.


	                                                                                                       ## 
	  

To understand why this inconsistency shows up, we introduce a type of virtual transaction T and a virtual field named 'system time' into the database. Each transaction T updates 'system time' once per time resolution of the system and commits, and the NOW() function is implemented as a select on this field. This way, there is an item-based RW conflict on 'system time' from T2 to some T; and there is also an item-based WR conflict on this field from T to T1, for possibly another T(there could be a sequence of Ts in between, forming a sequence of WW conflicts in general). Hence a conflict loop is formed and the inconsistency is explained. Unfortunately, this won't help us much since T is virtual and we can't place prevention measures in it. One possible approach is 'if we can't prevent it, we constrain it', and this idea leads to the Ramification Theorem.
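The virtual-transaction argument can be made concrete with a toy model. In the sketch below (plain Python, all names invented), a monotone counter stands in for the virtual 'system time' field, each tick being one committed virtual transaction T, and NOW() is a select on that field; replaying the interleaving of Example 18 reproduces timestamps that no serial order T1, T2 could produce.

```python
# Toy model of Example 18 (hypothetical, not NDB Cluster code).
clock = 0

def now():
    """NOW() as a select on the virtual 'system time' field; every tick
    before it is one committed virtual transaction T updating the field."""
    global clock
    clock += 1
    return clock

t1_table = {"r1": "old"}
t2_table = {}

# Interleaving from Example 18:
_ = t1_table["r1"]                  # T1: select r1      (RW: T1 -> T2)
t1_table["r1"] = "new"              # T2: update r1
t2_table["r2"] = {"f": now()}       # T2: insert r2 with NOW()
#                                     T2: commit
t2_table["r3"] = {"f": now()}       # T1: insert r3 with NOW()
#                                     T1: commit

# The only conflict on user data orders T1 before T2, yet r2's timestamp
# precedes r3's -- impossible in the serial history T1, T2, where T1's
# NOW() call would have executed first.
assert t2_table["r2"]["f"] < t2_table["r3"]["f"]
```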

Remark: In PostgreSQL's Serializable Snapshot Isolation, the history in Example 18 also executes successfully. The 'start transaction' statement of T2 must precede that of T1 for T2's timestamp to precede T1's, since the timestamp represents the time corresponding to the snapshot taken when the transaction starts. In MySQL InnoDB's serializable isolation level, this history doesn't show up since an item read requires a read lock, so T2's update would have to wait until after T1's commit. But an example in which we insert NOW() into t2 twice from T1 and once from T2, so that T2's insertion is interleaved between T1's two, would indicate a problem of the same nature. That means the so-called serializable isolation levels they've implemented are only correct less the datetime-relevant fields. I'll define precisely what this means when I present the Ramification Theorem.

Before we introduce the Ramification Theorem, let's look at a few examples about how inconsistency ramifies from field to field.

Extension #1 of Example 18: Suppose there is another table t3 in the database and more statements are added to both transactions before the commit statement.


	              T1                                                        T2

                                                                    start transaction;

        start transaction;

        select a row r1 in t1;

                                                                    update r1 in t1;

                                                                    insert a row r2 into t2 such that a field f
                                                                    of r2 is the value returned from NOW();

                                                                    select f from r2; 

                                                                    insert a row r4 to t3 such that a field f1 of
                                                                    r4 is the selected value in previous select;                             
                                                                    //The previous two statements can be
                                                                    //combined into a statement like:
                                                                    //insert into t3 values (id, select f from…);

                                                                    commit;  

        insert a row r3 into t2 such that a field f
        of r3 is the value returned from NOW();

        select f from r3; 

        insert a row r5 to t3 such that a field f1 of
        r5 is the selected value in previous select;
        //The previous two statements can be
        //combined into a statement like:
        //insert into t3 values (id, select f from…);

        commit;                                                                    
	  

In this extension we can use the field f1 in t3 to produce an inconsistency similar to the one in the original example. So apparently the inconsistency ramifies from field f to field f1. In this case, the original datetime field f is used as a variable in the identity function for the field write of f1.


	                                                                                                       ## 
	  

Inconsistencies can ramify to other fields not only through item reads, but also through predicate reads.

Extension #2 of Example 18: A table t3 is also added here.


	                T1                                                    T2

                                                                    start transaction;

        start transaction;

        select a row r1 in t1;

                                                                    update r1 in t1;

                                                                    insert a row r2 into t2 such that a field f
                                                                    of r2 is the value returned from NOW();

                                                                    update a field f1 for every tuple in t2
                                                                    where f >= '2019-12-31 00:00:00'
                                                                    and f < '2020-01-01 00:00:00';                           

                                                                    commit;  

        insert a row r3 into t3 such that a field f
        of r3 is the value returned from NOW();

        update a field f1 for every tuple in t3
        where f >= '2019-12-31 00:00:00'
        and f < '2020-01-01 00:00:00';

        commit;                                                                                                
	  

In this extension, if T1 and T2 are both executed near 2020-01-01 00:00:00, there is a chance that r2 is updated but r3 is not. On the other hand, if the history were serializable, r3 should be updated too. Notice that in this extension it is the update operation itself(whether a row is updated) which demonstrates the inconsistency, NOT the value written to f1.


	                                                                                                       ## 
	  

We can see this more clearly in the next example.

Example 19: An 'account' field in a banking application is assumed to be always non-negative(an account can't be overdrawn). But this field in a tuple became negative anyway, say because a write-skew anomaly happened. Then a routine regularly counts these tuples with a statement like 'select COUNT(*) … where … and account >= 0;' and writes the result to a field 'total accounts'. The bank manager might then notice that the total number of accounts in the bank is one less, although accounts are never deleted without the bank manager's involvement. In this example, the inconsistency in that specific 'account' field ramifies to the 'total accounts' field through a predicate read, since the predicate read is the only operation in that SQL statement. Notice the SQL statement must be executed at least twice for the bank manager to notice the inconsistency.
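A minimal sketch of this example (plain Python with made-up balances, not real banking code): one 'account' value has gone negative through a write-skew anomaly, and the counting routine's predicate read carries the inconsistency into the 'total accounts' field.

```python
# Hypothetical data: the balance -30 violates the non-negativity invariant,
# say because of a write-skew anomaly.
accounts = [100, 250, 0, -30, 80]

def refresh_total_accounts():
    # The predicate read: count tuples matching 'account >= 0', mimicking
    # 'select COUNT(*) ... where ... and account >= 0;'.
    return sum(1 for balance in accounts if balance >= 0)

total_accounts = refresh_total_accounts()

# No account was ever deleted, yet the stored total is one short of the
# real number of accounts: the inconsistency is now visible in a new field.
assert total_accounts == len(accounts) - 1
```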


	                                                                                                       ## 
	  

Let's look at a more complex case.

Extension #3 of Example 18: The conditions are the same as in Extension #1 except that a table t4 is added.


	              T1                                                              T2

                                                                            start transaction;
 
        start transaction;

        select a row r1 in t1;
                                                                                         

                                                                            update r1 in t1;

                                                                            insert a row r2 into t2 such that fields f, f1
                                                                            of r2 are the value returned from NOW()
                                                                            and a positive integer respectively;

                                                                            select f1 from t2 where 
                                                                            f >= '2019-12-31 00:00:00'
                                                                            and f < '2020-01-01 00:00:00' 
                                                                            and store it into variable c; 

                                                                            insert a row r4 to t4 such that a field f2 of
                                                                            r4 is c;                             

                                                                            commit;  

        insert a row r3 into t3 such that fields f, f1
        of r3 are the value returned from NOW()
        and a positive integer respectively;

        select f1 from t3 where 
        f >= '2019-12-31 00:00:00'
        and f < '2020-01-01 00:00:00'
        and store it into variable c;

        //assuming c got a default value 0 
        insert a row r5 to t4 such that a field f2 of
        r5 is c;  

        commit;                                                                                              
	  

In this extension, if the insertions into t2, t3 and t4 are all executed near 2020-01-01 00:00:00, T2 could insert a tuple into t4 with a positive integer in f2 while T1 inserts one with the default value 0, since the second select in T1 returned an empty set. On the other hand, if the history were serializable, the insertion into t4 in T1 should also contain a positive integer. This is a contradiction. So the inconsistency has ramified from f1 in t2 and t3 to field f2 in t4. In this case, there are two predicate reads, one operating on t2 and the other on t3. Notice that although the value in f2 is determined by f1, this new inconsistency is caused by the predicate reads rather than the item reads, since it is the sign of f2 that matters, not its value.


	                                                                                                       ## 
	  

One interesting thing about this extension is that we can also see the inconsistency in variable c. In other words, if the history were serializable, c in T1 should contain a positive value too, instead of the null value in it. The fact that inconsistency can be temporarily stored in a variable provides a second avenue of ramification through a predicate read. In fact, sometimes the values provided for columns in a predicate are given in a variable. The following extension demonstrates exactly that.

Extension #4 of Example 18: The conditions are the same as in Extension #3.


	             T1                                                              T2

                                                                            start transaction;  

        start transaction;

        select a row r1 in t1;

                                                                            update r1 in t1;

                                                                            insert a row r2 into t2 such that fields f, f1
                                                                            of r2 are the value returned from NOW()
                                                                            and a positive integer respectively;

                                                                            select f1 from t2 where 
                                                                            f >= '2019-12-31 00:00:00'
                                                                            and f < '2020-01-01 00:00:00' 
                                                                            and store it into variable c; 

                                                                            update t4 set … where … and col = c;                             

                                                                            commit;  

        insert a row r3 into t3 such that fields f, f1
        of r3 are the value returned from NOW()
        and a positive integer respectively;

        select f1 from t3 where 
        f >= '2019-12-31 00:00:00'
        and f < '2020-01-01 00:00:00'
        and store it into variable c;

        //assuming c got a default value 0
        update t4 set … where … and col = c;

        commit;                                                                                                   
	  

In this extension, if the insertions into t2 and t3 are both executed near 2020-01-01 00:00:00, T2 could update the tuples with col = c > 0 while T1 updates those with col = 0. On the other hand, if the history were serializable, the updates in T1 should also happen on tuples with col = c > 0. This is a contradiction.


	                                                                                                       ## 
	  

This is how a variable affects the process of identifying which data objects to access by affecting the predicate read. Since this process sometimes includes a cursor, there is a second way a variable can affect it. The following example from the TPCC application demonstrates this.

Example 20: In the payment transaction of the TPCC benchmark, when a customer is represented by last name, the following statements are used to retrieve the necessary info:


        select count(c_id) into :namecnt from customer 
           where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

        declare c_byname cursor for 
           select c_first, c_middle, c_id, c_street_1, c_street_2, c_city, c_state, 
              c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
              from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
              order by c_first;

        open cursor c_byname;

        if(namecnt%2) namecnt++;
        for(n=0; n<namecnt/2; n++)
        {
           fetch c_byname
              into :c_first, :c_middle, :c_id, :c_street_1, :c_street_2, :c_city, :c_state,
              :c_zip, :c_phone, :c_credit, :c_credit_lim, :c_discount, :c_balance, :c_since; 
        }

        close c_byname;                                                                                              
	  

The first statement computes a characteristic of the predicate read, the cardinality of its result set, and stores it in the variable :namecnt. The rest of the code represents a typical case where the process of identifying which data objects to access equals the predicate read specified in the WHERE clause plus a cursor iteration. What stands out here is that :namecnt doesn't affect the predicate read; it affects the subsequent cursor iteration instead, and hence affects the whole process. In particular, imagine the case where inconsistency is stored in :namecnt.
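The role of :namecnt can be sketched as follows (an illustrative Python rendering with invented customer data, not the TPCC reference implementation): the count from the first statement drives the fetch loop, so an inconsistent :namecnt changes which row the cursor iteration stops at, even though the predicate read itself is untouched.

```python
# Hypothetical customers matching the predicate read, ordered by c_first
# as the cursor declaration specifies.
customers = sorted(
    [{"c_first": n, "c_last": "SMITH"} for n in ("ANN", "BOB", "CAL", "DEE", "EVE")],
    key=lambda c: c["c_first"],
)

def lookup_by_last_name(rows, namecnt):
    """Mimic the cursor iteration: fetch rows until position namecnt/2."""
    if namecnt % 2:
        namecnt += 1                  # if(namecnt%2) namecnt++;
    picked = None
    for n in range(namecnt // 2):     # for(n=0; n<namecnt/2; n++) fetch ...
        picked = rows[n]
    return picked

# With the correct count (5 matching rows) the middle customer is fetched...
assert lookup_by_last_name(customers, len(customers))["c_first"] == "CAL"
# ...but an inconsistent :namecnt makes the same cursor stop elsewhere,
# without the predicate read itself being touched.
assert lookup_by_last_name(customers, 3)["c_first"] == "BOB"
```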


	                                                                                                       ## 
	  

There is yet another way that inconsistency ramifies, namely, through a branch. The following two extensions demonstrate exactly this.

Extension #5 of Example 18: The conditions are the same as in Extension #3.


	              T1                                                                 T2

                                                                            start transaction;

        start transaction;

        select a row r1 in t1;

                                                                            update r1 in t1;

                                                                            insert a row r2 into t2 such that a field f
                                                                            of r2 is the value returned from NOW();

                                                                            select f from r2 and store it into variable c; 
                                                 
                                                                            if (c >= '2019-12-31 00:00:00'
                                                                            and c < '2020-01-01 00:00:00') {

                                                                            insert a row r4 into t4;}                             

                                                                            commit;  

        insert a row r3 into t3 such that a field f
        of r3 is the value returned from NOW();

        select f from r3 and store it into variable c;
 
        if (c >= '2019-12-31 00:00:00'
        and c < '2020-01-01 00:00:00') {

        insert a row r5 into t4;}  

        commit;                                                                                                  
	  

In this extension there could be only one tuple in t4, inserted by T2, if the NOW() function calls happened near '2020-01-01 00:00:00'. But if the history were serializable, there should be an insert into t4 from T1 too. That is a contradiction.


	                                                                                                       ## 
	  

Extension #6 of Example 18: The conditions are the same as in Extension #3.


	                T1                                                              T2

                                                                            start transaction;

        start transaction;

        select a row r1 in t1;

                                                                            update r1 in t1;

                                                                            insert a row r2 into t2 such that a field f 
                                                                            of r2 is the value returned from NOW(); 

                                                                            select COUNT(*) from t2 where
                                                                            f >= '2019-12-31 00:00:00'
                                                                            and f < '2020-01-01 00:00:00';

                                                                            //Assuming the result of previous select is
                                                                            //stored in a variable c
                                                                            if c > 0{

                                                                            insert a row r4 into t4;}

                                                                            commit;  

        insert a row r3 into t3 such that a field f
        of r3 is the value returned from NOW();

        select COUNT(*) from t3 where
        f >= '2019-12-31 00:00:00'
        and f < '2020-01-01 00:00:00';

        //Assuming the result of previous select is
        //stored in a variable c
        if c > 0{

        insert a row r5 into t4;}

        commit;                               
	  

In this extension, if both 'select COUNT(*) ...' statements are executed near 2020-01-01 00:00:00, T2 may insert a tuple into t4 while T1 may not. On the other hand, if the history were serializable, T1 should insert a tuple into t4 whenever T2 does. This is a contradiction.


	                                                                                                       ## 
	  

3.2 The process of ramification

Based on the extensions and examples we've just discussed, let's try to summarize how inconsistencies can ramify from a base set B of columns and what a ramify operation should be. Firstly, the ramify operation has to read some fields from the base set B, and there are obviously two ways of doing that: through an item read or a predicate read. Secondly, for the inconsistencies to ramify to a field outside the base set B, a write to that field is a must. What could happen between these two kinds of reads and the write is what interests us.

The first situation is when the read is item-based and the field written is represented as a function F whose variables assume values returned from the fields read. This is the most common case, and a typical scenario is a statement like 'update col=col+1 …'. Extension #1 serves as a more complex example; in that specific case, f1=F(f) and F is the identity function.

The case that starts with a predicate read is more complex since there are two ways it may come into play. The second situation, which we discuss in the next few paragraphs, represents one such way; it is demonstrated in Extension #2 and Example 19. For a predicate read with predicate P, call the set of fields P evaluates the 'decision set' and define


      D := { (Id, x1, x2, …, xn) | Id being the unique identifier of a tuple P evaluates;
      xi, 0 < i <= n, being the values for the fields in the 'decision set' of this tuple}.
	

We further define NULL to be a value that doesn't equal any tuple, and


      I := { tuples uniquely identified by Id} U { NULL}.                                                                                                    
	

Then we may define a map M: D → I as follows to represent the predicate read operation:


      M(Id, x1, x2, …, xn) = the tuple identified by Id, if (x1, x2, …, xn) matches P;
      M(Id, x1, x2, …, xn) = NULL, otherwise.                                                                                                    
	

The image Im satisfying M(D) = Im V { NULL }, where V denotes the disjoint union, is the set of all matching tuples for P. We use this map M to interpret the process of identifying the matching tuples in the predicate read for P.
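The map M can be transcribed almost literally into code. In the Python sketch below, the predicate, tuple identifiers and field values are invented for illustration; NULL is rendered as None.

```python
# A direct transcription of M: D -> I (all concrete values are assumed).
NULL = None

def make_M(P, tuples_by_id):
    """Build the map M for predicate P over the given tuples."""
    def M(Id, *xs):
        # M(Id, x1, ..., xn) = the tuple identified by Id if the
        # decision-set values match P, NULL otherwise.
        return tuples_by_id[Id] if P(*xs) else NULL
    return M

# Example: P is 'f >= 10', so the decision set is the single field f.
tuples_by_id = {1: ("r1", 5), 2: ("r2", 12), 3: ("r3", 40)}
M = make_M(lambda f: f >= 10, tuples_by_id)

# D: one (Id, f) element per tuple P evaluates.
D = [(Id, row[1]) for Id, row in tuples_by_id.items()]

# M(D) = Im V { NULL }; removing NULL leaves exactly the matching tuples.
Im = {M(Id, x) for Id, x in D} - {NULL}
assert Im == {("r2", 12), ("r3", 40)}
```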

Under this map, inconsistency inside a set of tuples, which is a characteristic of the tuple set, can become visible in Im. Notice that it is only when the set of fields containing the inconsistency intersects the 'decision set' that we could possibly see the inconsistency in Im, since in that case the inconsistency may or may not be filtered out. It might be easier to comprehend this with an analogy: suppose a production line produces a set of 10000 products in a shift; we sample 100 of them to assess the quality of the whole set, and a defective product is noticeable only when it falls into that subset of 100 samples.

From Extension #3 we see that inconsistency can be stored in a variable. This leads to the third situation, the second way that a predicate read can be affected by inconsistency; it is demonstrated in Extension #4. The difference between the second way and the first can be described by the following analogy: view a predicate read as a black box; then to affect its output, you may either affect the input, which is the first way, or affect the black box itself, which is the second way.

There is more to storing inconsistency in a variable. If the variable is used in a predicate read of an update or a delete statement, we have the third situation we've just described. If it is used in a deterministic function to write to a field, then we have the fourth situation, and Extension #3 is such an example. Also, if in Example 19 the selected info is stored in a variable before it is written to the field 'total accounts'(say, the original version uses a sub-query so that the selected info is fed into the update statement), it is interpreted as the fourth situation. So Example 19 can now be interpreted either way, as the second situation or the fourth, depending on the syntax. This applies to Extension #1 too: it may be interpreted as the fourth or the first situation, depending on whether it uses a variable or not.

The process of identifying which data objects to access sometimes contains a cursor besides the predicate read. If the variable containing the inconsistency can affect this cursor and the cursor is an updatable one for an update or a delete statement, we have the fifth situation. For example, a variable like :namecnt in Example 20 that affects a branching statement for the cursor would be such a case if the cursor turned out to be an updatable one. Also, if the cursor is a scrollable one, its fetch statement could take the following form:


      FETCH ABSOLUTE n FROM cursor_name;

      or

      FETCH RELATIVE n FROM cursor_name;.                                                                                                    
	

Inconsistency in the variable n could also affect the process of identifying which data objects to access, and it is considered the fifth situation if the cursor happens to be an updatable one.

If, on the other hand, the variable is used in a branch as demonstrated in Extension #5 and Extension #6, but not in the fifth situation, we have the sixth situation. The only difference between these two extensions is that the inconsistency comes from an item read in one of them and from a predicate read in the other.

Actually, there can be even more complex cases: after inconsistency is stored in a variable, the variable is used in a predicate read as in the third situation to retrieve a field, this field is in turn stored in another variable, and this second variable is used in a deterministic function to write a field. Before the final field write this process could repeat multiple times with more variables involved. In other words, inconsistency can be passed on in the form of a variable chain before touching down on a field in the database, and each of these passes is just like the third, the fourth, the fifth or the sixth situation except that a new variable, instead of a field, is written. But if we just look at the last variable(assuming the inconsistency is still visible in this last variable), the write can always be reduced to one of the last four situations.
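The variable chain can be sketched in a few lines (plain Python, invented tables and values): inconsistency enters through an item read, is passed from one variable to the next by an ordinary mapping, and the last variable's field write, the fourth situation here, carries it into a new field.

```python
# Hypothetical tables; -1 in f is the inconsistent value that gets read.
table_t2 = {"r2": {"f": -1}}
table_t4 = {"r4": {"f2": None}}

# Step 1: item read of f into var1 -- inconsistency enters the chain.
var1 = table_t2["r2"]["f"]

# Step 2: pass it on through an ordinary mapping, var1 -> var2.
var2 = var1 * 100

# Step 3: the LAST variable is used in a deterministic function to write
# a field -- the fourth situation -- and the inconsistency ramifies to f2.
table_t4["r4"]["f2"] = var2 + 1

assert table_t4["r4"]["f2"] == -99
```

Only step 3 determines which of the last four situations applies; steps 1 and 2 are just links in the chain, as the classification above describes.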

Notice that the case in Example 20 is not covered by the last four situations, since the process it affects is not from an update statement and hence :namecnt can't be the last variable in the chain. But it can be a variable in the chain in general. Also, since NDB Cluster supports neither updatable nor scrollable cursors, the fifth situation will never show up in the application to NDB Cluster in this article. But for systems that do support them, including the fifth situation is necessary.

This classification is summarized in the following

Lemma 1: Inconsistency ramifies from field to field along the following path: it starts with a predicate read or an item read and ends with a field write; in between there may be a chain of variables such that inconsistency ramifies from one variable to the next. If this chain is absent, the path reduces to the first or the second situation; otherwise we use the last variable in the chain to define the third, the fourth, the fifth or the sixth situation.

Proof: Based on the structure of a transaction described before Example 1, a transaction consists of SQL statements and ordinary programming language statements. When inconsistency from a read is stored in a variable chain, if the last variable in the chain doesn't directly affect a SQL statement(doesn't appear in one), it may affect branching statements of a cursor in the process of identifying which data objects to access for an update or a delete, which is the fifth situation; or it may affect the final field write by affecting a branching statement that encloses that write, which is the sixth situation. If, on the other hand, the last variable affects a SQL statement directly(appears in one), it may do so either through a predicate read, which is the third situation; or through the value written to a field, which is the fourth situation; or by affecting a fetch statement of a cursor in the process of identifying which data objects to access for an update or a delete, which is the scrollable cursor case of the fifth situation. When information from a read is not stored in a variable, it is just either the first or the second situation.


	                                                                                                       ## 
	  

Remark 1: One could actually view the third, the fifth and the sixth situations as similar to the fourth one, as follows. In the third situation, if a variable affects the predicate read, a value in the variable is first used to identify a predicate read (the map M), and the predicate read is used to identify the matching set. The matching set then decides which tuples to update. Notice that here the variable doesn't decide the values to be updated (another variable could do that job). Its role is more like a variable in a piecewise function that decides which definition the function assumes under different circumstances. In the sixth situation, a branch is basically a map that decides which set of instructions is executed and which is not. In our specific case, this set of instructions represents a set of database update statements. It decides which set (like the empty or a non-empty set, as in Extension #5 and Extension #6) of tuples to update, just like the third situation. The fifth situation is the other way to affect the process of identifying which data objects are to access. The map here is one that maps the variable to a subset of the matching set of the associated predicate read.

Remark 2: For the way inconsistency passes from one variable x to the next variable y inside the chain of variables, there is a similar breakdown. If the passing doesn't involve any SQL statement, there are two cases. Case 1: it is just a regular mapping, so that y is the result and x is a variable of this map. Case 2: x is used in the branching condition of a branch and y is modified in that branch. If on the other hand it involves SQL statements, there are four cases. Case 1: x is used in the process of identifying which data objects are to access (in the predicate read or the cursor that follows) of a select, and one of the selected fields is used in a mapping whose result is stored in y. Case 2: x is used in the process of identifying which data objects are to access (in the predicate read or the cursor that follows) of a select, an update or a delete, and a characteristic of the matching set, like the cardinality, is used in a mapping whose result is stored in y. Case 3: x is used in the branching condition of a branch and one of the selected fields of a select in this branch is used in a mapping whose result is stored in y. Case 4: x is used in the branching condition of a branch and a select, an update or a delete is executed in this branch; a characteristic of the matching set, like the cardinality, is used in a mapping whose result is stored in y. And of course, all these cases can again be interpreted as a map, as in Remark 1.
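The variable-to-variable cases in Remark 2 amount to a small taint analysis over assignments and branches. The following is a minimal sketch of that idea, assuming a toy program representation (the representation and all names are hypothetical illustrations, not part of the framework):

```python
# Minimal taint-propagation sketch for the variable chain in Remark 2.
# A program is a list of ('assign', target, sources) statements, plus
# ('branch', guard_vars, assigns), where every assignment inside the
# branch becomes tainted whenever a guard variable is tainted (case 2).

def propagate(statements, tainted):
    tainted = set(tainted)
    changed = True
    while changed:          # iterate until no new variable gets tainted
        changed = False
        for st in statements:
            if st[0] == 'assign':
                _, target, sources = st
                if target not in tainted and tainted & set(sources):
                    tainted.add(target)
                    changed = True
            else:           # ('branch', guards, inner assignments)
                _, guards, inner = st
                if tainted & set(guards):
                    for _, target, _ in inner:
                        if target not in tainted:
                            tainted.add(target)
                            changed = True
    return tainted

# x carries inconsistency; it reaches y through the chain x -> a -> y and
# reaches z through a branch guarded by a; w is never affected.
prog = [
    ('assign', 'a', ['x']),
    ('assign', 'y', ['a']),
    ('branch', ['a'], [('assign', 'z', ['c'])]),
    ('assign', 'w', ['c']),
]
assert propagate(prog, {'x'}) == {'x', 'a', 'y', 'z'}
```

The fixed-point loop mirrors how inconsistency ramifies down a chain: each pass extends the tainted set by one more mapping or branch until nothing new is reachable.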

Remark 3: The structure in Lemma 1 – the leading read, the trailing write and the possible sequence in between – is a generic one. It describes how a field (the one read) affects another (the one written) in a transaction and hence in a history in general. Lemma 1 only emphasizes the case when inconsistency is visible in the field read and the field written. One can also see how the field read affects a variable in the chain if one is present in this structure. We'll use this generic structure to define a ramification operator R shortly.

There is one more issue though. Everything above is interpreted as a map of some sort, which is certainly valid if the functions involved are deterministic. What if non-deterministic functions are involved in the process in question? Can inconsistency ramify through them? In practice, the so-called random functions are pseudo-random, and thus really deterministic functions of their seed, so we are fine.
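As a minimal illustration of that point, assuming Python's random module as a stand-in for any pseudo-random generator:

```python
import random

# A pseudo-random generator is a deterministic function of its seed: the
# same seed always yields the same sequence. So if a tainted value reaches
# the seed, the "random" output is still a deterministic function of that
# value, and inconsistency can ramify through it just like any other map.
def pseudo_random_from(tainted_value):
    rng = random.Random(tainted_value)  # seed derived from a tainted value
    return rng.randint(0, 10**9)

assert pseudo_random_from(42) == pseudo_random_from(42)  # reproducible
```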

Notice the discussion surrounding Lemma 1 is a syntactic approach, and this comes with two implications. First, if in the future the syntax of SQL changes, we might need to adjust Lemma 1 and Remark 2 so that they remain correct. For example, in the SQL:2003 standard, an updatable cursor is accessed with the following statements after a fetch statement has positioned the cursor on a specific row:


      UPDATE table_name
         SET …
         WHERE CURRENT OF cursor_name;

      or

      DELETE FROM table_name
         WHERE CURRENT OF cursor_name;
	

If in the future the syntax were changed to something like


      UPDATE table_name
         SET …
         WHERE ABSOLUTE n FROM cursor_name;                                                                                                  
	

with the original fetch statement being skipped, we would need to add this case to the fifth situation, since variable n could then contain inconsistency.

Besides this imaginary example, let's look at a more realistic one. MySQL recently started to support derived tables, a kind of sub-query in the FROM clause of the outer query. The following statement is one such example.


      SELECT sb1 INTO :y, sb2, sb3
         FROM (SELECT s1 AS sb1, s2 AS sb2, s3=:x AS sb3 FROM t1) AS sb
         WHERE sb1 > 1;                                                                                                    
	

In this statement, if inconsistency is visible in variable x, it could ramify to variable y. Such a statement does not show up in the current framework, since we've assumed every statement involves only one table. But if we were to extend this framework to more complex statements like joins, sub-queries and unions, the previous statement would have to be included. Then none of the four cases in Remark 2 involving a SQL statement applies here, since x is not in the process of identifying which data objects are to access of the outer query; it is in that process for the sub-query, but the result of the sub-query is not stored in y. So we would have to provide extra cases in Remark 2 to cover this. Right now, the outer operation of a derived table must be a query. If in the future MySQL were to support update and delete as the outer operation, a new situation would have to be added to Lemma 1 as well.

Second, some dialects of SQL introduce non-standard syntax into cursor-related statements, and Lemma 1 might need to be adjusted to reflect that. For example, the following snippet demonstrates the use of an updatable cursor in Oracle, and it doesn't resemble the SQL:2003 syntax at all.


      DECLARE
        -- customer cursor
        CURSOR c_customers IS 
            SELECT 
                customer_id, 
                name, 
                credit_limit
            FROM 
                customers
            WHERE 
                credit_limit > 0 
            FOR UPDATE OF credit_limit;
        -- local variables
        l_order_count PLS_INTEGER := 0;
        l_increment   PLS_INTEGER := 0;
    
      BEGIN
        FOR r_customer IN c_customers
        LOOP
            -- get the number of orders of the customer
            SELECT COUNT(*)
            INTO l_order_count
            FROM orders
            WHERE customer_id = r_customer.customer_id;
            -- 
            IF l_order_count >= 5 THEN
                l_increment := 5;
            ELSIF l_order_count < 5 AND l_order_count >=2 THEN
                l_increment := 2;
            ELSIF l_order_count = 1 THEN
                l_increment := 1;
            ELSE 
                l_increment := 0;
            END IF;
        
            IF l_increment > 0 THEN
                -- update the credit limit
                UPDATE 
                    customers
                SET 
                    credit_limit = credit_limit * ( 1 +  l_increment/ 100)
                WHERE 
                    customer_id = r_customer.customer_id;
            
                -- show the customers whose credits are increased
                dbms_output.put_line('Increase credit for customer ' 
                    || r_customer.NAME || ' by ' 
                    || l_increment || '%' );
            END IF;
        END LOOP;
    
        EXCEPTION
            WHEN OTHERS THEN
                dbms_output.put_line('Error code:' || SQLCODE);
                dbms_output.put_line('Error message:' || sqlerrm);
                RAISE;
            
      END;                                                                                                     
	

Definition 12: Given a database application, for arbitrary base set B of columns, a ramification operator R of B is

R(B) := {c | c is in B or is a column in which a field is written in a process that starts with an item read of fields in B or a predicate read whose 'decision set' intersects with B, and with a possible chain of variables we've described in Lemma 1 in between}.

As we can see from the definition, B is always a subset of R(B). If we apply R to B repeatedly, we get a sequence B, R(B), R(R(B)), … in which each set is a subset of the next. This expansion has to stop at some point since the set of columns in the database is finite. We call the union of all members in the sequence Ram(B), the ramification set of B. Ram(B) represents the set of columns that could be affected by columns from B when the application executes.
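The expansion B, R(B), R(R(B)), … can be computed as a least fixed point once the column-to-column dependencies of the application have been extracted. A minimal sketch, assuming those dependencies are given as (source, target) edges (the edge set and column names here are hypothetical):

```python
def ramification_set(base, edges):
    """Least fixed point of the ramification operator R.
    base:  the base set B of columns.
    edges: (src, dst) pairs meaning a field write to column dst can be
           affected by a read of column src (via the path in Lemma 1)."""
    ram = set(base)
    while True:
        grown = ram | {dst for src, dst in edges if src in ram}
        if grown == ram:    # R(Ram(B)) == Ram(B): the expansion stopped
            return ram
        ram = grown

# Hypothetical dependency edges: 'a' feeds 'b', 'b' feeds 'c', plus an
# unrelated edge 'd' -> 'e' that the closure of {'a'} never reaches.
edges = {('a', 'b'), ('b', 'c'), ('d', 'e')}
assert ramification_set({'a'}, edges) == {'a', 'b', 'c'}
```

Termination is exactly the argument in the text: each iteration only adds columns, and the set of columns is finite.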

Lemma 2: For a database application and an arbitrary base set B, Ram(B) doesn't affect writes in its complement Ram(B)'.

Proof: From the arguments before this lemma, we have R(Ram(B)) = Ram(B) and hence Ram(Ram(B)) = Ram(B). The conclusion follows.


	                                                                                                       ## 
	  

Example 21: In this example we apply the Ramification Theorem to the TPCC application to constrain inconsistencies related to the datetime fields. We start by calculating the ramification set. In the TPCC application, let B = {c_since, c_data, h_date, o_entry_d, ol_delivery_d} be the set of all columns depending on a datetime value. From the implementation of the transactions in the TPCC specification, we can see that none of these columns ramifies to new columns. So we have Ram(B) = B.


	                                                                                                       (To be continued...)                    
	  

Remark: In the description of the transactions in the TPCC specification, c_data doesn't depend on a datetime value. So according to the description, B = {c_since, h_date, o_entry_d, ol_delivery_d} and Ram(B) = B. We stick to the implementation, not the description, in this case so that Ram(B)' is absolutely consistent.

3.3 Split of transactions to Ram(B)'

Next we try to restrict transactions to Ram(B)' and we will use examples to demonstrate what exactly this means.

Let's first consider the following update statement in a transaction:


      update t set col3=..., col4=..., col5=... where col1=... and col2=...;
	

Assume at least one of col3, col4 and col5 is in Ram(B)'. Suppose col3 and col4 belong to Ram(B)' but col5 belongs to Ram(B). By Lemma 2, col1 and col2 must both be in Ram(B)'. We split the statement into the following two:


      update t set col3=..., col4=... where col1=... and col2=...;

      update t set col5=... where col1=... and col2=...;
	

We call the first statement the split of the original statement to Ram(B)', since all the fields involved are in Ram(B)'. This operation is designated as a type I split operation and the resulting statement as a type I split statement. If all of col3, col4 and col5 are in Ram(B)', no action is needed since the statement is already split. The case where all of col3, col4 and col5 are in Ram(B) will be handled shortly.

Now let's look at the following query statement:


      select col3, col4, col5 from t where col1=... and col2=...;
	

Assume at least one of col3, col4 and col5 is in Ram(B)'. Notice that, unlike the type I case, some of the selected fields being in Ram(B)' doesn't necessarily force col1 and col2 to be there too. But by Lemma 2, if col1 or col2 is in Ram(B), the selected fields won't affect writes in Ram(B)' at all, even if they are in Ram(B)', since otherwise Ram(B) could be further ramified. And for the same reason, the selected fields in Ram(B) won't affect writes to Ram(B)' in any case. So if we want to split those reads that can affect writes to Ram(B)', we only need to consider statements with both col1 and col2 in Ram(B)' and isolate the selected fields in Ram(B)'. Again assume that col3 and col4 belong to Ram(B)' but col5 belongs to Ram(B). Then we split the statement into the following two:


      select col3, col4 from t where col1=... and col2=...;

      select col5 from t where col1=... and col2=...;
	

We call the first statement the split of the original statement to Ram(B)'. This operation is designated as a type II split operation and the resulting statement as a type II split statement. It is the part of the original statement that could affect writes to Ram(B)'. Again, if all of col3, col4 and col5 are in Ram(B)', the statement is already split. The case where all of col3, col4 and col5 are in Ram(B) will be handled shortly.

Remark: Speaking of splitting select statements, in an update statement like 'update t set col=col+1 where …', there is a hidden select statement 'select col from t where...' inside. We certainly could split it and classify it as a new type of split statement. But it is OK just to leave it in the type I context since the split of an update statement has done the job.

For a deletion, its 'decision set' can't contain any field in Ram(B) by Lemma 2 unless all the deleted fields are in Ram(B). The latter case certainly can't affect writes in Ram(B)'; we designate the former case as a type III split operation, and its split consists only of the deleted fields in Ram(B)'. The resulting statement is designated as a type III split statement.

For an insertion, only the inserted fields in Ram(B)' are considered to be writes that are split from the original statement. This operation is designated as a type IV split operation and the resulting statement is of course designated as a type IV split statement. In general, for statements without a predicate read(without a WHERE clause), the split consists of all the accessed fields in Ram(B)' for all four types since Ram(B) doesn't have a chance to affect a predicate read.

For statements that only consist of a predicate read like 'select COUNT(*) from …', similar to the type II case, we should only consider those statements whose 'decision set' lies completely in Ram(B)'. Although this type of statement is already split, we call it a type V split operation and the resulting statement a type V split statement anyway.

Now let's revisit the following query statement as in type II split operation case:


      select col3, col4, col5 from t where col1=... and col2=...;
	

This time we assume that all of col3, col4 and col5 are in Ram(B). In this case, if both col1 and col2 are in Ram(B)', the predicate read can still affect writes in Ram(B)', for example through its cardinality as described after Remark 2 of Lemma 1. Since the current syntax of SQL doesn't allow us to isolate the predicate read from the item reads that follow, we have to include the whole statement as the split statement, though only its predicate read matters of course.

Similar arguments apply to type I and type III operations too. We designate all three as a type VI split operation and the resulting statement is of course designated as a type VI split statement. This is the only type of split statement that includes fields in Ram(B). Notice that in the type I, type II or type III split operation case, if it is the predicate read that could affect writes in Ram(B)', it is already included in the split statement and nothing else needs to be done.

Remark: We've just discussed how to split six relatively simple types of statements here. This means to apply the Ramification Theorems we'll present, we have to re-write our transactions to consist of only these six simple types of statements. Fortunately most, if not all, SQL statements can be re-written into these six simple types.

We call these six types of operation the split operations and the resulting statements the split statements. Also, the split of a set of SQL statements (for example, the set inside a transaction) is simply the collection of the splits of the individual statements in that set.
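The six split operations can be sketched as a small classifier, assuming simple one-table statements encoded as records (the encoding and all names are hypothetical; cursors and variable chains are ignored here):

```python
def split_to_complement(stmt, ram):
    """Split of one simple statement to Ram(B)'; returns None when the
    statement contributes nothing. stmt is a dict: 'kind' is one of
    'update' | 'select' | 'delete' | 'insert' | 'count'; 'where' lists
    the columns of the predicate read; 'set'/'cols' list the written or
    read columns. ram is Ram(B)."""
    where_clean = not (set(stmt.get('where', [])) & ram)
    if stmt['kind'] == 'update':                  # type I, or type VI
        kept = [c for c in stmt['set'] if c not in ram]
        if kept:
            return {'type': 'I', 'set': kept, 'where': stmt['where']}
        return {'type': 'VI', **stmt} if where_clean else None
    if stmt['kind'] == 'select':                  # type II, or type VI
        if not where_clean:
            return None      # by Lemma 2 its reads can't affect Ram(B)'
        kept = [c for c in stmt['cols'] if c not in ram]
        if kept:
            return {'type': 'II', 'cols': kept, 'where': stmt['where']}
        return {'type': 'VI', **stmt}  # only the predicate read matters
    if stmt['kind'] == 'delete':                  # type III, or type VI
        kept = [c for c in stmt['cols'] if c not in ram]
        if kept:
            return {'type': 'III', 'cols': kept, 'where': stmt['where']}
        return {'type': 'VI', **stmt} if where_clean else None
    if stmt['kind'] == 'insert':                  # type IV, no predicate
        kept = [c for c in stmt['cols'] if c not in ram]
        return {'type': 'IV', 'cols': kept} if kept else None
    if stmt['kind'] == 'count':                   # type V
        return {'type': 'V', **stmt} if where_clean else None

# col5 is in Ram(B): the update splits to a type I statement on col3, col4.
split = split_to_complement(
    {'kind': 'update', 'set': ['col3', 'col4', 'col5'],
     'where': ['col1', 'col2']}, ram={'col5'})
assert split == {'type': 'I', 'set': ['col3', 'col4'],
                 'where': ['col1', 'col2']}
```

The split of a transaction or a history is then just this function mapped over its statements, with the None results dropped.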

In a transaction, these split statements, cursors in the process of identifying which data objects are to access, other branches and other non-SQL statements collectively form the split of the transaction to Ram(B)'. Given a history H, we call the collection of the splits of the member transactions of H to Ram(B)' the split of H to Ram(B)'. For the split of H, we may consider only the conflicts between split statements, based on the following observations:

  1. All field writes in Ram(B)' are written by type I, type III and type IV split statements.
  2. Fields read in type I and type II split statements are the only kind of item reads that can affect field writes in Ram(B)', although not every one of them does. When they do, they might be through the first, the third, the fourth, the fifth or the sixth situation discussed above(To be specific, type I is through the first situation, while type II is through the first, the third, the fourth, the fifth or the sixth situation). Predicate reads in type I, type II, type III, type V and type VI split statements are the only kind of predicate reads that can affect field writes in Ram(B)', although not every one of them does. When they do, they might be through the second, the third, the fourth, the fifth or the sixth situation discussed above(To be specific, type I and type III are through the second situation, while type II, type V and type VI are through the third, the fourth, the fifth or the sixth situation).
  3. For any branch that can affect field writes to Ram(B)', its branching condition cannot be affected by Ram(B). In other words, its branching condition can't contain a variable which might store inconsistency originating from Ram(B). Similarly, a variable in a cursor in the process of identifying which data objects are to access that can affect writes to Ram(B)' can't be affected by Ram(B) either. Consequently, it can only be affected by Ram(B)'.
  4. Reads in the split to Ram(B)' that can't affect writes to Ram(B)' can be affected by Ram(B) in general. There is also a breakdown of how Ram(B) can affect reads in the split to Ram(B)', like the six situations in Lemma 1: it starts with a read in Ram(B), inconsistency is stored in a variable, passed down a chain of variables as usual, and eventually affects a read in Ram(B)'. Looking at the last variable in the chain, the first type acts through the process of identifying which data objects are to access (the predicate read or the cursor that follows), just like the third or the fifth situation except that it doesn't end with a write. The second type is when the last variable affects a branch enclosing the target read; this is like the sixth situation except that it doesn't end with a write. The reads in the split to Ram(B)' that can't affect writes to Ram(B)' but can be affected by Ram(B) belong to type II, type V or type VI split statements.

Observations 1 and 2 just follow from the definition of the split operations and Lemmas 1 & 2. Observation 3 does require some explanation.

From Remark 3 of Lemma 1 we know that the starting read, the trailing write and the possible chain of variables in between form a generic structure. Hence a branch that can affect field writes to Ram(B)' just means a branch in such a structure, one that touches down with a write to a field in Ram(B)'. If this structure also started with a read from Ram(B), it would form the path described in Lemma 1 and Ram(B) would affect Ram(B)', which contradicts Lemma 2. A similar argument holds for a variable in a cursor in the process of identifying which data objects are to access that could affect writes to Ram(B)'. Hence observation 3 follows.

Observation 4 should be self-explanatory. Notice that a select could have selected fields in Ram(B)' while the predicate read contains fields in Ram(B). This kind of select is not included in observation 4 since an item read in Ram(B)' is not the same as an item read in the split to Ram(B)'.

3.4 The Ramification Theorems

With these four observations, if no split statements as in observation 4 exist, the split on Ram(B)' executes as if Ram(B) didn't exist; if on the other hand, split statements as in observation 4 actually exist, the updates to Ram(B)' are still completely decided by the split. Now we are ready to present the following

Ramification Theorem: For base set B such that Ram(B)' is not empty, suppose all the conditions are satisfied as in type B/C of the Serializability Theorem except in Condition 7 we only consider conflicts between statements in the splits of H to Ram(B)'. Then there are two cases. Case 1: if no split statements as in observation 4 exist, a conflict loop doesn't exist in the split statements iff the split of H on Ram(B)' is equivalent to a serial execution of itself. Case 2: if split statements as in observation 4 do exist, name the set of split of H on Ram(B)' minus the type II, type V and type VI split statements in observation 4 to be C, as in consistency; then C doesn't contain a conflict loop iff C is equivalent to a serial execution of itself.

Proof: From observations 1, 2 and 3, we know that the field writes to Ram(B)' and the field reads, predicate reads, cursors in the process of identifying which data objects are to access, other branches and non-SQL statements that could affect writes to Ram(B)' are all included in the split of H. If split statements as in observation 4 are absent, reads in the split can only be affected by Ram(B)'. So we may apply arguments similar to those for the Serializability Theorem to Ram(B)' and the split to arrive at the conclusion for case 1. Notice that in this process the branches that affect neither writes nor reads in Ram(B)' are irrelevant. If on the other hand split statements as in observation 4 are present, some reads in the split that can't affect writes can be affected by Ram(B) and could become inconsistent. After we exclude them, we get the conclusion of case 2. This is just the basic idea of the Generalized Serializability Theorems re-applied here, since the excluded reads can't affect writes to Ram(B)' anyway.


	                                                                                                       ## 
	  

We may actually prove that any field written on Ram(B)' can be expressed as a function where the variables contain only fields from Ram(B)' and constants in both cases. When we say 'C doesn't contain a conflict loop', we mean that there is no conflict loop in the set of associated transactions of C for conflicts caused by reads and writes in C. And 'a serial execution of itself' can be achieved by executing the set of associated transactions serially.
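Operationally, checking that 'C doesn't contain a conflict loop' is a cycle test on the directed conflict graph of the associated transactions. A minimal sketch, assuming the conflicts between split statements have already been collected as edges (all names are hypothetical):

```python
def has_conflict_loop(transactions, conflicts):
    """transactions: iterable of transaction ids; conflicts: (t1, t2)
    pairs meaning some statement of t1 conflicts with, and precedes, a
    statement of t2. Returns True iff the conflict graph has a cycle."""
    graph = {t: [] for t in transactions}
    for a, b in conflicts:
        if a != b:
            graph[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {t: WHITE for t in graph}

    def dfs(t):
        color[t] = GRAY
        for u in graph[t]:
            if color[u] == GRAY or (color[u] == WHITE and dfs(u)):
                return True               # back edge: a conflict loop
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in graph)

# T1 -> T2 -> T3 -> T1 is a conflict loop; dropping one edge breaks it,
# and the split is then equivalent to a serial execution of itself.
assert has_conflict_loop(['T1', 'T2', 'T3'],
                         [('T1', 'T2'), ('T2', 'T3'), ('T3', 'T1')])
assert not has_conflict_loop(['T1', 'T2', 'T3'],
                             [('T1', 'T2'), ('T2', 'T3')])
```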

Remark 1: When B is empty, case 2 will not show up and the Ramification Theorem becomes the Serializability Theorem. So the Ramification Theorem is a generalization of the Serializability Theorem.

The fact that the proof is similar to those of the Serializability Theorems is then no surprise. The key is to isolate the statements that could affect the database portion represented by Ram(B)', which translates into the six types of split statements; the rest is taken care of by the definition of Ram(B)'. Notice that if we were still interpreting conflicts at the tuple level, it would hardly be natural to arrive at the definition of the split operations, and we probably wouldn't have discovered the Ramification Theorem. This is one more piece of evidence that taking everything down to the field level is beneficial. In particular, we don't have a Ramification Theorem in the type A setting of the Serializability Theorem unless we introduce concepts like a sub-tuple that covers the fields of a table in Ram(B)'. We are not going to do that since, again, type A is for demonstration only.

Remark 2: From the proof of the Ramification Theorem we know that the split statements contain all item writes to Ram(B)'. They don't contain all item reads to Ram(B)' though. The field reads in a query with 'decision set' of the predicate read intersecting Ram(B) may all be in Ram(B)', but the query itself can't give rise to any split operations. These item reads may contain inconsistencies in general.

Corollary: If we place prevention measures on the split or C such that the conflicts between the split statements won't form a loop and the portion of the database represented by Ram(B)' started in a consistent state, H will leave it consistent.

Proof: It follows directly from the Ramification Theorem.


	                                                                                                       ## 
	  

Example 21 (Continuation...): Now let's have a look at the split statements in the TPCC application by examining those that are not. First, we can see from the conflict tables in Example 16 that there is no predicate read whose 'decision set' intersects with Ram(B). This not only implies that all predicate reads are in the split to Ram(B)', but also means that whether a statement is in the split depends entirely on the fields accessed. As for field access to Ram(B): for writes, we have insertions on o_entry_d and ol_delivery_d (NULL value) in the new-order transaction and h_date in the payment transaction, and updates on ol_delivery_d in an execution path of the delivery transaction and c_data in the payment transaction; for item reads, we have c_since and c_data in the payment transaction and o_entry_d and ol_delivery_d in the order-status transaction. The rest are all split statements of types I – V. Besides these, the select and the update of c_data in the payment transaction are both type VI split statements. These are all the split statements.

Also, split statements as in observation 4 don't exist, since item reads of fields in Ram(B) are not used in any variable chain that ends in a branch or in a process of identifying which data objects are to access. Hence case 1 of the Ramification Theorem applies here. This implies that the prevention measures we've placed in the TPCC application guarantee consistency in Ram(B)', where Ram(B) = {c_since, c_data, h_date, o_entry_d, ol_delivery_d}. Actually, we've placed more than enough prevention measures there: we may remove the prevention measures for item RW conflicts on field c_data in the p->p case and on field ol_delivery_d in the o->d2 case.


	                                                                                                       (To be continued...)                    
	  

The Ramification Theorem is based on type B/C of the Serializability Theorem. The next few theorems combine the Ramification Theorem with type B/C of the Generalized Serializability Theorems.

Ramification Theorem I: Assume the same conditions as in the Ramification Theorem, except that this time we only consider the split statements to Ram(B)' from the Read-Write transactions. Then there are two cases. Case 1: if no split statements as in observation 4 exist, a conflict loop doesn't exist in the split statements iff the split of the set of Read-Write transactions to Ram(B)' is equivalent to a serial execution of itself. Case 2: if split statements as in observation 4 do exist, name the split of the set of Read-Write transactions to Ram(B)' minus the type II, type V and type VI split statements in observation 4 C, as in consistency; then C doesn't contain a conflict loop iff C is equivalent to a serial execution of itself.

Proof: From Generalized Serializability Theorem I, we know that the set of Read-Write transactions executes as if the set of Read-Only transactions didn't exist. The same applies here. The rest is the same as for the Ramification Theorem.


	                                                                                                       ## 
	  

Remark: Ramification Theorem I is a generalization of Generalized Serializability Theorem I.

Corollary: If we place prevention measures on the split of the set of Read-Write transactions to Ram(B)' or C such that the conflicts between the split statements won't form a loop and the portion of the database represented by Ram(B)' started in a consistent state, H will leave it consistent.

Proof: It follows from Ramification Theorem I.


	                                                                                                       ## 
	  

Now we are ready to remedy Theorem 4' and present the following

Theorem 4: With the conditions as in Theorem 4' and providing prevention measures for p->p(without the one for c_data), n->n as described, the TPCC benchmark leaves Ram(B)' in a consistent state if it started consistent.

Proof: It follows directly from the corollary of Ramification Theorem I.


	                                                                                                       ## 
	  
Ramification Theorem II: Assuming the same conditions as in Ramification Theorem I, except this time we consider the split statements to Ram(B)' from the Read-Write transactions and a subset of the Read-Only transactions, named C as in consistency. Then there are two cases. Case 1: if no split statements as in observation 4 exist in C, then a conflict loop doesn't exist in C iff C is equivalent to a serial execution of itself. Case 2: if split statements as in observation 4 do exist in C, name C minus the type II, type V and type VI split statements in observation 4 to be C'; then C' doesn't contain a conflict loop iff C' is equivalent to a serial execution of itself.

Proof: Similar to the proof of Ramification Theorem I.


	                                                                                                       ## 
	  

Remark: Ramification Theorem II is a generalization of Generalized Serializability Theorem II.

Ramification Theorem II is designed for the case when some of the Read-Only transactions are important enough that inconsistencies should not show up in their reads either.

Corollary: If we place prevention measures on C or C' such that the conflicts between the split statements won't form a loop and the portion of the database represented by Ram(B)' started in a consistent state, H will leave it consistent.

Proof: It follows from Ramification Theorem II.


	                                                                                                       ## 
	  

The key observation about the Generalized Serializability Theorems and the Ramification Theorems is that after we rule out those reads that don't affect writes to the database (or a portion of it), we can still prove the rest to be serializable if it is free of conflict loops, and hence consistency in the underlying database is preserved. To push this idea to its extreme, consider a subset R of C as in Generalized Serializability Theorem II which consists of reads that don't affect writes to the database. Notice that any read in a Read-Only transaction qualifies as a member of R here. So although in Generalized Serializability Theorem II we require a Read-Only transaction to be consistent or not as a whole, in the following Generalized Serializability Theorem III we may choose only part of it to be so. But the intended statements in R are usually those in the Read-Write transactions.

Theorem(Generalized Serializability Theorem III): Assuming the same conditions as in Generalized Serializability Theorem II. Let C be the set of Read-Write transactions and some of the Read-Only ones in H, and let R be defined as in the last paragraph. Then C minus R is free of a conflict loop iff C minus R is equivalent to a serial execution of itself.

Proof: Similar to that of Generalized Serializability Theorem II.


	                                                                                                       ## 
	  
Corollary: If C minus R in Generalized Serializability Theorem III doesn't contain a conflict loop, H will leave the database in a consistent state if it started out consistent.

Proof: It follows from Generalized Serializability Theorem III.


	                                                                                                       ## 
	  
Ramification Theorem III: Assuming the same conditions as in Ramification Theorem II, except that this time we consider the split statements to Ram(B)' from C minus R in Generalized Serializability Theorem III and name it C'. Then there are two cases. Case 1: if no split statements as in observation 4 exist in C', a conflict loop doesn't exist in C' iff C' is equivalent to a serial execution of itself. Case 2: if split statements as in observation 4 do exist in C', name C' minus the type II, type V and type VI split statements in observation 4 to be C''; then C'' doesn't contain a conflict loop iff C'' is equivalent to a serial execution of itself.

Proof: Similar to the proof of Ramification Theorem II.


	                                                                                                       ## 
	  

Remark: Ramification Theorem III is a generalization of Generalized Serializability Theorem III.

Corollary: If we place prevention measures on the split of C' or C'' such that the conflicts between the split statements won't form a loop, and the portion of the database represented by Ram(B)' started in a consistent state, then H will leave it consistent.

Proof: It follows from Ramification Theorem III.


	                                                                                                       ## 
	  

In Ramification Theorem III, three types of reads that do not affect writes to Ram(B)' could be excluded. The first type of reads are those from the Read-Only transactions. They don't affect writes at all, so in particular they don't affect writes to Ram(B)'. The second type of reads are those in Read-Write transactions that don't affect writes, and in particular they don't affect writes to Ram(B)'. These two types could both show up in R. The third type of reads are those from observation 4. Although these three types of reads have been filtered out, there could still be reads in the remaining statements that don't affect writes to Ram(B)'. Consider reads in the split of C minus R to Ram(B)' that can't affect writes to Ram(B)' and can't be affected by Ram(B) either, but can affect writes in Ram(B). These reads can't be any of the three types excluded already. So in principle we may develop a Ramification Theorem IV that excludes them too and try to push everything beyond the extreme. But the thing is that we developed the Ramification Theorems to handle the issue with datetime fields, and it is hard to imagine a non-datetime field read affecting a datetime field write. Such reads can, however, affect writes to the ramification set of the datetime fields. But for the purpose of this article, we might want to settle for Ramification Theorem III for now.

Example 21 (Continuation...): Assuming C to be all the statements in the Read-Write transactions in the TPCC application, let's see what operations are in set R as in Generalized Serializability Theorem III and Ramification Theorem III. To do this, we need to trace the sequence described in Lemma 1 in reverse. In what follows, for each transaction, we will fill in more details (highlighted in brown color) about how variables are related as necessary.


        The New-Order transaction(Read-Write):

        select w_tax from warehouse where w_id=:w_id;

        select d_tax from district where d_id=:d_id and d_w_id=:w_id;

        select d_next_o_id from district where d_id=:d_id and d_w_id=:w_id;
		
        :o_id=:d_next_o_id;

        update district set d_next_o_id=:d_next_o_id+1 where d_id=:d_id and d_w_id=:w_id;

        select c_discount, c_last, c_credit from customer 
           where c_w_id=:w_id and c_d_id=:c_d_id and c_id=:c_id;

        insert into orders values(:o_id, :d_id, :w_id, :c_id, :datetime, NULL, :o_ol_cnt, :o_all_local);

        insert into new_order values(:o_id, :d_id, :w_id);

        And for each item we're going to order, we execute the following block of statements in a for loop:

        {    
            select i_price, i_name, i_data from item where i_id=:ol_i_id;

            //n->n
            select s_quantity, s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04,
               s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10
               from stock where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id
               lock in share mode;

            update stock set s_quantity=:s_quantity where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_ytd=s_ytd+:ol_quantity
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            update stock set s_order_cnt=s_order_cnt+1
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;

            if (:ol_supply_w_id !=:w_id) {
               update stock set s_remote_cnt=s_remote_cnt+1
               where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;
            }
			
            :ol_amount = :ol_quantity * :i_price * (1+:w_tax+:d_tax) * (1-:c_discount);

            insert into order_line values(:o_id, :d_id, :w_id, :ol_number, :ol_i_id,
               :ol_supply_w_id, NULL, :ol_quantity, :ol_amount, :ol_dist_info);
        }
		
        //s->n
        select * from t_s_n where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id for update;

        //o->n
        select * from t_o_n where o_d_id=:c_d_id and o_w_id=:c_w_id and o_c_id=:c_id for update;

        //d1->n
        select * from t_d1_n where no_d_id=:d_id and no_w_id=:w_id for update;
		
	  

Although the sequence between the ending field write and the starting read could in general be very complex, in the real world it is usually in its simplest form. We'll elaborate on the sequences for the new-order transaction on a per-updated-field basis.

d_next_o_id:

The updated value of d_next_o_id in the district table is from the following statement:


        select d_next_o_id from district where d_id=:d_id and d_w_id=:w_id;.
	  

It is stored in the variable :d_next_o_id before it is written back to the d_next_o_id field. Hence two dependences arise in this case: one starts with the item read of d_next_o_id and the other starts with the predicate read in the previous statement. They converge in the variable :d_next_o_id and each represents the fourth situation, since the last and only variable in the sequence is used in a deterministic function to update the field d_next_o_id. In other words, inconsistency from either d_next_o_id or fields of the 'decision set' of the predicate read – {d_id, w_id} – might ramify to d_next_o_id through its own path and become visible there.

When d_next_o_id is actually updated, it also depends on the predicate read of the following statement:


        update district set d_next_o_id=:d_next_o_id+1 where d_id=:d_id and d_w_id=:w_id;.
	  

The dependence between this predicate read and the update represents the second situation of course. So there are three dependences that could affect this update in total.

Remark: The previous two predicate reads that d_next_o_id depends on are identical. The obvious question is whether we could view them as one. In Snapshot Isolation the answer is probably a 'yes' since both come from the same snapshot. But in NDB Cluster's Read-Committed isolation level, some other transaction might modify d_id or w_id in between and they could return different matching sets. We include both to make sure it is correct in all implementations of the Read-Committed isolation level. And even if they can be viewed as the same, they still give rise to different dependences in our classification scheme, namely, one as in the fourth situation and the other as in the second situation. We'll see a similar example for item reads shortly.
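
The reverse tracing used throughout this example amounts to taint tracking at the level of host variables: a variable remembers which reads it was derived from, and a field write collects the taints of the variables it uses plus its own predicate read. A minimal sketch of the bookkeeping for the d_next_o_id case, with illustrative class and label names (none of this is part of any real system):

```python
# Variable-level dependence tracing, sketched as taint tracking: each host
# variable carries the set of reads it was derived from, and a field write
# inherits the taints of its inputs plus its own predicate read.
class Tainted:
    def __init__(self, value, reads):
        self.value = value
        self.reads = frozenset(reads)

    def apply(self, fn):
        # A deterministic function of the variable keeps its taint
        # (the fourth situation: the reads affect the written value).
        return Tainted(fn(self.value), self.reads)

# select d_next_o_id from district where d_id=:d_id and d_w_id=:w_id;
d_next_o_id = Tainted(3001, {"item read of d_next_o_id",
                             "predicate read of the select"})

# update district set d_next_o_id=:d_next_o_id+1 where d_id=:d_id and d_w_id=:w_id;
new_value = d_next_o_id.apply(lambda v: v + 1)
write_deps = new_value.reads | {"predicate read of the update"}

# Three dependences affect the write, matching the analysis above.
assert new_value.value == 3002 and len(write_deps) == 3
```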

o_id:

The o_id field in the insertions into the orders and new-order tables also depends on the item read of d_next_o_id in the district table and the first predicate read in the d_next_o_id case. The two dependences between the reads and the insertion represent the fourth situation.

s_quantity:

The updated value of s_quantity in the stock table depends on the item read of itself and the predicate read in the following statement:


        select s_quantity, s_data, s_dist_01,  s_dist_02,  s_dist_03,  s_dist_04,
           s_dist_05,  s_dist_06,  s_dist_07,  s_dist_08,  s_dist_09,  s_dist_10
           from stock where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id
           lock in share mode;.
	  

When it is updated, the update also depends on the predicate read of the update statement. The first two dependences between the reads and the update represent the fourth situation, while the third one represents the second situation.

s_ytd, s_order_cnt and s_remote_cnt:

The updates of s_ytd, s_order_cnt and s_remote_cnt in the stock table each depend on the item read of itself and the predicate read in the update statement (notice that the update of s_remote_cnt happens in a conditional branch; however, the branching condition only depends on parameters passed into the transaction, not on a variable, hence the branch case in the definition of the ramification operation doesn't apply here). The two dependences between the read and the update represent the first and the second situation respectively.

ol_o_id:

The ol_o_id field in the insertion into the order-line table depends on the item read of d_next_o_id in the district table and the predicate read in the d_next_o_id case. The two dependences between the reads and the insertion represent the fourth situation.

ol_amount:

The ol_amount field in the insertion into the order-line table depends on a few reads. We list them one by one:

The item read of i_price in the item table and the predicate read of the following statement:


        select i_price, i_name, i_data from item where i_id=:ol_i_id;.
	  

The item read of w_tax in the warehouse table and the predicate read of the following statement:


        select w_tax from warehouse where w_id=:w_id;.
	  

The item read of d_tax in the district table and the predicate read of the following statement:


        select d_tax from district where d_id=:d_id and d_w_id=:w_id;.
	  

The item read of c_discount in the customer table and the predicate read of the following statement:


        select c_discount, c_last, c_credit from customer 
           where c_w_id=:w_id and c_d_id=:c_d_id and c_id=:c_id;.
	  

All eight dependences between the reads and the insertion represent the fourth situation.

Other updates depend only on parameters passed into the transaction. All the other reads don't affect writes in the transaction and hence could be in set R. They are the item reads in the following statements:


        select c_last, c_credit from customer 
           where c_w_id=:w_id and c_d_id=:c_d_id and c_id=:c_id;

        select i_name, i_data from item where i_id=:ol_i_id;

        select s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04,
           s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10
           from stock where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id;.
	  

There is a subtle difference if we use the following statement to update d_next_o_id instead of the original one:


        update district set d_next_o_id=d_next_o_id+1 where d_id=:d_id and d_w_id=:w_id;.
	  

In this statement, d_next_o_id is read a second time, while the first read happens in the select statement before this update. A natural question in this case is whether both reads affect the write of d_next_o_id. If we were applying the Serializability Theorem to it, the natural answer would probably be a 'yes', since eventually the history has to be equivalent to a serial one. But for Generalized Serializability Theorem III and Ramification Theorem III, things are different, since only part of a history has to be equivalent to a serial execution of itself, as shown by the following example.


	                                                                                                       (To be continued...)                    
	  

Example 22: Let tu be a tuple and f be a field in it. Consider the following pseudo-code executed under NDB Cluster's Read-Committed isolation level.


	            T1                                                               T2

        start transaction;

        select f in tu;

                                                                            start transaction;

                                                                            update f in tu;

                                                                            commit;

        select f in tu again;

        update f in tu;

        commit;                                                                                                               
	  

There is apparently a loop in this history and we can't apply the Serializability Theorem to it. However, if the first select doesn't affect the later update in T1, we may put it in set R so that we can apply Generalized Serializability Theorem III or Ramification Theorem III to conclude that the database state will be consistent after the execution of this history if it started so. An alternative way to deal with it is to consider both reads to be ones that affect the update and put prevention measures in place so that this history doesn't show up and the Serializability Theorem applies.
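
Under the assumption that T1's final update is a deterministic function of the second read only, the interleaving in Example 22 produces the same final state as the serial order T2 then T1; that is exactly what moving the first select into R buys us. A small sketch with the interleaving hard-coded (the +10 and *2 updates are arbitrary stand-ins; nothing here models real NDB Cluster behavior):

```python
# Example 22 replayed: T1 reads f, T2 updates f and commits, T1 reads f
# again and updates f using only the second read.
def run_interleaved(f):
    first_read = f          # T1: select f in tu (candidate for set R)
    f = f + 10              # T2: update f in tu; commit
    second_read = f         # T1: select f in tu again
    f = second_read * 2     # T1: update f in tu (depends on 2nd read only)
    return f

def run_serial_t2_then_t1(f):
    f = f + 10              # T2 runs first
    f = f * 2               # then T1: single read, then update
    return f

# With the first read in R, the interleaving is equivalent to T2;T1,
# so the database portion touched by f stays consistent.
assert run_interleaved(5) == run_serial_t2_then_t1(5)
```

If the first read did affect T1's update, the two runs could diverge, which is why the alternative of keeping both reads and adding prevention measures is also sound.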


	                                                                                                       ##                   
	  

Remark: This is an example which demonstrates that we can improve performance by using the Generalized Serializability Theorems or the Ramification Theorems in place of the Serializability Theorem, if consistency in some reads can be sacrificed too. We may also construct a similar example with predicate reads. In general, NOT viewing two identical reads in a transaction, both item and predicate ones, as the same may give us advantages in performance.

Example 21 (Continuation...): Let's continue our analysis with the payment transaction.


        The Payment transaction(Read-Write):

        select w_street_1, w_street_2, w_city, w_state, w_zip, w_name 
           from warehouse where w_id=:w_id;

        update warehouse set w_ytd=w_ytd+:h_amount where w_id=:w_id;

        select d_street_1, d_street_2, d_city, d_state, d_zip, d_name 
           from district where d_w_id=:w_id and d_id=:d_id;

        update district set d_ytd=d_ytd+:h_amount where d_w_id=:w_id and d_id=:d_id;

        if the customer making the payment is represented by a name{          //60% chances

            select count(c_id) into :namecnt from customer 
               where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;

            //p->p, p->d2
            declare c_byname cursor for
               select c_first, c_middle, c_id, c_street_1, c_street_2, c_city, c_state, 
                  c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
                  from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
                  order by c_first
                  lock in share mode;

            open c_byname;

            if(:namecnt%2) :namecnt++;
            for (n=0;n < :namecnt/2;n++) {
                fetch c_byname
                   into :c_first, :c_middle, :c_id, :c_street_1, :c_street_2, :c_city, :c_state, 
                   :c_zip, :c_phone, :c_credit, :c_credit_lim, :c_discount, :c_balance, :c_since
            }

            close c_byname;
        }
        else if the customer making the payment is represented by an id{      //40% chances
            
            //p->p, p->d2
            select c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, 
               c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
               from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
               lock in share mode;
        }

        update customer set c_balance=c_balance-:h_amount
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_ytd_payment=c_ytd_payment+:h_amount
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_payment_cnt=c_payment_cnt+1
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        if (:c_credit="BC"){

            //p->p
            select c_data from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
               lock in share mode;
			   
            //the + is the string concatenation operator; n is the length of the string concatenated to :c_data.
            //Here the implementation, instead of the description, of the transaction is followed: :h_date
            //and :h_data are also concatenated to :c_data. The prevention measures for the p->p case are
            //also removed for this reason.
            :c_new_data=':c_id, :c_d_id, :c_w_id, :d_id, :w_id, :h_amount, :h_date, :h_data' + :c_data
            :c_data=right-shift(:c_new_data, n), n being the length of the prepended string

            update customer set c_data=:c_data
               where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;
        }

        strncpy(:h_data, :w_name, 10);
        :h_data[10]=' ';
        :h_data[11]=' ';
        :h_data[12]=' ';
        :h_data[13]=' '; 
        strncat(:h_data, :d_name, 10);
        :h_data[24]='\0';		

        insert into history values(:c_d_id, :c_w_id, :c_id, :d_id, :w_id, :datetime, :h_amount, :h_data);
	  

w_ytd:

The update of w_ytd depends on item read of itself and the predicate read of the following statement:


        update warehouse set w_ytd=w_ytd+:h_amount where w_id=:w_id;.                                                                                                                
	  

They represent the first and the second situation respectively.

d_ytd:

The update of d_ytd depends on item read of itself and the predicate read of the following statement:


        update district set d_ytd=d_ytd+:h_amount where d_w_id=:w_id and d_id=:d_id;.                                                                                                                
	  

They represent the first and the second situation respectively.

c_balance:

In the case a customer is identified by a name, the update of c_balance depends on a few reads and represents one of the most complex cases in this transaction. So let's examine it carefully. The first dependence starts with the following predicate read:


        select count(c_id) into :namecnt from customer 
           where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id;.                                                                                                                  
	  

The cardinality of the result set is stored in variable :namecnt. This variable is used in the cursor, as part of the process of identifying which data objects to access, so that the following statement pinpoints the median (the :namecnt/2 th) person in the predicate read:


        select c_first, c_middle, c_id, c_street_1, c_street_2, c_city, c_state, 
           c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
           from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
           order by c_first;.                                                                                                                 
	  

Then the selected value of c_balance is stored in variable :c_balance and is used to update itself. This dependence represents the fourth situation.

The second dependence starts with the predicate read of the previous statement and ends up in c_balance as we've just described. This dependence represents the fourth situation too.

The third dependence starts with the item read of c_balance in the previous statement and ends up in c_balance as we've just described. This dependence represents the fourth situation.

The fourth and fifth dependences start the same as the first and the second ones, except this time c_id is selected in the previous statement and stored in variable :c_id. Variable :c_id is then used in the predicate read of the update statement. These two dependences both represent the third situation.

The sixth dependence starts with the item read of c_id in the previous statement and it is stored in variable :c_id. Variable :c_id is then used in the predicate read of the update statement. This dependence represents the third situation.

The seventh dependence starts with the predicate read of the update statement and ends up in c_balance. This represents the second situation. So in total there are seven dependences that could affect the update of c_balance in this case.
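
As an aside, the count-then-fetch-to-the-median cursor logic that the first two dependences flow through can be sketched in plain Python; the row dictionaries are hypothetical stand-ins for customer tuples, and the rounding matches the `if(:namecnt%2) :namecnt++;` step in the pseudo-code above:

```python
# TPCC customer-by-last-name lookup: count the matches, then walk a
# c_first-ordered cursor namecnt/2 times; the last fetch is the median
# customer (an odd count is rounded up first, as in the pseudo-code).
def pick_median_customer(customers, c_last):
    matches = sorted((c for c in customers if c["c_last"] == c_last),
                     key=lambda c: c["c_first"])
    namecnt = len(matches)
    if namecnt % 2:
        namecnt += 1
    # the for loop fetches namecnt/2 rows; return the last one fetched
    return matches[namecnt // 2 - 1]

rows = [{"c_id": i, "c_last": "BARBAR", "c_first": f}
        for i, f in enumerate(["ANN", "BOB", "CAL"], start=1)]
# Of the three matching customers ordered by first name, BOB is the median.
assert pick_median_customer(rows, "BARBAR")["c_first"] == "BOB"
```

The :c_id of the row returned here is exactly the value that the fourth, fifth and sixth dependences carry into the later update statements.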

In the case a customer is identified by an id, the first dependence starts with the predicate read of the following statement:


        select c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, 
           c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
           from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;.                                                                                                                  
	  

The selected value of c_balance is stored in variable :c_balance and is used to update itself. This dependence represents the fourth situation.

The second and third dependences are just the third and seventh ones in the previous case, when a customer is identified by a name, and they represent the fourth and the second situation respectively. So in total there are three dependences that could affect the update of c_balance in this case.

c_ytd_payment and c_payment_cnt:

In the case that a customer is identified by an id, the updates of c_ytd_payment and c_payment_cnt each depend on the item read of itself and the predicate read of the update statement. They represent the first and second situation respectively.

In the case that a customer is identified by a name, besides the two dependences present when a customer is identified by an id, the predicate read of the update statement depends on the variable :c_id. So there are three more dependences, similar to the fourth, fifth and sixth dependences in the c_balance case. These three dependences represent the third situation too.

c_data:

The first three dependences are similar to those of the d_next_o_id case in the new-order transaction. They start with the predicate read and item read of the following statement:


        select c_data from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
           lock in share mode;                                                                                                                
	  

and the predicate read of the following statement:


        update customer set c_data=:c_data
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;.                                                                                                                  
	  

They represent the fourth, the fourth and the second situation respectively.

In the case a customer is identified by a name, the variable :c_data used to update c_data depends on :c_id. Similar to the fourth, fifth and sixth dependences in the c_balance case, the use of :c_id in the predicate read of the previous select statement gives rise to three dependences that all represent the fourth situation; on the other hand, the use of :c_id in the predicate read of the previous update statement gives rise to three dependences that all represent the third situation.

Besides those, the update of c_data also depends on c_credit through a branch. In the case a customer is identified by a name, c_credit depends on the two predicate reads as in the first two dependences in the c_balance case(when a customer is identified by a name) and on the item read of itself. In the case a customer is identified by an id, c_credit depends on the predicate read as in the first dependence in the c_balance case(when a customer is identified by an id) and the item read of itself. These five dependences all represent the sixth situation.

So in the case a customer is identified by a name, there are in total twelve dependences that end up in c_data; in the case a customer is identified by an id, there are only five of them.

c_id:

When a customer is identified by a name, the insertion of c_id into the history table depends on :c_id, which in turn depends on the two predicate reads and one item read as described in the fourth, fifth and sixth dependences in the c_balance case. These three dependences all represent the fourth situation.

h_data:

The inserted value of h_data into the history table depends on w_name in the warehouse table and d_name in the district table, which in turn depend on the item reads and predicate reads of the following statements respectively:


        select w_street_1, w_street_2, w_city, w_state, w_zip, w_name 
           from warehouse where w_id=:w_id;

        select d_street_1, d_street_2, d_city, d_state, d_zip, d_name 
           from district where d_w_id=:w_id and d_id=:d_id;.                                                                                                                 
	  

All four dependences represent the fourth situation.

All the other item reads could be in R and they are:


        select w_street_1, w_street_2, w_city, w_state, w_zip 
           from warehouse where w_id=:w_id;

        select d_street_1, d_street_2, d_city, d_state, d_zip 
           from district where d_w_id=:w_id and d_id=:d_id;

        select c_first, c_middle, c_street_1, c_street_2, c_city, c_state, 
           c_zip, c_phone, c_credit_lim, c_discount, c_since
           from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
           order by c_first;

        select c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, 
           c_zip, c_phone, c_credit_lim, c_discount, c_since
           from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;.                                                                                                                
	  

The predicate read of the following two statements could also be in R since they are of type VI split statements:


        select c_data from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;

        update customer set c_data=:c_data
           where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id;.                                                                                                                
	  

        The Delivery transaction(Read-Write):
		
        //d1->n
        select * from t_d1_n where no_d_id=:d_id and no_w_id=:w_id 
           lock in share mode;
		   
        //The following read lock can be removed if only one delivery transaction with
        //matching NO_W_ID and NO_D_ID is allowed at any given time
        //d2->d2
        declare c_no cursor for
           select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
              order by no_o_id asc
              lock in share mode;

        open c_no;
 
        fetch c_no into :no_o_id;

        close c_no;

        if the former cursor returns a non-empty set{

            delete from new_order where no_d_id=:d_id and no_w_id=:w_id and no_o_id=:no_o_id;

            //In TPCC's delivery transaction, the previous two statements are expressed as an updatable
            //cursor. Since NDB Cluster doesn't support updatable cursors, I've changed it to the
            //previous two statements which are semantically equivalent.

            select o_c_id from orders where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;

            update orders set o_carrier_id=:o_carrier_id
               where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;

            update order_line set ol_delivery_d=:datetime
               where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;

            select sum(ol_amount) from order_line 
               where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;

            update customer set c_balance=c_balance+:ol_total
               where c_id=:c_id and c_d_id=:d_id and c_w_id=:w_id;

            update customer set c_delivery_cnt=c_delivery_cnt+1
               where c_id=:c_id and c_d_id=:d_id and c_w_id=:w_id;
        }                                                                                                                     
	  

no_d_id, no_w_id and no_o_id:

The delete statement's predicate read depends on variable :no_o_id, which in turn depends on the item read and predicate read of the following statement:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc;                                                                                                                  
	  

These two dependences both represent the third situation.

The deletion of course also depends on the predicate read of the delete statement. This third dependence represents the second situation.

o_carrier_id and ol_delivery_d:

The dependences are similar to those in the no_d_id, no_w_id and no_o_id case, except this time the write is an update, not a delete.

c_balance:

First it depends on variable :c_id in the predicate read of the update statement, which in turn depends on the item read and the predicate read of the following statement:


        select o_c_id from orders where o_d_id=:d_id and o_w_id=:w_id and o_id=:no_o_id;.                                                                                                                  
	  

The predicate read of the previous statement in turn depends on variable :no_o_id, which in turn depends on the item read and predicate read of the following statement:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc;.                                                                                                                   
	  

So there are four dependences here, which depend on the item reads and the predicate reads of the previous two statements, and they all represent the third situation.

The updated value of c_balance also depends on variable :ol_total, which in turn depends on the item read and predicate read of the following statement:


        select sum(ol_amount) into :ol_total from order_line 
           where ol_d_id=:d_id and ol_w_id=:w_id and ol_o_id=:no_o_id;.                                                                                                                  
	  

The predicate read of the previous statement depends on variable :no_o_id, which in turn depends on the item read and predicate read of the following statement:


        select no_o_id from new_order where no_d_id=:d_id and no_w_id=:w_id
           order by no_o_id asc;.                                                                                                                   
	  

So there are four dependence here that depend on the item reads and the predicate reads of the previous two statements and they all represent the fourth situation.

Finally, the update also depends on the item read of itself and on the predicate read of the update statement. These represent the first and the second situation respectively. So there are ten dependencies in total.

c_delivery_cnt:

There are six dependencies ending with the update of c_delivery_cnt, and they are similar to the first, second, third, fourth, ninth and tenth dependencies of the c_balance case in this transaction.

The set R is empty for this transaction.

Remark: This analysis is about d2, not d1, because we are concerned with Read-Write transactions only.

Now we are ready to apply Ramification Theorem III to the TPCC application. In this specific setup, we take C to be the three Read-Write transactions we've just finished analyzing and R to be at its possible maximum.

Everything is the same as when we applied Ramification Theorem I to arrive at Theorem 4, except that this time we will try to squeeze out some performance by sacrificing consistency of the reads in R. We have no luck with the TPCC application though. The following are the statements in the payment transaction containing reads in R for which we might possibly eliminate some of the prevention measures:


        //p->p, p->d2
        declare c_byname cursor for 
           select c_first, c_middle, c_id, c_street_1, c_street_2, c_city, c_state, 
              c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
              from customer where c_last=:c_last and c_d_id=:c_d_id and c_w_id=:c_w_id
              order by c_first
              lock in share mode;

        //p->p, p->d2
        select c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, 
           c_zip, c_phone, c_credit, c_credit_lim, c_discount, c_balance, c_since
           from customer where c_id=:c_id and c_d_id=:c_d_id and c_w_id=:c_w_id
           lock in share mode;
	  

And the following is the only statement in the new-order transaction containing item reads in R for which we might possibly eliminate some of the prevention measures:


        //n->n
        select s_quantity, s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04,
           s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10
           from stock where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id
           lock in share mode;
	  

They all contain other fields for which a read lock is required to fence off conflicts in p->p or n->n. In the first and the second statement it is the c_balance field; in the third statement it is the s_quantity field. So we can't do anything about the locks placed under a tuple-level locking system like NDB Cluster's. In the case where a field-level capable locking system is in place, field locks will be placed on the aforementioned fields, but still no locks can be dropped. So no advantage is gained by applying Ramification Theorem III instead of Ramification Theorem I to the TPCC application.


	                                                                                                       (To be continued...)                    
	  

In general, if there is an item read of a field that doesn't affect writes in a transaction, and there is a write of that field in another transaction, it is possible that we may use Ramification Theorem III to get rid of the read lock on that field. This is demonstrated in the following

Example 23: Consider the following pseudo-code executed under the Read-Committed isolation level of NDB Cluster.


	           T1                                                         T2

        start transaction;

        select f1 from t1;

                                                                    start transaction;

                                                                    update t1 set f1 = …;

                                                                    update t2 set f2 = …;

                                                                    commit;

        update t2 set f2 = …;

        commit;                                                                                                                 
	  

This history apparently contains a conflict loop where an RW conflict runs from T1 to T2 and a WW conflict runs from T2 to T1. So it is not serializable, and a read lock must be placed on f1 if we want to fence off the RW conflict to achieve serializability. Assuming the reading of f1 can't affect the write of f2 in T1, we may place this read in set R and hence conclude, by Ramification Theorem III or Generalized Serializability Theorem III, that this history will result in a consistent database state if it started in one. In this case the read lock on f1 may be removed, and performance may be boosted with one less lock conflict. The only possible sacrifice is the consistency of f1. However, since f1's value comes from a consistent database state, assuming it is not in Ram(B), f1 alone will not demonstrate any inconsistency.
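The conflict-loop test in this example is mechanical, so it can be sketched in a few lines. This is an illustrative sketch of my own (not part of any NDB component): represent each conflict as a directed edge and look for a cycle in the graph; a cycle means the history is not conflict serializable.

```python
# Sketch: detect the conflict loop in Example 23 via DFS cycle detection.
def has_cycle(edges):
    """edges: (from_txn, to_txn, kind). True iff the conflict graph has a cycle."""
    graph = {}
    for src, dst, _kind in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GRAY  # on the current DFS path
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK  # fully explored, no cycle through n
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# T1 reads f1 that T2 later overwrites (RW), T2 writes f2 before T1 does (WW).
edges = [("T1", "T2", "RW"), ("T2", "T1", "WW")]
print(has_cycle(edges))  # True: the history is not serializable
```

With the RW edge removed (the read placed in R and its lock dropped), the remaining graph is acyclic, which mirrors the reasoning above.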


	                                                                                                       ##                   
	  

This is actually a common fact summarized by the following proposition.

Proposition: Among the fields that don't affect writes and are sacrificed in the Generalized Serializability Theorems or in Ram(B)' of the Ramification Theorems, no single one demonstrates inconsistency on its own.

Proof: Such a field always takes its value from a consistent database state, so observing it alone cannot reveal any inconsistency.


	                                                                                                       ## 
	  

So, for us to see inconsistency from these fields, we need at least two of them.

Example 21 (Continuation...): Let's summarize this example by comparing it with the execution of the TPCC application under Snapshot Isolation. Fekete's paper indicates that the TPCC application is serializable under Snapshot Isolation, at least on Ram(B)'. Under the Read-Committed isolation level of NDB Cluster, Ramification Theorem I indicates that the split of the Read-Write transactions to Ram(B)' is serializable, with read locks placed in the p->p and n->n cases to fence off those RW conflicts, while no single field read in Ram(B)' in the Read-Only transactions demonstrates inconsistency.

What about performance? In the cases where read locks in p->p and n->n are necessary under the Read-Committed isolation level of NDB Cluster, the update of c_balance in a pair of payment transactions or the update of s_quantity in a pair of new-order transactions would make either pair serial if they were executed under Snapshot Isolation. Hence the execution of TPCC under the Read-Committed isolation level of NDB Cluster is guaranteed to be no worse than that under Snapshot Isolation, at least as far as the prevention measures are concerned.

Let's back up a little and say that not all Read-Only transactions' consistency can be sacrificed. If the order-status transactions must be guaranteed to be consistent, Ramification Theorem II indicates that prevention measures must be placed to fence off the following RW conflicts: o->n, o->p and o->d2. To fence off o->n, locks must be acquired in table t_o_n. Table t_o_n contains three columns: o_d_id, o_w_id and o_c_id. This means the conflicting new-order transaction has to share the same warehouse, district and customer id with the order-status transaction. But how often does a customer place a new order at the same moment he is checking the status of a previous one? Even the case where his account is hacked and the hacker places an order for him while he is checking status represents a slim chance of occurrence. The situation for the o->p case is similar: it is unlikely that a customer checks the status of an order and makes a payment simultaneously. For the o->d2 case, we could also arrange a delayed delivery transaction to happen at, say, after midnight in that region so that its conflict with an order-status transaction is unlikely. So it seems these prevention measures don't hurt performance too much after all, as long as the lock implementation is lightweight, such as MySQL's.

For the stock-level transaction, the only conflict we must fence off to guarantee its consistency is s->n, by Ramification Theorem II. It is very likely that while a stock-level transaction is in progress, a few new-order transactions conflict with it by modifying the quantity of an item in the stock table, since the stock-level transaction is of medium weight and the last twenty orders are examined in it. If application semantics dictate that stock-level transactions are executed very often, this will certainly generate a negative performance impact. But fortunately we don't necessarily need to guarantee consistency of a stock-level transaction, because in the case of an unusual stock level we may always call the warehouse to confirm, as we mentioned before. So keeping it outside the set C could be a more appropriate choice.


	                                                                                                       ## 
	  

While we are on the topic, I'd like to use the Ramification Theorems to handle an issue in NDB Cluster described by the following example.

Example 24: Assume the same situation as in Example 0, except that id1 is defined as an auto_increment column and table t is empty to start with. Execute transactions T1 and T2 as follows.


	             T1                                                                        T2

        start transaction;

        insert into t (id2, id3, value) values(1,1,1);

                                                                                start transaction;  

                                                                                insert into t (id2, id3, value) values(2,2,2);

                                                                                rollback;

        insert into t (id2, id3, value) values(3,3,3);

        commit;                                                                                             
	  

If we assume the auto_increment column starts with 1, then after the execution, the ‘select * from t;’ statement will return the following two rows:

id1  id2  id3  value
1    1    1    1
3    3    3    3

Since transaction T2 was aborted, the previous history consists of only one transaction, namely T1, and it obviously should be serializable. However, if we execute T1 alone, what we get from the select should be:

id1  id2  id3  value
1    1    1    1
2    3    3    3

The violation of Serializability is caused by the implementation of the auto_increment column: a hidden counter(with initial value 0 in this example) is used to represent the current value of the auto_increment column and whenever a request is made, the value of the counter is incremented and returned to the requesting transaction. This is a lot like the datetime field case for which the Ramification Theorems are tailored except that a datetime field is incremented every time the clock ticks. Specifically, we may imagine a virtual transaction T that updates the counter every time it is requested. So there are three instances of transaction T in this case. Then even if T2 is aborted, there is still a RW conflict between T1 and the first instance of T, a sequence of WW conflicts between the three instances of T, and a WR conflict between the last instance of T and T1. A loop is hence formed and the inconsistency is explained.
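The counter behavior behind this anomaly is easy to reproduce in a toy model. The sketch below is my own illustration (class and variable names are invented, not NDB's implementation): the hidden counter is incremented on every request and is NOT rolled back when the requesting transaction aborts, which is exactly how id1=2 goes missing.

```python
# Toy model of the hidden auto_increment counter from Example 24.
class AutoIncrement:
    def __init__(self):
        self.counter = 0  # hidden counter, initial value 0 in this example

    def next_value(self):
        # The 'virtual transaction T': the counter is updated on every request,
        # independently of whether the requesting transaction later commits.
        self.counter += 1
        return self.counter

table_t = []
ai = AutoIncrement()

id1 = ai.next_value()           # T1: insert (1,1,1)
table_t.append((id1, 1, 1, 1))

aborted = ai.next_value()       # T2: insert (2,2,2), then rollback;
                                # the counter increment survives the abort

id1 = ai.next_value()           # T1: insert (3,3,3)
table_t.append((id1, 3, 3, 3))

print(table_t)  # [(1, 1, 1, 1), (3, 3, 3, 3)] -- id1=2 is skipped
```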


	                                                                                                       ## 
	  

There are two ways to remedy this. The first is NOT to use an auto_increment column but an explicit counter, like d_next_o_id in TPCC, to assign values to id1. Every time we request id1, the counter is updated under a long lock, and the interleaved execution in the last example cannot show up at all. The second is to use the Ramification Theorems to control the damage: if we design our application so that the update of no other field is affected by id1, this inconsistency will remain visible only at column id1.

Another application of the Ramification Theorems is the following. Suppose the set of transactions and the set of columns in an application are represented as T and C respectively, and they can be segmented into two parts, T1 with C1 and T2 with C2, such that T1 operates on C1 only and T2 operates on C2 only. Then we have Ram(C1) = C1 and Ram(C2) = C2, the split of T on C1 is T1, and the split of T on C2 is T2. According to the Ramification Theorems, we may place prevention measures on T1 and T2 separately to guarantee the consistency of C1 and C2 respectively. This way we've partitioned both the database and the application into two parts and can deal with consistency one part at a time(different Ramification Theorems may be applied to C1 and C2 separately). The trivial case is when C1 and C2 correspond to separate sets of tables, and this result coincides with common sense.
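Under my reading of the definition, Ram(.) can be computed mechanically as a least fixed point: Ram(B) is the smallest set containing B that is closed under "a write of column c depends on a read of a column already in the set". The sketch below (dependency format and names are my own assumption, for illustration only) shows why the segmented situation gives Ram(C2) = C2.

```python
# Sketch: Ram(.) as a fixed-point closure over read->write dependencies.
# deps maps each transaction to (read_column, written_column) pairs, meaning
# the write of written_column is affected by the read of read_column.
def ram(b, deps):
    closure = set(b)
    changed = True
    while changed:
        changed = False
        for pairs in deps.values():
            for read_col, write_col in pairs:
                if read_col in closure and write_col not in closure:
                    closure.add(write_col)
                    changed = True
    return closure

# The segmented situation from the text: T1 touches C1 only, T2 touches C2 only.
deps = {
    "T1": [("c1_a", "c1_b")],   # reads and writes stay inside C1
    "T2": [("c2_a", "c2_b")],   # reads and writes stay inside C2
}
print(ram({"c2_a", "c2_b"}, deps))  # {'c2_a', 'c2_b'}: Ram(C2) = C2
```

A cross-partition dependency, by contrast, would pull the written column into the closure, which is exactly the situation the segmentation rules out.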

Let's extend this idea to a not-so-trivial case: proving that the new tables, and the access to them provided as prevention measures in Example 16, will not introduce into the original application any new inconsistency that the original prevention measures can't handle. The newly added tables are like t_s_n, t_o_n and t_d1_n in Example 16 and they are accessed by statements like


	  select * from t_s_n where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id lock in share mode;                                                                                                  
	

and


	  select * from t_s_n where s_i_id=:ol_i_id and s_w_id=:ol_supply_w_id for update;                                                                                                   
	

and the insertion and deletion of rows in these new tables. The selects are added to the original transactions as prevention measures, of course. For the insertion and deletion statements, we have the following observation: TPCC is just the skeleton of a warehousing application, and we don't have a transaction that inserts rows into, for example, the stock table; in a real-world application a transaction like this must be present, and we place the corresponding insertion into t_s_n in it; the same goes for the deletion of rows in t_s_n. We call the resulting set of transactions T'.

In the case the original tables don't contain any datetime fields, we denote them as C1 and these new tables as C2. Then we have Ram(C2)=C2(calculated with T') since there is no sequence as described in Lemma 1 that connects a read in C2 and a write in C1. The split of T' on C1 is just the original application represented by the set of transactions T. Applying the Ramification Theorem to C2, we may place prevention measures on the original application to guarantee consistency of C1, the original tables. This means the extra conflicts introduced by SQL statements on the new tables will not introduce into the original application any new inconsistency that the original prevention measures can't handle. In the case the original tables contain datetime fields, naming the set B as usual, we need to set C1=Ram(B)'(calculated with T) and the rest of the columns to be C2, and then we still have Ram(C2)=C2(calculated with T'). The rest of the argument remains the same.

The fields in C2 are insignificant in both cases: the fact that rows in the newly added tables are never updated between insertion and deletion implies they will function as expected even if inconsistencies can be observed in C2. This argument applies to any application that implements the method demonstrated in Example 16, and we've just proved the following

Theorem 5: The method of using granular locking to prevent predicate RW conflict as demonstrated in Example 16 is sound.

This same paradigm can sometimes help in designing a database application. Besides the tables accessed by clients, we often maintain tables that are accessed by the staff only, which may contain metadata, data for internal use, etc.. For example, if in the TPCC application orders can be deleted by clients, we may implement it like this: instead of deleting an order directly from the orders table, we insert the order into a newly added deleted_orders table without changing the orders table, so that the orders table always contains a record of every order that has been placed. We name the tables accessed by clients C1, the set of transactions for clients to access C1 T1, and the tables accessed only by the staff C2, which includes the deleted_orders table. If we design the application so that the set of transactions used by the staff, named T2, only accesses C2 and the insertions into the deleted_orders table are the only way for T1 to access C2, then we have Ram(C2)=C2(calculated with T1 + T2) and the split of T1 + T2 on C1 consists only of the statements in T1 that access C1. Hence placing prevention measures on these statements alone suffices to guarantee consistency on C1 by the Ramification Theorems. The significance of this design is that no matter how the staff use T2 to access C2, it won't affect the consistency of C1, which is desirable in many cases. In the case C1 contains a set B of datetime fields, a similar argument applies by replacing C2 with C2 + Ram(B) and C1 with Ram(B)'. There is yet another advantage of implementing deletion this way: NOT deleting an order in C1 potentially incurs fewer conflicts in the split of T1 + T2 on C1. The insertion into C2 potentially incurs more conflicts between statements accessing C2, but C2 is for the staff only, and they should be more experienced at handling inconsistencies there.

3.5 Joins, sub-queries and unions

In the TPCC application we break down joins into semantically equivalent SQL statements involving only one table. Now let's take a closer look at them.

The first one is in the new-order transaction as follows:


      select c_discount, c_last, c_credit, w_tax from customer, warehouse
         where w_id=:w_id and c_w_id=w_id and c_d_id=:d_id and c_id=:c_id;
	

We break it down into these two statements:


      select w_tax from warehouse where w_id=:w_id;

      select c_discount, c_last, c_credit from customer 
         where c_w_id=:w_id and c_d_id=:d_id and c_id=:c_id;
	

After we find out that there is no update on the 'decision sets' involved in these two selects, we may safely combine them back into the original join statement. This way we can take advantage of NDB Cluster's 'push down join' algorithm and achieve higher performance.
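The semantic equivalence of the join and its two-statement breakdown can be checked quickly. Below is a sketch using SQLite as a stand-in engine (the table and column definitions are minimal ones I made up from the TPCC excerpt, not the real schema): absent concurrent updates to the decision sets, both forms return the same data.

```python
# Sketch: the join and its breakdown agree when nothing updates in between.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table warehouse (w_id integer primary key, w_tax real);
    create table customer (c_id integer, c_d_id integer, c_w_id integer,
                           c_discount real, c_last text, c_credit text);
    insert into warehouse values (1, 0.05);
    insert into customer values (7, 3, 1, 0.1, 'Smith', 'GC');
""")
w_id, d_id, c_id = 1, 3, 7

# The original join from the new-order transaction.
joined = conn.execute(
    "select c_discount, c_last, c_credit, w_tax from customer, warehouse "
    "where w_id=? and c_w_id=w_id and c_d_id=? and c_id=?",
    (w_id, d_id, c_id)).fetchone()

# The two-statement breakdown.
w_tax, = conn.execute("select w_tax from warehouse where w_id=?", (w_id,)).fetchone()
cust = conn.execute(
    "select c_discount, c_last, c_credit from customer "
    "where c_w_id=? and c_d_id=? and c_id=?", (w_id, d_id, c_id)).fetchone()

print(joined == cust + (w_tax,))  # True
```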

The second one is in the stock-level transaction as follows:


      select count(distinct(s_i_id)) from order_line, stock
         where ol_w_id=:w_id and ol_d_id=:d_id and ol_o_id < :o_id
         and ol_o_id>=:o_id-20 and s_w_id=:w_id and s_i_id=ol_i_id and s_quantity < :threshold;
	

We break it down into the following statements:


      select distinct(ol_i_id) from order_line 
         where ol_w_id=:w_id and ol_d_id=:d_id and ol_o_id < :o_id and ol_o_id>=:o_id-20;

      for each ol_i_id obtained in the last statement, assuming it is stored in an array cell :ol_i_id[i] {

        //s->n
        select * from t_s_n
           where s_i_id=:ol_i_id[i] and s_w_id=:w_id
           lock in share mode;

        select count(s_i_id) from stock 
           where s_i_id=:ol_i_id[i] and s_w_id=:w_id and s_quantity < :threshold;
      }
	

The necessary prevention measure for the update of s_quantity is also included, for the comparison that follows.

There is an alternative way to break this join down though:


      select s_i_id from stock
         where s_w_id=:w_id and s_quantity < :threshold;

      for each s_i_id obtained in the last statement, assuming it is stored in an array cell :s_i_id[i] {

        select count(distinct(ol_i_id)) from order_line
           where ol_w_id=:w_id and ol_d_id=:d_id and ol_o_id < :o_id
           and ol_o_id>=:o_id-20 and ol_i_id=:s_i_id[i];
      }
	

If we were to use granular locking to prevent the possible predicate RW conflict in this case, it would look like the following, since the update is upon s_quantity:


      //s->n
      select * from t_s_n
         where s_w_id=:w_id
         lock in share mode;
	

This is a much wider granularity than the previous one and is certainly not desirable. In general, if there is a way to tell a priori which execution plan the system prefers, we may combine the statements back into the original join and move the prevention measure before it, since Theorem 2' applies here. In NDB Cluster, for example, one could apply an explain statement to the join, and its 'type' output should give us a hint about how the join is resolved. If MySQL sticks to the inferred execution plan(I am not 100% sure about this though. For example, say NDB Cluster favors the execution plan starting with join table t1 in general. But what if at run time join table t2 only contains one row? Does the optimizer change its mind? A smart one probably should), then we have the 'push down join' back.

The key observation here is that the prevention measure depends on the execution plan. We will see this more clearly if the update is upon a field involved in the join condition of the second join. For example, what happens if the update is on s_i_id? In the first execution plan, this gives rise to a predicate RW conflict; in the second, an item RW conflict results. Right now, we use different conflict resolution measures for them. This implies we can't be sure what to use until the execution plan is resolved. In this ad-hoc solution in the second tier for NDB Cluster, we have to break the join down into a specific execution plan a priori to be sure. This means we can't take advantage of NDB Cluster's 'push down join' algorithm any more. One possible solution is to give the optimizer a hint about which execution plan to choose(it might require new SQL syntax). With a capable system like NDB Cluster, there is the following technical path: complicated joins usually show up in OLAP loads, not OLTP ones, so it is probably OK to just break down the simple ones as we did without worrying about the 'push down join' algorithm; in the case the load shows both OLAP and OLTP characteristics and the OLAP part is a set of selects, we may try to isolate the OLAP part and service it from a replication server, leaving the OLTP part to the primary cluster(we need to pray that a complicated multi-table update or delete is not present).

In the case of sub-queries and unions(intersects and excepts too), this execution plan issue doesn't show up unless they contain joins. Otherwise, for a sub-query for example, MySQL always evaluates it from the inside out unless it is a correlated one. This makes the conflict analysis easier for them and I won't elaborate. Of course, we always have the option to break them down into a bunch of statements that each involve only a single table, as we did for joins.

3.6 Durability of consistency

So all of this deals with the consistency problem in NDB Cluster: an application with all the necessary protection and prevention measures will leave a database consistent if it started so and the application finishes successfully. What if the system fails while the application is running? Will it leave the database in a consistent state? How about replication? Does it leave the slave in a consistent state in a traditional replication setting? What about in a more advanced active-active setting? What if a network partition strikes and results in a split-brain situation? We'll answer these questions next.

We first consider the situation where an NDB Cluster suffers a system-wide failure, and we call the answer to the question 'whether it will have a consistent state after the cluster is recovered from such a catastrophe' durability of consistency. To answer it, we need to recap a key concept in NDB Cluster called the 'epoch' first.

Every 100 milliseconds or so, the node in the cluster currently 'elected' as DIH Master initiates a two-phase protocol called Global Checkpoint and sends a GCP_PREPARE_REQ signal to all nodes, including itself. On receiving a GCP_PREPARE_REQ signal, each node immediately blocks any prepared transactions from 'beginning to commit'. This causes the Transaction Coordinator (TC) blocks in each node to queue transactions that request commit; transactions already committing can continue to commit, and transactions can continue to be prepared and aborted. Once commit is blocked, the node immediately sends a GCP_PREPARE_CONF back to the DIH Master.

When all GCP_PREPARE_CONF signals have been received, the DIH Master immediately sends a GCP_COMMIT signal to all nodes, which unblocks COMMIT request processing and increments the Global Checkpoint Index (Number) (GCI) associated with every transaction which commits after this point. After unblocking Commit initiation, GCP_COMMIT processing then waits until all of the transactions with the previous GCI have completed, before sending GCP_NODEFINISH back to the DIH Master. Once all GCP_NODEFINISH are received, the DIH Master considers the GCP round complete.

The logical time between two consecutive increments of GCI is called an 'epoch' and the commit of a transaction is associated with an 'epoch' uniquely.
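The GCP round described above can be modeled in a few lines. This is a toy model of my own (class and method names are invented, not NDB kernel code) showing the essential property: commits are tagged with the current GCI, the prepare phase briefly queues commit requests, and queued commits land in the new epoch after GCP_COMMIT.

```python
# Toy model of GCP epoch assignment on a single node.
class GcpCoordinator:
    def __init__(self):
        self.gci = 1
        self.commit_blocked = False
        self.queued = []
        self.committed = []  # list of (txn, gci)

    def request_commit(self, txn):
        if self.commit_blocked:
            self.queued.append(txn)      # TC queues the commit request
        else:
            self.committed.append((txn, self.gci))

    def gcp_round(self):
        self.commit_blocked = True       # GCP_PREPARE: block 'begin to commit'
        self.gci += 1                    # GCP_COMMIT: increment GCI, unblock
        self.commit_blocked = False
        for txn in self.queued:          # queued commits land in the new epoch
            self.committed.append((txn, self.gci))
        self.queued.clear()

node = GcpCoordinator()
node.request_commit("T1")   # commits in epoch 1
node.gcp_round()
node.request_commit("T2")   # commits in epoch 2
print(node.committed)       # [('T1', 1), ('T2', 2)]
```

This captures why the commit of every transaction is associated with exactly one epoch.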

NDB Cluster uses neighborhood Write-Ahead Logging, or nWAL, to provide high availability in case of node failure. Namely, a write is written to all replicas in a node group before it is published as committed to an API node, like a mysqld server. And before a tuple is written, an entry is written to the REDO log to record this write event. The REDO log at each node is flushed to disk every second. In the case a node fails, the NDB kernel just uses the rest of the node group to continue providing service, since everything there is the same as in the failed node. In the case the whole node group or the whole system fails, the REDO log is used to restore the system to a consistent state. The question is: does the term 'consistent' as used by NDB Cluster here mean the same thing as consistency in this article?

So we examine the conditions in type B of the Serializability Theorem and easily find that every condition is satisfied except possibly Condition 2'.

Condition 2' requires that the write event of a version exist if a read event of it does, and that the write precede the read. In the case of a system-wide recovery, the local REDO log in each node is used to recover that node to less than one second before the failure. If this recovery is NOT to an 'epoch' boundary, there is a chance that this requirement fails.

For example, let n1, n2 be two data nodes in the cluster, with n1 coordinating transaction T1 and n2 coordinating T2. Say T1 updates a tuple on n1 which is later read by T2, and this read feeds T2's own update. Assume both T1 and T2 commit and their modifications are recorded in the local REDO logs on n1 and n2 respectively(for instance, T1 and T2 only modify data on their own coordinators). If a system crash strikes after T2's log is flushed, but before T1's is, then the recovery afterwards results in T1's update NOT being reflected in the database while T2's is. This is actually a violation of Condition 2', since although T2's read is not recorded in the REDO log, it is implied by its update. Inconsistency could arise from this in general. For example, suppose the tuple read by T2 consists of pay-grade info of an employee, that info is used by T2 to update the employee's salary, and the consistency constraint is that any employee's salary must be correlated with his/her pay grade; then the failure to log T1's modification could result in a violation of this constraint. This example also provides a better understanding of Condition 2': it doesn't just require a write to be present and to have 'happened before' its read, this relationship must also be captured by the database.

If this recovery is up to an 'epoch' boundary, in other words, if the REDO log is replayed to the largest 'epoch' that survives the crash(the largest 'epoch' that is persisted to disk in the REDO log of every node), we can prove Condition 2' still holds and type B of the Serializability Theorem applies.

We start by proving the following fact: if an update is recorded in a later 'epoch', it can't be read by a transaction committed in the current or an earlier 'epoch'. We prove it by contradiction. Suppose such a read could happen in the current or an earlier 'epoch'. Then this read event 'happened before' the commit of its transaction, which 'happened before' the commit of the updating transaction by the description of the 'epoch' mechanism, since the latter commits in a later 'epoch'. On the other hand, we are under the Read-Committed isolation level of NDB Cluster, so the commit event of the updating transaction 'happened before' the read. This contradiction proves the proposed fact and leads to the following conclusion: a read only reads updates from the current or an earlier 'epoch'. And since the recovery is up to an 'epoch' boundary, everything before that boundary has been recovered, including any update read by a recovered transaction.

According to an article by Frazer Clement which is available here, “epoch boundaries act as markers in the flow of row events generated by each node, which are then used as consistent points to recover to”. This means every node in the system is recovered to the same 'epoch' boundary in a system-wide recovery. Hence the recovered database state is consistent by type B of the Serializability Theorem.
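The epoch-boundary argument can be illustrated with a sketch. The log layout and names below are mine, for illustration only: recovery replays each node's REDO log only up to the largest epoch durable on every node, so a dependent write (T2, which read T1's update) can never survive a recovery that loses the write it read (T1).

```python
# Sketch: recovery to an epoch boundary keeps Condition 2' intact.
def recovery_point(flushed_epochs):
    """Largest epoch that survived the crash on every node."""
    return min(flushed_epochs.values())

def recover(redo_logs, flushed_epochs):
    """Replay each node's log only up to the cluster-wide recovery point."""
    cutoff = recovery_point(flushed_epochs)
    return [(txn, epoch) for log in redo_logs.values()
            for (txn, epoch) in log if epoch <= cutoff]

# T1 commits in epoch 5 on n1; T2 reads T1's update and commits in epoch 5 on n2
# (a read only sees updates from the current or an earlier epoch).
redo_logs = {"n1": [("T1", 5)], "n2": [("T2", 5)]}

# n2 flushed epoch 5, but n1 only flushed up to epoch 4 before the crash:
survivors = recover(redo_logs, {"n1": 4, "n2": 5})
print(survivors)  # [] -- both are dropped, so Condition 2' cannot be violated
```

Contrast this with raw per-node replay, which would keep T2 while losing T1, exactly the violation described above.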

This conclusion remains true if we instead are applying the Generalized Serializability Theorems to our application since it is like applying type B of the Serializability Theorem to the Read-Write transactions, for example. And they are the only transactions that can affect the underlying database. For the Ramification Theorems, the conclusion remains true on Ram(B)'.

For traditional replication, where system-wide changes are replicated to a slave using a binlog, the slave always maintains a consistent state by type B of the Serializability Theorem. The reason is that all the changes in an 'epoch' across all the data nodes in the system are grouped into an 'epoch' transaction, which contains all the writes by transactions in that 'epoch', before it is replicated to the slave, according to Frazer Clement's article. From the arguments of the last few paragraphs, this guarantees consistency if the replication runs to its end.

Now what happens if the replication process fails in the middle? If only the replica or the replication channel fails, there are relevant solutions in the MySQL documentation and I won't elaborate on those. We only consider the situation where the source cluster also suffers a system-wide failure here. In that case, from the discussion about durability of consistency, we know that we can always recover the source to a consistent state. After syncing that state to the replica, we may resume replication.

In an advanced active-active setting where all clusters in the system are able to handle write loads, conflicts are resolved using mechanisms described in a series of articles by Frazer Clement, which are available here. Conflicts come in two flavors in this context: tuple-based and transaction-based. Tuple-based conflicts arise when the primary and secondary clusters modify the same tuple concurrently; when each cluster replicates the update to its peer, the updates could end up applied in different orders on the two clusters. The resolution is to roll back the modification from the secondary cluster and let the one from the primary prevail. But if there are other updates in the transaction encapsulating the update from the secondary, this resolution mechanism could lead to a situation where only part of the transaction is aborted, since not every update in it results in a conflict. This anomaly is called 'data-shearing' in Frazer Clement's series, and hence transaction-based conflict is introduced to take care of it. Transaction-based conflict can be defined with the following quote from Frazer Clement's article available here:

“Where a row is found to be in-conflict with some replicated row operation, a further replicated row operation on the same row should also be found to be in-conflict, until the conflict condition has been cleared. This property is implicitly implemented in the existing row based conflict detection functions.

When the scope of a conflict is extended to include all row modifications in a transaction, this implies that all following replicated row operations which affect the same rows, must also be in conflict by implication. To avoid row shearing, these implied-in-conflict rows must implicate the other rows in their transactions, and those rows may in-turn implicate other rows. The overall effect is that a single row conflict must cause its transaction, and all dependent transactions to be considered to be in conflict.”

This means that if a transaction on the secondary cluster is in-conflict, all transactions in the same 'epoch' that have WW conflicts with it are also considered in-conflict. Recursively, every transaction in that 'epoch' is examined to see if it conflicts with the previously found ones, until the set of found transactions no longer expands. This set of transactions is then rolled back, and the conflict resolution process moves on to the next 'epoch'.
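The expansion described above is essentially a transitive closure over WW conflicts within one 'epoch'. A minimal sketch, assuming transactions are represented by their write sets of row keys (the names and representation are illustrative, not NDB's actual data structures):

```python
def expand_in_conflict(epoch_txns, writes, seed_conflicts):
    """Expand an initial set of in-conflict transactions to the full
    transitively implicated set within one epoch.

    epoch_txns: iterable of transaction ids in the epoch
    writes: dict mapping txn id -> set of row keys it wrote
    seed_conflicts: txn ids found in-conflict by row-based detection
    """
    in_conflict = set(seed_conflicts)
    changed = True
    while changed:                    # iterate until the set stops growing
        changed = False
        for t in epoch_txns:
            if t in in_conflict:
                continue
            # t is implicated if it wrote a row also written by any
            # already in-conflict transaction (a WW conflict)
            if any(writes[t] & writes[c] for c in in_conflict):
                in_conflict.add(t)
                changed = True
    return in_conflict
```

All transactions returned here would then be rolled back before moving on to the next 'epoch'.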

Notice that this scheme doesn't suffer from the same problem as in the durability of consistency case. For example, suppose again that in the secondary cluster T1 updates the database in 'epoch' e, and T2 reads this update and performs its own update in 'epoch' e+1; simultaneously, T1's update causes a conflict with the primary cluster and T1 is rolled back. Then the database is again in an inconsistent state, since that update is gone but its effect (the read and update in T2) is still there, and Condition 2' is violated. This, however, leaves the secondary cluster's state inconsistent only transiently, because the conflict resolution process will discover the violation and roll back T2 shortly, when it deals with 'epoch' e+1. So the secondary cluster will eventually be consistent with the primary.

In the case the primary cluster suffers a system-wide failure, the secondary will assume its role. In the unlikely case where the secondary cluster also suffers a system-wide failure afterwards, it is subject to the same situation as in the durability of consistency case.

When a split-brain situation occurs, there is at most one subset of surviving nodes in NDB Cluster that is going to continue to provide service. The criteria for choosing this surviving subset are:

  1. It must contain at least one node from each and every node group.
  2. It must also contain a pre-selected node called the arbitrator. In other words, only the subset including the arbitrator qualifies.

Remark: If the network topology is special enough that the arbitrator can be reached by more than one isolated subset, the arbitrator only responds 'yes' to the first subset that queries it. So the uniqueness of the surviving subset is well-defined.
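The two criteria can be sketched as a simple predicate. The representation of node groups and the arbitrator below is hypothetical, and arbitration is reduced to "the subset contains the arbitrator" (ignoring the first-to-query tie-breaking of the remark):

```python
def can_survive(subset, node_groups, arbitrator):
    """Check the two criteria for a subset of nodes to continue service:
    1. it contains at least one node from each and every node group;
    2. it wins arbitration, modeled here as containing the arbitrator.

    subset: set of node ids that stayed connected to each other
    node_groups: list of sets of node ids, one set per node group
    arbitrator: id of the pre-selected arbitrator
    """
    covers_all_groups = all(subset & group for group in node_groups)
    return covers_all_groups and arbitrator in subset
```

For example, with two node groups {1, 2} and {3, 4} and node 1 as arbitrator, the subset {1, 3} survives, while {2, 4} (no arbitrator) and {1, 2} (node group {3, 4} lost entirely) do not.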

The mechanism in 2 is called arbitration in NDB Cluster and is described in an article by Frazer Clement discussing CAP Theorem based issues here. Notice that consistency in the CAP Theorem means 'whether two copies of the same data in a distributed system hold the same value' and hence is very different from the consistency discussed in this article.

When a node or network failure is detected by the NDB kernel, updates are not allowed to commit; on the other hand, between the occurrence of a split-brain situation and its detection, committed updates are only possible in node groups with all replicas up. Both are true because the replication between nodes in a node group is synchronous. This implies that consistency as defined in the CAP Theorem is honored before, during and right after a split-brain, provided a surviving subset is able to continue to provide service. After that, this consistency is only honored in the surviving subset until the partition is resolved.

This also implies the consistency discussed in this article is honored, since Condition 2' is never violated in the whole process. So in the case a surviving subset exists, after the network partition is resolved and the other nodes are brought back online, these nodes will be synchronized with the surviving subset before they serve requests again. Whether this is done while the application is executing or after it has finished execution, it won't affect consistency as long as it does finish.

On the other hand, if no such subset survives the partition, the cluster will shut down and a restart is needed after the network partition is resolved. This behaves exactly like the durability of consistency case.

In the traditional replication scenario, if a split-brain situation occurs in the slave, there are two cases. Case 1: there is a surviving subset that can continue to provide service; we only need to resolve the partition and sync the rest of the slave with the surviving subset. Case 2: there is no surviving subset that can continue to provide service; after the partition is resolved, the slave needs to be initialized and replication needs to be set up again with the master. There is a slight complication in Case 1 if the SQL node used for replication is not attached to the surviving subset: we need to stop the service on the slave first and then set up replication with an SQL node in the surviving subset, or treat it as in Case 2 and set that up after the partition is resolved, before service can be restored.

If, on the other hand, a split-brain situation occurs in the master in the traditional replication scenario, there are also two cases. Case 1: there is a surviving subset that can continue to provide service; we only need to resolve the partition and sync the rest of the master with the surviving subset. Case 2: there is no surviving subset that can continue to provide service; in this case, we must restart the master after the partition is resolved, and consistency will be preserved as in the durability of consistency case. In Case 1, the complication that the SQL node used for replication is not attached to the surviving subset must also be addressed. Namely, we need to stop the service on the slave first and then set up replication with an SQL node in the surviving subset before service can be restored.

In an active-active replication deployment, in the case the primary cluster suffers a partition, there are two cases. Case 1: if there is a surviving subset in the primary so that the SQL node for replication is also included, the replication will continue and we only need to resolve the partition and reinstate the primary by syncing the rest of it with the surviving subset. Case 2: otherwise, the secondary will assume its role until partition in the primary is resolved and it is synced with the secondary. In the case the secondary cluster suffers a partition, the situation is similar.

One interesting question is: is there a system that is consistent, but in which durability of consistency could be violated? I don't have a database example for that. But the following incident, which resulted in the email account of Raimondo, US Secretary of Commerce, being hacked, certainly smells like it. The key piece of information there is: “The crash dumps, which redact sensitive information, should not include the signing key. In this case, a race condition allowed the key to be present in the crash dump (this issue has been corrected).” The consumer signing system in question is not a fault-tolerant system. Faults show up in a fault-tolerant system as inconveniences, not mishaps.

3.7 Application to standalone systems

Although the method developed in this article emphasizes distributed database systems, it is backward compatible with and can be applied to standalone systems. In this subsection, we'll explore how to do that in various standalone systems, starting with MySQL InnoDB. If, consistency aside, the expected performance profile for your application is at most a few thousand tps, you are at the right place.

3.7.1 MySQL InnoDB

In the Read-Committed isolation level of MySQL InnoDB, as long as we choose a commit event that 'happened after' the execution of SQL statements but 'happened before' the release of any lock in a transaction to replace the start event of the commit phase in Theorem 1, Theorem 2 and Example 14, the arguments there go through. This implies we may achieve serializability for any history executed under this isolation level, as demonstrated in Example 16. Please use the conflict_parser helper program to generate the tables of potential conflicts to start with. In general, this method reduces locking to a minimum and should outperform MySQL's serializable isolation level. You may also apply the Generalized Serializability Theorems or the Ramification Theorems to your application if inconsistencies in some reads are allowed or inconsistencies need to be constrained to within a few columns. Please refer to Example 21 for a demonstration of applying the Ramification Theorems.

However, there is difficulty mounting such a serializability implementation on top of the Repeatable-Read isolation level of MySQL InnoDB. From the specification here, it seems to have trouble processing locking statements and non-locking reads simultaneously. This is absolutely a pity! I was hoping its gap lock implementation could render Theorem 2 unnecessary.

3.7.1.1 Relation between consistency, durability and durability of consistency

In the last sub-section, we went a long way to prove that NDB Cluster satisfies durability of consistency. Namely, if it suffers a system crash and is later recovered, its recovered database state is consistent. It turns out the reason for all that trouble is that NDB Cluster is NOT durable, a conclusion asserted by the following

Theorem 6: Consistency + durability => durability of consistency.

Proof: Durability implies that when a system is recovered from a disaster, all the committed transactions survive and their effects are reflected in the database correctly. The system is also consistent, which implies that if a history executes successfully, it leaves an originally consistent database in a consistent state. Now consider the state the database system is in when it crashes: some transactions have already committed and some are still in progress. Let H be the history that has led to such a state; modify all the still-in-progress transactions by appending an 'abort;' statement to each of them, and call this new history H'. If the system had not crashed, H' would finish execution and leave the underlying database in a consistent state; but this state is exactly the recovered state for the execution of H. Hence the proof is completed.


	                                                                                                       ## 
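The construction in the proof of Theorem 6 can be illustrated with a small sketch. The event-list representation of a history is an assumption made purely for illustration:

```python
def crash_to_history(events):
    """Model the proof's construction: given the list of (txn, op) events
    of the history H executed up to the crash, append an 'abort' for every
    still-in-progress transaction, yielding the completed history H' whose
    final state equals the recovered state under durability."""
    finished = {t for t, op in events if op in ('commit', 'abort')}
    in_progress = {t for t, _ in events} - finished
    return events + [(t, 'abort') for t in sorted(in_progress)]
```

For example, if T1 committed before the crash while T2 was still writing, H' is H plus an abort of T2, and consistency of H' gives consistency of the recovered state.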
	      
Corollary 1: The serializability implementation(and also when the Generalized Serializability Theorems or the Ramification Theorems are applied instead) based on MySQL InnoDB's Read-Committed isolation level satisfies durability of consistency.

Proof: Durability of MySQL InnoDB is provided by Write-Ahead Logging, and hence the conclusion follows once consistency is in place.


	                                                                                                       ## 
	      
Corollary 2: If battery-backed memory is used to provide durability to NDB Cluster, the serializability implementation(and also when the Generalized Serializability Theorems or the Ramification Theorems are applied instead) based on NDB Cluster's Read-Committed isolation level satisfies durability of consistency.

Proof: Similar to Corollary 1.


	                                                                                                       ## 
	      

4. Type D of the Serializability Theorem

So far, we've only applied type B of the Serializability Theorem, which is still based on a system with tuple versions, to a database application – TPCC. With the proof of type C of the Serializability Theorem, we are in a position to redefine isolation levels like Read-Committed, Repeatable-Read and Snapshot Isolation so that they are free of tuple versions and based on field versions only, allowing type C to be applied.

4.1 Isolation levels for a field-based database system

Snapshot Isolation: Item reads and predicate reads are still based on the snapshot corresponding to the starting time of the transaction. Two transactions writing the same tuple, however, can be concurrent as long as the intersection of their write sets is empty and they are not writing the same 'decision set'. Timestamps are assigned to writes as usual.

In particular, this definition implies transactions updating different fields in the same tuple can be executed concurrently. It also imposes a total order on the field versions of a field, or on the 'decision set' versions of a 'decision set'. For a distributed database system, since we can't synchronize clocks to 100% accuracy by [La 78], we must provide a new way to generate the timestamps. For example, Google's Spanner comes with a way to implement Snapshot Isolation without synchronizing clocks accurately, which is described in Appendix D. More examples will be given in the next section.

Also, to facilitate such a Snapshot Isolation implementation for a field-based system, we need to associate each field version with a timestamp, instead of one per tuple version. This will increase the storage footprint in general. However, there are usually large amounts of memory and disk space in modern servers, and a distributed database system can also be scaled horizontally, so this should not be a problem.
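A minimal sketch of such per-field versioning, assuming a simple in-memory store keyed by (row, field) with integer timestamps (all names here are illustrative, not part of any real system):

```python
from collections import defaultdict

class FieldVersionStore:
    """Each field of a tuple keeps its own version chain of (ts, value)
    pairs, instead of one timestamp per tuple version."""

    def __init__(self):
        self.chains = defaultdict(list)   # (row_key, field) -> [(ts, value)]

    def write(self, row_key, field, value, ts):
        # a new field version with its own timestamp
        self.chains[(row_key, field)].append((ts, value))

    def read_snapshot(self, row_key, field, start_ts):
        # latest field version with timestamp <= start_ts, i.e. what a
        # snapshot read started at start_ts would see
        versions = [(t, v) for t, v in self.chains[(row_key, field)]
                    if t <= start_ts]
        return max(versions)[1] if versions else None
```

Note how two transactions can append versions to different fields of the same row without touching each other's chains, matching the concurrency allowed by the definition above.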

Read-Committed: Item reads and predicate reads use the latest committed item versions as in the Read-Committed isolation level in MySQL InnoDB and NDB Cluster. A write in a transaction is interpreted as field writes within a write set. Field level lock is acquired before a field write and released upon commit.

This definition applies to both standalone and distributed databases, since the term 'latest' refers to a local clock. A field level capable locking system is needed for this Read-Committed isolation level.
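A field level capable locking system can be sketched as a lock table keyed by (row, field); lock waits and deadlock handling are omitted, and the interface is an assumption made for illustration:

```python
class FieldLockTable:
    """Minimal sketch of a field level capable locking system: a write
    lock is taken per (row, field) before a field write and held until
    commit or abort."""

    def __init__(self):
        self.locks = {}   # (row_key, field) -> holding txn id

    def acquire(self, txn, row_key, field):
        holder = self.locks.get((row_key, field))
        if holder is not None and holder != txn:
            return False              # caller must wait (not modeled here)
        self.locks[(row_key, field)] = txn
        return True

    def release_all(self, txn):
        # called at commit or abort: all of txn's field locks go at once
        self.locks = {k: v for k, v in self.locks.items() if v != txn}
```

The point of field granularity shows up immediately: a second transaction blocked on one field of a tuple can still lock a different field of the same tuple.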

Before defining a new Repeatable-Read isolation level, we need to take a closer look at MySQL InnoDB's.

Example 25: Assuming the same table definition as in Example 0 with these two tuples in table t: (1,2,2,10) and (3,5,5,10). Under the Repeatable-Read isolation level of MySQL InnoDB, execute the transactions T1 and T2 in the following order:


	             T1                                                                  T2

        start transaction;

        select * from t;

                                                                            start transaction;

                                                                            update t set id2=1 where id1=1;

                                                                            update t set id2=4 where id1=3;

                                                                            commit;

        update t set id3=3 where id1=1;

        select * from t;

        commit;                                                                                
	  

The first select statement of course returns the two original tuples, while the second select statement returns the following two tuples: (1,1,3,10), (3,5,5,10). So when T1 updates a tuple, it reads the most recently committed tuple version and applies the update to this version; otherwise it reads only the versions before the time T1 starts. In other words, T1 behaves as in the Read-Committed isolation level for the first tuple and as in Snapshot Isolation for the second. This, however, doesn't violate the Repeatable-Read isolation level definition in the SQL standard, since it still proscribes the anomalies listed in that definition. Notice there is a conflict loop in this history, since there is an item RW conflict from T1 to T2 and an item-based WR conflict from T2 to T1 on the id2 field of the first tuple. Actually, if this history were serializable, the order would have to be {T1, T2}, since the first select in T1 read the original state. But then the second select in T1 should've read (1,2,3,10) for the first tuple. That is a contradiction, since type B of the Serializability Theorem applies here.


	                                                                                            (To be continued ...)  
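Checking for a conflict loop like the one above amounts to cycle detection in the directed conflict graph. A sketch, with conflicts supplied as (from, to) edges; how the edges are collected from a history is left out:

```python
def has_conflict_loop(edges):
    """Detect a conflict loop: a cycle in the directed graph whose edges
    are (from_txn, to_txn) item-based or predicate-based conflicts."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())

    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on stack / done
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            # a GRAY neighbor means we closed a loop back onto the stack
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)
```

For Example 25's history, the edges are ('T1', 'T2') for the RW conflict and ('T2', 'T1') for the WR conflict on id2, and the function reports a loop.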
	  

This behavior of MySQL InnoDB is reasonable since its Repeatable-Read isolation level is a tuple-based system and the tuple version established by T1 is after that of T2's, so T1 should be able to see what the previous version is. With a field-based system where the concept of tuple version is absent, the story is very different as we will see shortly.

Remark: If we place a statement 'select * from t;' into T1 right after T2's commit and before T1's update, it will return the two original tuples too.

Repeatable-Read: Item reads and predicate reads use the committed item versions before the starting time of a transaction as in the Repeatable-Read isolation level in MySQL InnoDB. A write in a transaction is interpreted as field writes within the write set. Field level lock is acquired before a field write and released upon commit. Timestamps are assigned to writes as usual.

For a distributed database system, since we can't synchronize clocks to 100% accuracy, we must provide a new way to generate the timestamps. Also as in the Snapshot Isolation case, we need to associate each field with a timestamp.

These three field-based, newly defined isolation levels are all variants of the generic Read-Committed isolation level in the SQL standard since they all proscribe the following two anomalies:


      P0: w1[x] … w2[x] … (c1 or a1),

      P1: w1[x] … r2[x] … (c1 or a1), a for abort and c for commit here.
    

The three field-based, newly defined isolation levels mainly concern P1, but P0 is implicitly proscribed as well, according to Adya's paper. Of course, we should supply a version of P1 for predicate reads here, so that it becomes:


      P1: w1[x] … r2[x] … (c1 or a1),

          w1[x] … r2(P: Vset(P)) … (c1 or a1), a for abort and c for commit here.
    

4.2 Demonstration of type C of the Serializability Theorem under the newly defined isolation levels

Since type C of the Serializability Theorem is based on the generic Read-Committed isolation level with conflicts interpreted at the field level, Conditions 5 & 6 are both proscribed by the three field-based, newly defined isolation levels. Conditions 1, 2', 3' & 4 are just common sense. Type C of the Serializability Theorem also requires the write events of a field to form a serial order consistent with the 'happened before' partial order, and this is certainly satisfied by the definitions of the three isolation levels. Hence we may apply type C of the Serializability Theorem to the three field-based, newly defined isolation levels as long as we can prevent conflict loops from forming.

Example 25 (Continuation...): Under this newly defined Repeatable-Read isolation level, the same execution will give a different result. In particular, the second select in T1 should return (1,2,3,10) for the first tuple and (3,5,5,10) for the second one. In other words, for the first tuple, T1 sees the update to id3 by T1, but not the update to id2 by T2, since everything is interpreted at the field level now and T1 is NOT updating id2. Because there is no conflict loop in this history, it is serializable by type C of the Serializability Theorem. And the order of serialization is, of course, {T1, T2}. This is just another example that we should use a field-based system, since it leads to fewer conflicts in general.

If we alter the original execution to the following under this newly defined Repeatable-Read isolation level, we see behavior similar to the original one again.


	                 T1                                                          T2

        start transaction;
 
        select * from t;

                                                                        start transaction;

                                                                        update t set id2=1 where id1=1;

                                                                        update t set id2=4 where id1=3;

                                                                        commit;

        update t set id2=id2+1 where id1=1;

        select * from t;

        commit;                                                                                   
	  

The second select in T1 should return (1,2,2,10) for the first tuple and (3,5,5,10) for the second one, since T1 reads the update to id2 by T2 on the first tuple, but not the one on the second. This is different from the second select in T1 in either serial execution. That is because there is again a conflict loop in this history: there is a RW conflict from T1 to T2 and a WW (or WR) conflict from T2 to T1 on id2 for the first tuple.


	                                                                                            (To be continued ...)  
	  

4.3 Type D of the Serializability Theorem

To support symmetric access via multiple attributes and get rid of the anomaly in Example 0, we need a new type of Serializability Theorem other than type C, namely type D, in which the 'decision set' may consist of more than one field. To do that, we are going to capture a 'decision set' version concept out of the system for type C, where only field versions are present. In other words, in the following discussion we are assuming a system that offers a generic Read-Committed isolation level and in which only field versions are available. This is very different from a system with the concept of tuple versions, where we may simply derive a 'decision set' version from a tuple version.

We will first use a trick to capture the 'decision set' version concept and then generalize it for type D of the Serializability Theorem. The trick is to pick a field in the 'decision set' and require that each update of the 'decision set' also modifies this specific field. The modifications of the specific field naturally impose a total order on its field versions and we hope that this total order will lead to a total order among the 'decision set' versions to be defined. It turns out this goal can be achieved as follows: because the total order of the specific field versions is generated by a sequence of serial write events as we mentioned after the definition of the 'happened before' partial order, we may require the following condition to hold.

Condition **: The specific field is modified last and if another field in the 'decision set' is also updated in the same transaction, it is NOT available for access until the update of the specific field is finished. Order of events is interpreted with the 'happened before' partial order as usual.

Suppose the 'decision set' consists of fields {a, b, c}, with c being the specific field. Transaction T1 updates all of them while T2 only updates b & c, such that the order of updates of c is T2 before T1. First, let's observe that the update of b by T2 'happened before' that of T1 too. Proscribing P0 implies that, in any reasonable implementation, the write by the second transaction must wait until after the commit of the first one, since otherwise, if the first transaction were, for example, context-switched out, P0 might result. So if the write of b by T1 'happened before' that of T2, the commit event of T1 'happened before' that of T2 too, since it 'happened before' the write of b by T2. But we also know from the writes of c that it should be the other way around. This is a contradiction, and the observation is correct.

Then the update of b in T1 is confined between write events of c since: the update of b in T1 'happened before' the update of c in T1 is asserted by Condition **; the update of c in T2 'happened before' the commit of T2, which in turn 'happened before' the update of b in T1 because P0 is proscribed. The update of a in T1, however, can overlap with that of c in T2. This guarantees a new 'decision set' version is generated whenever the write event of the specific field ends: for fields updated in the current transaction, the versions just generated are used; for fields not updated, the latest generated versions or those from original database state are used. With this definition of a 'decision set' version, the writes on the specific field give rise to a 'decision set' version system and a total order can be imposed on the 'decision set' versions so that total order of its derived field versions on that specific field coincides with the original total order of the specific field versions. It is also easy to see that the order of 'decision set' versions coincides with that of other fields since they are both consistent with the order of the specific field.

And what if two multi-field 'decision set's overlap? Do they introduce the same 'derived' versions on the set of overlapping fields? It turns out they do. The observation is that for each 'derived' version, the overlapping set must be modified and this implies both 'derived' version sets coincide. So the only problem is these two 'derived' version sets may not have the same total order imposed. But that is impossible since by the previous analysis, the total order imposed on both 'derived' version sets must coincide with the 'happened before' partial order and hence must be the same. A similar argument can be used to show that the 'derived' field versions from these 'derived' overlapping set versions coincide with the original field versions. So the definition of this 'decision set' version system is sound.

Remark: In Condition **, we use the completion of update of the specific field to signify the establishment of a new 'decision set' version. The fact that no transaction can access the 'decision set' version before that doesn't mean they have to be accessed right after. For example, if we were to implement such a field-based system for NDB Cluster, this completion of update can be chosen to be when the primary replica gives up the field locks on a tuple. Then the 'decision set' version generated can't be updated until the corresponding locks are given up in the secondary replica. On the other hand, if we were to implement it for MySQL InnoDB, we may choose this completion of update to be when field locks are released and the 'decision set' version will be available for both read and update immediately. So Condition ** is defined in such a way that it will allow different implementations.

In the newly defined Read-Committed isolation level, a write event can be represented by the acquisition and release of the field lock. To satisfy Condition **, we may acquire field locks for all the fields written in a fixed order so that unnecessary deadlocks don't arise, update the fields, and then release the locks simultaneously; or we may request a lock that covers all the fields written in the 'decision set', if the field level capable locking system supports it, and perform the updates while the lock is held. Notice in the former case, it is usually not possible to acquire all the field locks simultaneously, since some of them may require lock waits, but it is in general possible to release them at the same time. In the newly defined Snapshot Isolation level, writes on the specific field clearly generate 'decision set' versions, since the corresponding transactions are non-concurrent, and it also satisfies Condition **. In the newly defined Repeatable-Read isolation level, the 'decision set' version is generated as in the Read-Committed isolation level, and Condition ** is satisfied for the same reason as in the Read-Committed case. Notice in both the Snapshot Isolation and Repeatable-Read cases, we may use the timestamp of the specific field version as that of the corresponding 'decision set' version.
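The former scheme (locks acquired in a fixed order, the specific field written last, locks released together) can be sketched as follows. The class and its interface are illustrative assumptions, not an actual locking system:

```python
import threading

class DecisionSetWriter:
    """Sketch of Condition ** under the field-based Read-Committed level:
    field locks are acquired in a canonical (sorted) order so unnecessary
    deadlocks don't arise, the specific field is written last, and all
    locks are released together at commit."""

    def __init__(self, specific_field):
        self.specific = specific_field
        self.locks = {}                  # field name -> threading.Lock

    def lock_for(self, field):
        return self.locks.setdefault(field, threading.Lock())

    def update(self, updates, apply_write):
        fields = sorted(updates)         # canonical lock acquisition order
        for f in fields:
            self.lock_for(f).acquire()
        try:
            # non-specific fields first; the write of the specific field
            # coming last signals a new 'decision set' version
            order = [f for f in fields if f != self.specific] + [self.specific]
            for f in order:
                apply_write(f, updates[f])
            return order
        finally:
            for f in fields:             # released together, at commit
                self.lock_for(f).release()
```

Any transaction updating the 'decision set' takes locks in the same sorted order, so two such transactions can never wait on each other in a cycle; the specific field's write closing each update gives the total order on 'decision set' versions.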

If we try to use the 'decision set' version we've just defined for a predicate read, we must make sure it is the 'decision set' version that we use, not an arbitrary combination of field versions in the 'decision set', since we are now in a field-based system. In the newly defined Read-Committed isolation level, this can be done by acquiring a short read lock on the specific field before the predicate read of a tuple and making sure the matching process happens while the lock is held. Since updates of the fields in the 'decision set' only happen after the specific field is locked, the short read lock guarantees the reading of a 'decision set' version. In the case an initial internal version of the 'decision set' is read, the write lock acquired on the specific field before the read guarantees the reading of a 'decision set' version.

In the newly defined Snapshot Isolation level, the predicate read returns all the latest field versions in the 'decision set' that are before the starting time of the transaction, and the fact that these field versions constitute the latest 'decision set' version is clear from the way 'decision set' versions are generated. In the case an initial internal version of the 'decision set' is read, the situation is the same.

In the newly defined Repeatable-Read isolation level, as Example 25 indicates, different scenarios arise. If the 'decision set' of the predicate read is NOT altered in the transaction, or the predicate read happens before the alteration, it behaves the same as in the Snapshot Isolation case: it doesn't see any 'decision set' version established after the starting time of the transaction. Otherwise, if the predicate read happens after the alteration, it behaves the same as in the Read-Committed case: it sees 'decision set' versions established after the starting time of the transaction. In either case, we may have it return a 'decision set' version for the predicate read.

Example 25 (Continuation...): If we replace the second select in T1 in the original history with a statement like '… where id2=1 and id3=3;' under the newly defined Repeatable-Read isolation level, the first tuple will be a match for this predicate read because the reading of the initial internal 'decision set' version sees the update in T2(say, we use id2 as the specific field). And a conflict loop exists in the execution since there is a RW conflict from T1 to T2 and a WW(or WR) conflict from T2 to T1 on the 'decision set' {id2, id3} of the first tuple.


	                                                                                            (To be continued ...)  
	  

In general, we should impose the following condition when transactions interact with the 'decision set' version system.

Condition ***: A predicate read or a read of an initial internal 'decision set' version reads a 'decision set' version instead of an arbitrary combination of field versions.

From the discussion above, we know that the three newly defined isolation levels all satisfy Condition ***.

Remark: If we were to implement a field-based system for NDB Cluster, Condition *** can be satisfied easily since it is just an implementation of the newly defined Read-Committed isolation level.

With an initial internal 'decision set' version defined, we may apply updates of the 'decision set' to it successively and create an internal 'decision set' version system when a transaction needs to modify this specific 'decision set'. And that is the last piece of the puzzle for presenting the following

Serializability Theorem(with 'decision set' versions induced by a specific field): Let H be a history in a field-based system with 'decision set' versions induced by a specific field so that Condition ** and Condition *** are satisfied and with the 'happened before' partial order satisfying Conditions 1, 2', 3', 4 and proscribing Conditions 5 & 6, where the predicate-based conflicts are based on Definitions 3' & 4'. Then H proscribes Condition 7 iff it is serializable.

Proof: The proof is a hybrid of those of type B and type C. In particular, the arguments involving 'decision set's are similar to those of type B and the arguments involving fields are similar to those of type C. This time, though, a 'decision set' version is induced by the specific field instead of being derived from a tuple version.


	                                                                                                       ## 
	  

In general, we need a systematic way to define a 'decision set' version system so that the following requirements are met for type D of the Serializability Theorem.

  1. A way to define the system.
  2. A way to read the 'decision set' versions in a predicate read without mistakenly using an arbitrary set of field versions in the 'decision set' instead, so that predicate-based conflicts can be defined.
  3. A way to read the 'decision set' versions for the initial internal version without mistakenly using an arbitrary set of field versions in the 'decision set' instead, so that successive internal versions and eventually the committed 'decision set' version can be defined.
  4. And also a way to introduce item-based conflicts between reads and writes of 'decision set' versions so that we can reason about conflicts with them.

Remark: Again we define in requirement 4 a third kind of conflict other than the item-based conflicts between fields and the predicate-based conflicts, as in type B of the Serializability Theorem. In type A, this is buried under the concept of tuple versions; in type C and the type with 'decision set' versions induced by a specific field we just presented, it is obscured by conflicts between field versions and the specific field's versions. In type B, we've tried to isolate it from the tuple versions. And now in type D we've made it stand out more clearly since we don't have tuple versions in the way any more. This is the exact ingredient we need to fence off Example 0 since now updating different fields in a 'decision set' will result in a conflict.
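As a sketch of the item-based conflict in requirement 4 — updates touching different fields of the same 'decision set' still conflict — consider the following; the function and field names are illustrative, not part of any system discussed here:

```python
# Hypothetical sketch of requirement 4's item-based conflict: two writes
# conflict on a 'decision set' if each of them touches at least one field
# of that set, even when the touched fields are disjoint.

def decision_set_conflict(write1, write2, decision_sets):
    """write1/write2: sets of field names written by two transactions.
    decision_sets: iterable of frozensets of field names.
    Returns the 'decision set's on which the two writes conflict."""
    return [ds for ds in decision_sets
            if write1 & ds and write2 & ds]

# Example 0's situation: two transactions update different fields of the
# same 'decision set' {id2, id3} -- they still conflict.
sets = [frozenset({"id2", "id3"})]
assert decision_set_conflict({"id2"}, {"id3"}, sets) == [frozenset({"id2", "id3"})]
# Writes to fields outside every 'decision set' don't conflict this way.
assert decision_set_conflict({"a"}, {"b"}, sets) == []
```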

Of course, it also requires a way to identify which field version to read after a predicate read if item reads follow as in type C. All these lead to a new type of the Serializability Theorem as follows.

Serializability Theorem(type D): On top of a field-based system, given a way to define a 'decision set' version system so that the former four requirements are all satisfied, and let H be a history with the 'happened before' partial order satisfying Conditions 1, 2', 3', 4 and proscribing Conditions 5 & 6, where the predicate-based conflicts are based on Definitions 3' & 4'. Then H proscribes Condition 7 iff it is serializable.

Proof: Similar to the Serializability Theorem with 'decision set' versions induced by a specific field.


	                                                                                                       ## 
	  

This is not a surprise since type D of the Theorem is just the generalization that we hoped the ad-hoc one with a specific field would lead to. Hence what remains is to define a 'decision set' version system for each of the three newly defined isolation levels that satisfies all four requirements, so that we can apply the theorem to them.

In the newly defined Snapshot Isolation isolation level, we may require updates to the same 'decision set' of a tuple to be conflicting. This way transactions with these conflicting updates can NOT be concurrent. Hence the 'decision set' versions are defined by the temporal order without needing the specific field, and each of them can be labeled with a timestamp. The way to read one in a predicate read or as an initial internal version is timestamp based, just as reading a tuple in a tuple based system. Starting with the initial internal version, we can write all the internal versions and eventually a committed version is generated and timestamped if the transaction commits. And the item-based conflicts between these writes and reads of the 'decision set' versions are defined like in a tuple based system. Again we can argue that consistency issues related to 'derived' versions will not arise, as in the case where 'decision set' versions are defined with the help of a specific field.
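A minimal sketch of such a timestamped 'decision set' version chain, assuming conflicting updates are serial as required above; the class and method names are illustrative:

```python
import bisect

# Illustrative sketch: a timestamped version chain for one 'decision set',
# read the way a tuple is read in a tuple-based system.
class DecisionSetVersions:
    def __init__(self, initial_values):
        self.timestamps = [0]            # commit timestamps, increasing
        self.versions = [initial_values] # values of the decision set's fields

    def commit(self, timestamp, values):
        # Conflicting updates are serial, so timestamps arrive in order.
        assert timestamp > self.timestamps[-1]
        self.timestamps.append(timestamp)
        self.versions.append(values)

    def read(self, snapshot_ts):
        # Latest version committed at or before the snapshot timestamp;
        # used both for predicate reads and for the initial internal version.
        i = bisect.bisect_right(self.timestamps, snapshot_ts) - 1
        return self.versions[i]

chain = DecisionSetVersions({"id2": 0, "id3": 0})
chain.commit(10, {"id2": 1, "id3": 3})
assert chain.read(5) == {"id2": 0, "id3": 0}
assert chain.read(10) == {"id2": 1, "id3": 3}
```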

In the newly defined Read-Committed isolation level, things are more complicated since a field level capable locking system has to be in place and there could be different implementations. While we are at the subject, I'd like to discuss a couple of alternatives and hope it would help pave the way to such an implementation in the future.

Alternative one: In the field level capable locking system, we may acquire a lock that covers exactly the 'decision set'. It is like a tuple lock, except that it might cover fewer fields. And we call it a 'decision lock'.

Alternative two: In the field level capable locking system, we may only acquire field locks to cover all the fields of the 'decision set'.

For alternative one, we consider two strategies. Let's say the 'decision set' in concern consists of fields {a, b, c, d, e} and we call it I. And there is also another 'decision set' overlapping with it with fields {d, e, f} and we call it II.

In strategy one, when 'decision set' I is updated, if d or e is also updated, we acquire a long 'decision write lock' that covers all fields a-f; if neither d nor e is updated, we acquire a long 'decision write lock' that covers fields a-e. Once the lock is acquired, 'decision set' I is updated('decision set' II is also updated if d or e is updated) and a new version of I is established when the lock is released at transaction commit. This guarantees serial updates of 'decision set' I and the versions for 'decision set' I are well-defined. The most recently generated 'decision set' versions are used in predicate reads or the read for the initial internal version by requesting a short 'decision read lock' or a long 'decision write lock' respectively. Once the initial internal version is read, updates are applied to it in order to generate the internal versions and eventually lead to the committed version of a 'decision set'. And the way to recognize the item-based conflicts between these reads and writes of a 'decision set' is the same as in the tuple system case, even if writes don't necessarily share common fields.
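The lock coverage of strategy one can be sketched as follows, using the 'decision set's I and II from above; the function name is made up:

```python
# Hypothetical sketch of strategy one's lock footprint. The 'decision set's
# I = {a..e} and II = {d, e, f} are taken from the text; the rest is
# illustrative.
DS_I = {"a", "b", "c", "d", "e"}
DS_II = {"d", "e", "f"}

def decision_write_lock_fields(updated_fields):
    """Fields covered by the long 'decision write lock' when DS I is updated."""
    covered = set(DS_I)
    if updated_fields & {"d", "e"}:   # the update also touches DS II
        covered |= DS_II              # the lock must cover a-f
    return covered

assert decision_write_lock_fields({"a", "d"}) == {"a", "b", "c", "d", "e", "f"}
assert decision_write_lock_fields({"a", "b"}) == {"a", "b", "c", "d", "e"}
```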

In strategy two, when 'decision set' I is updated, if d or e is also updated, we acquire a long 'decision write lock' that covers all fields a-e and a long 'decision write lock' that covers fields d-f(these two locks don't conflict with each other if they are requested by the same transaction); if neither d nor e is updated, we only acquire a long 'decision write lock' that covers fields a-e. In either case, 'decision set' I is only updated after all the necessary locks are acquired. In this strategy, a 'decision set' version for I is well-defined since between the time the first lock is acquired and the last lock is released, no other transaction can acquire the long 'decision write lock' that covers a-e. And because an update only happens when 'decision set' I is covered with this 'decision write lock', it implies updates to 'decision set' I are serial. Notice that the intervals between first-lock acquisition and last-lock release for different transactions could overlap with each other. For example, if there was also a 'decision set' III consisting of fields a, b & c and the order we were acquiring and releasing these locks in a transaction was III, I, II, then there was a chance that one such transaction was releasing the lock for II and another such transaction was already acquiring the lock for III since they don't conflict with each other. Hence compared with strategy one, the order of 'decision set' versions is a bit fuzzier, but it is still NOT ambiguous. The reading, writing and the item-based conflicts part of a 'decision set' will be the same as in strategy one. Of course to make the whole thing work, we probably should impose a global order to eliminate deadlocks. But we are not going to elaborate on that here since strategy one is certainly a simpler choice and the main purpose of strategy two is to induce the following alternative two.

An argument similar to the one in the case of the Serializability Theorem with 'decision set' versions induced by a specific field can be used to show that 'derived' versions from common fields of two overlapping 'decision set's will agree. So the definition of the 'decision set' version system is sound.

In the newly defined Repeatable-Read isolation level, everything is similar to the Read-Committed isolation level case except that we need to assign a timestamp to each 'decision set' version to facilitate reads in a repeatable-read manner. Since things are a little bit fuzzy in strategy two of alternative one and alternative two, we need to be careful about which point of time we choose as the timestamp. The point in time when the first of the acquired locks is released could be one such choice. Then all four requirements for a 'decision set' version system will be satisfied and hence we've proved the following

Corollary: We may apply type D of the Serializability Theorem in the three newly defined isolation levels to achieve Serializability.

Example 25 (Continuation...): If we replace the second select in T1 with a statement like '… where id2=1 and id3=3;' under the newly defined Repeatable-Read isolation level and this time the 'decision set' version system is not introduced with the help of a specific field, an item-based WW and an item-based WR conflict from T2 to T1 on 'decision set' {id2, id3} of the first tuple exist and, together with the RW conflict from T1 to T2, form a conflict loop.


	                                                                                            ## 
	  

All the Generalized Serializability Theorems and Ramification Theorems are also true in type D's setting. The proofs are similar to those for type B, except this time the 'decision set' is not necessarily derived.

4.4 Relations between different types of the Serializability Theorem

Let's take a look at the relations between type D and the other three types of the Serializability Theorem. First type C is a special case of type D where its 'decision set's happen to contain just one field. Type B, of course, is a special case of type D where the 'decision set' version system is derived. So proving type D of the Serializability Theorem also proves type C and type B of it.

We claim we can use type D of the Serializability Theorem to prove type A of it. The 'if' part is the same as before and the arguments for the 'only if' part are as follows.

Proof: For any history H, we may interpret the conflicts at the tuple level, as in type A of the Theorem; or we may interpret them at the field level, as in type D of it. When conflicts in H are interpreted in both ways, observe that for every conflict in type D, there is a corresponding chain of conflicts in type A. Actually for predicate-based conflicts, such a chain contains exactly one conflict; for item-based conflicts, we may fill in a few tuple writes between the original two operations. On the other hand, there are conflicts in type A for which we can't find a counterpart in type D, such as updates of non-overlapping field sets of a tuple by two transactions. To avoid any ambiguity, we denote the DSG in type A of the Theorem as DSG(A) and the one in type D as DSG(D) in the context of this discussion. From this observation, it is easy to see that acyclicity of DSG(A) implies acyclicity of DSG(D).

Starting with H in type A in which DSG(A) is acyclic, we may choose a topological sort of DSG(A) such that conflicts in it are going downward. For this specific topological sort, we know from the last paragraph that the conflicts in DSG(D) are going downward in it too. So the chosen topological sort for DSG(A) is a topological sort for DSG(D) too. Apparently, there are no inconsistency issues concerning 'decision set' versions since they are all derived from tuple versions. Other conditions in type D of the Serializability Theorem are also satisfied. So we can apply it here and claim equivalence as specified in type D of the Serializability Theorem, and what is left is to show equivalence as specified in type A.

Type D of the Serializability Theorem tells us that transactions in H and Hs give rise to the same set of operations. We claim that in the type A context, transactions in H and Hs give rise to the same set of operations too.

For tuple writes on a specific tuple, let's assume the first tuple version generated to be v in H. If the corresponding version v' in Hs turned out NOT to be the first one, then there must be a version u' in Hs which happened to be the first one. Its corresponding version u in H had to be one that was after v. Let Tu be the transaction that generated u and u' and Tv be the transaction that generated v and v' in both H and Hs. Then in H, Tu must appear below Tv in the topological sort since there was a list of transactions which started with Tv and ended with Tu, such that there was a WW conflict between any two consecutive transactions in the list. But this contradicts with the fact that Tv must appear below Tu in Hs since Hs is serial. So the first tuple version in H coincides with that in Hs. For the rest of the tuple version order, we may use similar arguments and induction to show that H and Hs will give rise to the same tuple version order.

For tuple reads of a specific tuple, suppose transaction Ti in Hs reads the kth tuple version after the initial state of the database, k>=0(k=0 means it is from the original state of the database). We claim Ti in H reads the same tuple version too. If instead Ti in H read an earlier one than the kth version, assuming that the kth version was written by a different transaction Tj, we would have a list of transactions which started with Ti and ended with Tj, such that there was a RW or WW conflict between any two consecutive transactions in the list. So Ti must appear before Tj in the topological sort, but this contradicts with the fact that Ti reads the kth tuple version in Hs since from the last paragraph we know that Tj generated the kth version in Hs too and Ti in Hs appeared before it in the serial order. In the case Ti in H read a later one than the kth version, we may assume the version Ti read from was generated by Tl. The WR conflict between Tl and Ti would imply that Tl must appear before Ti in the topological sort. By the analysis of the last paragraph, Tl would generate the same version after the kth one in Hs too. But then in Hs, Ti would've read a version after the kth one too since Ti also appeared after Tl in Hs. This contradicts with what we started with. So tuple reads return the same tuple versions in H and Hs.

Let's now look at the values read from or written to a specific tuple for corresponding tuple versions in H and Hs. If the values read are from the original database state, of course they coincide. For the first tuple version written, its committed value is obtained by applying a sequence of field writes to the initial internal version. The sequences of field writes in both histories coincide by type D of the Theorem. So as long as we can show that the initial internal versions in both histories coincide, we may conclude that the first tuple versions coincide. The initial internal version is from a tuple read before the first write on the tuple since that is what we assume in type A of the Theorem. This tuple read can't read a tuple version later than the first one. That is because if it did, the write of this later version 'happened before' the read, which 'happened before' the write of the first version. This contradicts with the fact that the version order coincides with the 'happened before' partial order. So the values of the first tuple version coincide in H and Hs. Inductively, we may assume that the kth tuple version values coincide, and the reads before the kth version also coincide, for k>0. Then a similar argument as in the initial induction step can be used to conclude that tuple reads and values written coincide in both H and Hs.

From type D of the Serializability Theorem, we've shown that corresponding predicate reads in H and Hs use 'decision set' versions from the same neighborhood of match change. Observe that there is a one-to-one correspondence between the set of neighborhoods of match change in type A and that of type D. Notice that the number of versions within corresponding neighborhoods may not be the same since in the type A case a neighborhood may contain versions generated by updates that don't write the 'decision set' at all. In other words, a neighborhood of match change in type A is formed by replacing the 'decision set' versions of its type D counterpart with corresponding tuple versions, plus the aforementioned tuple versions. From this construction of the neighborhood of match change, the fact that corresponding predicate reads in H and Hs use versions from the same neighborhood of match change in the type A case is also clear.

As claimed, we've proved (i) in type A of the Serializability Theorem and proof of (ii) is the same as in the original proof of type A.


	                                                                                                       ## 
	  

The choice of a topological sort based on DSG(A) is important in this proof and we might be tempted to choose one based on DSG(D). The following simple example tells us why we should not.

Example 26: Consider DSG(A): T1 → T2 with only one WW conflict, from updates NOT on the same field. DSG(D) in this case is an empty graph, so we could choose T2, T1 as the order in a serial history if the decision were based on DSG(D). And the original history is NOT equivalent to the serial history {T2, T1}.
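Example 26 can be checked mechanically; the helper below is illustrative:

```python
# A sketch of Example 26: DSG(A) has the edge T1 -> T2, DSG(D) is empty.
def is_valid_topological_order(order, edges):
    """True iff every edge (u, v) goes downward in the given order."""
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)

dsg_a = [("T1", "T2")]   # one WW conflict on the tuple
dsg_d = []               # the updates touch different fields: no edge

# Based on DSG(D) alone, the order T2, T1 looks fine ...
assert is_valid_topological_order(["T2", "T1"], dsg_d)
# ... but it violates the tuple-level conflict recorded in DSG(A).
assert not is_valid_topological_order(["T2", "T1"], dsg_a)
# A sort chosen from DSG(A) works for both graphs, as the proof requires.
assert is_valid_topological_order(["T1", "T2"], dsg_a) and \
       is_valid_topological_order(["T1", "T2"], dsg_d)
```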


	                                                                                            ## 
	  

From the relations between the four types of the Serializability Theorem, we know that type D is the most generic one. This is not a surprise since in type D we look at things at the field level – the finest granularity that is possible. It should be able to explain types that are of coarser granularity.

4.5 Type H of the Serializability Theorem

Although finer control of conflicts can be achieved in type D of the Serializability Theorem, more resources and overhead may be necessary. For example, in a typical lock manager, usually one control block is necessary for each lock. Hence if we implement locks at the field level, depending on the implementation, a lot of extra memory and maintenance might be needed. On the other hand, in type A of the Theorem, unnecessary conflicts are introduced while resources and overhead might leave a more reasonable footprint. So is there a way to combine the merits of both approaches? The previous proof actually suggests an affirmative answer for applications accessing more than one table.

The key observation is that any conflict between a pair of transactions is actually interpreted in a specific table. This means we only need information like which tuple or field a transaction writes, or the range of tuples that a predicate read covers in a specific table to decide whether a conflict exists. Hence we may interpret these conflicts on a per table basis. In other words, we may interpret conflicts at tuple level for one table and do that at field level for another. This gives rise to the first hybrid Serializability Theorem as follows:

Serializability Theorem(type H, H for hybrid): Given the way to interpret conflicts described above, and let H be a history with the 'happened before' partial order satisfying Conditions 1, 2', 3', 4 and proscribing Conditions 5 & 6, where the predicate-based conflicts are based on Definitions 3' & 4'. Then H proscribes Condition 7 iff it is serializable.

Proof: Only the 'only if' part requires a proof. The DSG is a hybrid one: there are both tuple-based and field-based conflicts inside. Acyclicity of this DSG allows us to choose a topological sort. Interpreting those tuple-based conflicts at the field level gives us a new DSG which is still acyclic and the chosen topological sort is still a qualified one for this new DSG since each tuple-based conflict either gives rise to one field-based one or just disappears in this process. We then apply type D of the Serializability Theorem to it and the rest goes like in the proof where we use type D of the Serializability Theorem to prove type A of it.


	                                                                                                       ## 
	  

So the next question naturally is: on what tables should we use tuple-based/field-based conflict resolution? We introduce a concept of 'logical group' within a tuple to answer it.

Definition: A logical group within a tuple consists of fields that can be accessed independently.

Some fields in a tuple must co-vary with others and they should be put into one logical group. For example, consider the customer table in the TPCC application. There is a group of fields related to the customer's address info and there is also a group of fields related to the customer's credit info. These two can be viewed as two different logical groups. That is because in a real world warehouse application, after the tuple's initial insertion, these two groups are probably modified by two independent transactions. If we use field-based conflict resolution on the customer table, these two transactions don't have to conflict with each other at the lock level in a lock-based system; on the other hand if tuple-based conflict resolution is used, they are guaranteed to conflict with each other at the lock level.

So the answer to our question is: if there is more than one logical group in a table, consider using field-based conflict resolution; on the other hand, if there is only one logical group within a table, tuple-based conflict resolution is enough. Notice that if fields in a key in a table are only used in a predicate read, they don't constitute a logical group, otherwise they do. Also, logical groups are application dependent. For example, if TPCC turns out to be a warehouse application for a postal service, the zip code field and other fields in the address might be modified independently by two transactions since the postal service might change the zip code for a specific address, and they constitute two logical groups in that case.
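The rule above can be sketched as follows; the TPCC-style field names and groupings are assumptions for illustration:

```python
# Illustrative sketch: picking the conflict-resolution granularity per table
# from its logical groups. The groupings are assumptions, loosely modeled on
# the TPCC customer table discussed above.
def resolution(logical_groups):
    """More than one logical group -> field-based; otherwise tuple-based."""
    return "field-based" if len(logical_groups) > 1 else "tuple-based"

customer_groups = [
    {"c_street_1", "c_city", "c_zip"},         # address info
    {"c_credit", "c_credit_lim", "c_balance"}, # credit info
]
assert resolution(customer_groups) == "field-based"
assert resolution([{"i_price", "i_name"}]) == "tuple-based"
```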

Now that type D of the Serializability Theorem allows us to discuss consistency fully down to the field level, does it have any advantage over its peer, say, type B of it? The discussion about logical groups certainly gives us an affirmative answer. In fact, there is one such example in the TPCC application: in the payment transaction the field D_YTD is updated while in the new-order transaction the field D_NEXT_O_ID is updated; but there are no consistency-related conflicts between these two transactions and these two updates don't introduce lock waits in the type D setting. A similar example can be given in the case that a read and a write on different fields are from different transactions as well.

5. Generic generalization of Serializable Snapshot Isolation to a distributed database system

When I started this work, there were only two serializability implementations: 2PL like MySQL InnoDB's, and PostgreSQL's Serializable Snapshot Isolation. And it's been a dream to generalize both to a distributed database system so that we can solve larger problems. I've successfully mounted a lock-based serializability implementation to NDB Cluster in Sections 2 & 3, with the framework developed in Section 1. One major difficulty of generalizing 2PL is its thrashing behavior under heavy load. I address this with the Generalized Serializability Theorems and provide a partial solution: it works only if the consistency of enough reads can be sacrificed. It is still a good trade-off since these inconsistencies are only transient and it is safe as long as one doesn't take action when they are spotted.

Since the framework in this article works on a generic Read-Committed isolation level, including Snapshot Isolation for a field-based system we've developed in Section 4, this makes the other part of the dream possible. The major difficulty of this generalization is to develop a distributed clock for the system. In this section, I'll explore a so-called 'Clock Condition' to make the key arguments in [Fe 05] go through, so that we'll actually have various Serializable Snapshot Isolation implementations with differing distributed clocks. This is what 'generic' means in the title of this section.

5.1 Making the generalization generic

The following discussion is basically that of [Fe 05] between Remark 2.1 and Theorem 2.1, with two major differences.

The first one is that we assume the distributed clock, whose reading is designated as t.e for event e, satisfies the following:


        Clock Condition: e “happened before” f => t.e < t.f,                                                                                        
	

where the inequality is from a total order.

Remark: I came across the Clock Condition in Lamport's classic paper ([La 78]) about distributed clocks. It was stated only for the logical clock defined there. I extend the idea here to a generic distributed clock. In the Clock Condition, we use “happened before” instead of 'happened before' as the rest of this article does. This “happened before” partial order is for events designated as points in time, not as intervals as in the 'happened before' partial order, so that we may associate a timestamp to each event. This way of capturing the temporal order in a reference frame is discussed in detail in Appendix B and the subsection about CockroachDB in Appendix D: events inside a thread are represented as a sequence of points in time and two events connected by an inter-thread message are interpreted as a type II edge. And this potential mixed usage of “happened before” and 'happened before' is OK since 'happened before' is only used in type D of the Serializability Theorem to assert that a non-serializable history contains a cycle in the proof of Theorem 2.1 below. After that, “happened before” is used for the rest of the arguments.

The second one is that we assume t.e could be equal to t.f for two different events e and f. In a standalone system, we can usually arrange the timestamp service to return an increasing sequence so that t.e is not equal to t.f, if e and f are not the same. In a distributed system, however, the distributed clock usually can't fulfill such a requirement. We must cope with it.

With the second assumption, we need to define two transactions T1 and T2 to be concurrent when: timestamp of T1's start <= timestamp of T2's commit and timestamp of T1's commit >= timestamp of T2's start; in the case the timestamp service returns an increasing sequence, it is always safe to use the strict inequality signs '<' and '>'.
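A minimal sketch of this concurrency test, where the timestamps stand for readings of a clock satisfying the Clock Condition:

```python
# Sketch of the concurrency definition above (function name is illustrative).
def concurrent(t1_start, t1_commit, t2_start, t2_commit):
    # The intervals may share endpoints because distinct events can carry
    # equal timestamps in a distributed system.
    return t1_start <= t2_commit and t2_start <= t1_commit

assert concurrent(1, 5, 5, 9)       # touching endpoints still count
assert not concurrent(1, 4, 5, 9)   # T1 completely precedes T2
```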

For a WR conflict from T1 to T2, we know that timestamp of T1's commit < timestamp of T2's start from the definition of Snapshot Isolation. So if there is a WR conflict between T1 and T2, timestamp of T1's start < timestamp of T1's commit(Clock Condition) < timestamp of T2's start < timestamp of T2's commit(Clock Condition).

If the conflict between T1 and T2 is a WW one, by First-Committer-Wins rule or First-Updater-Wins rule, we have timestamp of T1's start < timestamp of T1's commit(Clock Condition) < timestamp of T2's start < timestamp of T2's commit(Clock Condition). Notice the second inequality can't be the other way around(timestamp of T2's commit < timestamp of T1's start) since then we would have the following timestamp loop: timestamp of T2's commit < timestamp of T1's start < timestamp of T1's conflicting write < timestamp of T2's conflicting write < timestamp of T2's commit by the Clock Condition.

If there is a RW conflict between T1 and T2, it is possible that T1 completely precedes T2 or that T1 and T2 are concurrent, since if T2 completely preceded T1, T1 would be able to read T2's write, which contradicts with the definition of a RW conflict. But in either case, we have timestamp of T1's start <= timestamp of T2's commit. We've just proved the following Lemma 2.2 in [Fe 05] with a distributed clock satisfying the Clock Condition.

Lemma 2.2 in [Fe 05]: In a history executed under Snapshot Isolation, if there is a conflict between T1 and T2, then timestamp of T1's start <= timestamp of T2's commit.

From the discussion above, the following Lemma 2.3 in [Fe 05] is also clear.

Lemma 2.3 in [Fe 05]: In a history executed under Snapshot Isolation, if we know that there is a conflict from T1 to T2, and that T1 and T2 are concurrent, we can conclude that it must be a RW conflict.

Now we are ready for the proof of the major theorem in [Fe 05], with a distributed clock satisfying the Clock Condition this time.

Theorem 2.1 in [Fe 05]: Suppose H is a multi-version history produced under Snapshot Isolation that is not serializable. Then there is at least one cycle in the serialization graph DSG(H) and we claim that in every cycle there are three consecutive transactions T1, T2, T3(where it is possible that T1 and T3 are the same transaction) such that T1 and T2 are concurrent and T2 and T3 are concurrent.

Proof: The existence of such a cycle is asserted by type D of the Serializability Theorem. Take an arbitrary cycle in DSG(H), and let T3 be the transaction in the cycle with the earliest commit time; let T2 be the predecessor of T3 in the cycle, and let T1 be the predecessor of T2 in the cycle. Suppose for the sake of contradiction that T2 and T3 are not concurrent; then either timestamp of T2's commit < timestamp of T3's start(but then this is before timestamp of T3's commit by the Clock Condition, contradicting with the choice of T3 as the earliest committed transaction in the cycle), or timestamp of T3's commit < timestamp of T2's start(but this contradicts with the presence of an edge from T2 to T3 by Lemma 2.2). Thus we've shown that T2 must be concurrent with T3. Hence timestamp of T2's start <= timestamp of T3's commit and timestamp of T3's start <= timestamp of T2's commit.

Now suppose for the sake of contradiction that T1 was not concurrent with T2. Then either timestamp of T1's commit < timestamp of T2's start(<= timestamp of T3's commit by the last sentence of the prior paragraph, contradicting with the choice of T3 as earliest committed transaction in the cycle), or timestamp of T2's commit < timestamp of T1's start(contradicting with the presence of an edge from T1 to T2 by Lemma 2.2). Thus we've shown that T1 must be concurrent with T2.


	                                                                                                       ## 
	  

So as long as a distributed clock satisfies the Clock Condition, we may apply Theorem 2.1 in [Fe 05] to an application executed under the Snapshot Isolation defined with this distributed clock. In particular, this implies we may apply Cahill's work ([Ca 08]) to it to achieve serializability.
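As a sketch of the essential test in [Ca 08] (not Cahill's actual implementation; the names are mine): each transaction keeps two flags recording inbound and outbound RW conflicts with concurrent transactions. By Theorem 2.1, every non-serializable execution contains a transaction with both flags set, so aborting such a 'pivot' achieves serializability.

```python
# Sketch of Cahill-style conflict flags; all names are illustrative.
class Txn:
    def __init__(self, name):
        self.name = name
        self.in_conflict = False   # some concurrent T has a RW edge into us
        self.out_conflict = False  # we have a RW edge into some concurrent T

def record_rw_conflict(reader, writer):
    """Called when concurrent 'reader' has a RW conflict into 'writer'.
    Returns the transaction that must abort, if any."""
    reader.out_conflict = True
    writer.in_conflict = True
    for t in (reader, writer):
        if t.in_conflict and t.out_conflict:
            return t  # the pivot T2 of Theorem 2.1: abort it
    return None

t1, t2, t3 = Txn("T1"), Txn("T2"), Txn("T3")
assert record_rw_conflict(t1, t2) is None  # T1 -> T2 alone is harmless
assert record_rw_conflict(t2, t3) is t2    # T2 now has both flags: abort
```

This over-approximates the dangerous structure (it may abort histories that are in fact serializable), which is the price paid for avoiding full cycle detection.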

5.2 List of distributed clocks satisfying the Clock Condition

We'll discuss a few distributed clock candidates satisfying the Clock Condition here, hoping that it will help you choose one for your specific implementation of Serializable Snapshot Isolation in a distributed database system.

5.2.1 HLC

The fact that HLC satisfies the Clock Condition is demonstrated in Appendix D when we survey CockroachDB.

5.2.2 Lamport's logical clock

The following algorithm is one way to implement the logical clock proposed by Lamport in [La 78].


        Initially l.j := 0

        Send or local event
          l.j := l.j + 1
          Timestamp with l.j

        Receive event of message m
          l.j := max(l.j+1, l.m+1)
          Timestamp with l.j
                                                                                               
	  

Here subscript j represents an arbitrary process participating in this clock protocol, and subscript m represents the process from which the message (also denoted m) is sent. As we can see, this is the skeleton of the logical part of HLC, and it satisfies the Clock Condition. This covers the single-threaded scenario of the good old days.
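As a minimal sketch, the single-threaded algorithm above can be written in Python like this (the class and method names are my own, not from [La 78]):

```python
class LamportClock:
    """Skeleton of the logical part of HLC: a single-process Lamport clock."""

    def __init__(self):
        self.l = 0  # l.j in the pseudo-code above

    def local_or_send_event(self):
        # l.j := l.j + 1; timestamp the event with l.j
        self.l += 1
        return self.l

    def receive_event(self, l_m):
        # l.j := max(l.j + 1, l.m + 1); timestamp the event with l.j
        self.l = max(self.l + 1, l_m + 1)
        return self.l
```

If an event e on one process 'happened before' an event f on another process via a message, the receive rule guarantees timestamp(e) < timestamp(f), which is exactly the Clock Condition.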

For today's multi-threaded scenario, we should use the variant in which events inside a thread are represented as points in time and two events connected by an inter-thread message are interpreted as a type II edge, as described in Appendix B and in the section about CockroachDB in Appendix D. The argument that this variant also satisfies the Clock Condition is similar to the single-threaded case.

The advantage of a logical clock like Lamport's is the absence of a physical clock. [Ki 13] discusses some of the caveats of a physical clock and CockroachDB's approach described in Appendix D can mask most of them.

The disadvantage of a logical clock like Lamport's is also the absence of a physical clock. Let's demonstrate this with an example. For simplicity, we consider a distributed database system with only one thread – the transaction coordinator (TC) – in each node, which implements Snapshot Isolation with Lamport's logical clock providing the timestamp service. Now suppose the system is busy, but one of the transaction coordinators, say TC1, somehow hasn't been assigned any transaction and hasn't communicated with the other TCs (say, none of them reads or writes the data in the node where TC1 resides) for a period of time. The consequence is that TC1's timestamps will lag behind its peers', since Lamport's clock only advances with local events and received messages, and TC1 is lacking both. So if TC1 now starts a transaction with timestamp 20 assigned to its start event while its peers have already advanced their timestamps past 2000, then those TCs must retain at least all versions created between timestamps 19 and 2000, since TC1's reads might need them. This could put pressure on the versioning system: versions take too long to become outdated, and the system might run out of space. In a system where a physical clock plays a major role in the distributed clock, as in HLC, this example may not show up as long as the physical clock is relatively accurate.

Such an example is, of course, at best contrived, because it needs an extreme design principle to surface: the set of timestamped events is minimized so that the timestamp service is not stressed. One way to resolve this issue is to set up a heartbeat mechanism and timestamp the heartbeat messages. Another alternative is to timestamp the garbage collection messages. Either way, the timestamps of all threads will be kept in close proximity, and versions will become outdated and garbage collected in time. Nonetheless, this example still alerts us that care must be taken if we were to use Lamport's clock in our implementation.
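A sketch of the heartbeat fix under these assumptions (Node and broadcast_heartbeat are hypothetical names, not part of any real system):

```python
class Node:
    """A process whose Lamport clock also advances on heartbeat messages."""

    def __init__(self):
        self.clock = 0

    def tick(self):
        self.clock += 1
        return self.clock

    def receive(self, ts):
        self.clock = max(self.clock + 1, ts + 1)

def broadcast_heartbeat(sender, nodes):
    # Timestamp the heartbeat at the sender and deliver it to every peer;
    # idle peers are pulled up to close proximity with the sender's clock.
    ts = sender.tick()
    for n in nodes:
        if n is not sender:
            n.receive(ts)
```

In the contrived example above, one heartbeat from a busy TC is enough to pull the idle TC1's clock from 20 to within one tick of its peer.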

5.2.3 An increasing sequence

In Appendix D, we can see that OceanBase uses an increasing sequence as a clock to service each tenant, the name of a Paxos group in OceanBase's nomenclature. Although a tenant could still be distributed across multiple nodes, it may be only a tiny part of an OceanBase cluster. There is no chance for this service, capped at about 2M timestamps/s, to satisfy all the needs of a global Snapshot Isolation implementation for a large cluster. OceanBase has to take a step back by using one thread to generate such a service per tenant, which results in a Snapshot Isolation implementation per tenant. Standalone Snapshot Isolation implementations like those in Oracle, Microsoft SQL Server, etc. use physical clocks to provide the timestamp service. A physical clock is practically an increasing sequence if it always ticks forward (or if we make it appear to).

An increasing sequence used as a timestamp service in a standalone database system usually satisfies the Clock Condition. For example, if the thread that requests a timestamp and the system process that services it are connected by the system bus or a dedicated bus, then requests are served in FIFO order and the Clock Condition is guaranteed by the monotonicity of the timestamp algorithm. But a generalization of it to a distributed system may not be. This is demonstrated in Appendix D when we discuss TiDB.
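A minimal sketch of such a standalone timestamp service, with a lock standing in for the FIFO bus (all names are mine):

```python
import itertools
import threading

class TimestampOracle:
    """A strictly increasing sequence serving as a standalone timestamp service."""

    def __init__(self):
        self._lock = threading.Lock()       # serializes requests, like the FIFO bus
        self._counter = itertools.count(1)  # the increasing sequence itself

    def next_ts(self):
        with self._lock:
            return next(self._counter)
```

Because requests are serialized, a request that 'happened before' another always receives the smaller timestamp, which is the Clock Condition for this standalone case.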

5.2.4 'Happened before'

If you feel uncomfortable using more than one way to interpret the temporal order in a distributed system, there is the following work-around that just uses the 'happened before' partial order as in the rest of this article. To let the partial order in the Clock Condition be the 'happened before' partial order, the right-hand side of the Clock Condition can be changed to End(e) < Begin(f). Then all the arguments in the last subsection go through if we use End(e) as the timestamp of e, for each event e. Notice that, as described in Appendix B, those arguments remain correct even if the Monotonicity Condition is violated by the physical clock from which End(e) is taken, because all the events in the arguments are carefully chosen.

Although in principle we may use physical clocks in the participating nodes as a distributed timestamp service for a Snapshot Isolation implementation, it is not recommended. From the discussion of CockroachDB in Appendix D, we know that there are caveats([Ki 13]) with such a service. Without something like the logical part of HLC to mask them, we have to deal with them one by one. And that really opens a can of worms.

5.3 Alternatives for a field-based Snapshot Isolation implementation

In Section 4, we've shown that type D of the Serializability Theorem can be applied to a field-based Snapshot Isolation level to conclude serializability if an application is free of conflict loops. Let's see how such a field-based Snapshot Isolation level can be implemented. We'll explore two alternatives. The first is designed to minimize the cost of adapting a current tuple-based Snapshot Isolation implementation into a field-based one. The second is optimized to minimize the storage footprint of the database.

5.3.1 Alternative one

In this alternative, we assume a tuple-based Snapshot Isolation implementation in a distributed system is already in place. So let's first explore how to implement such a system.

We start by looking at how to implement tuple-based Snapshot Isolation in a standalone system. A typical implementation works like this: a transaction T starts with a start timestamp ts and performs reads based on the snapshot defined by ts; writes are cached locally until the commit logic starts. Two-Phase-Commit then begins the query phase by pre-writing the cached writes to their destinations; the First-Committer-Wins rule is used to detect write conflicts, and a participant responds with a 'no' if T is chosen to be aborted by this rule. If all participants respond with a 'yes' to the query, the commit phase is started with a commit timestamp tc for the participants to finalize the writes; otherwise T is rolled back. Alternatively, if the First-Updater-Wins rule is used to detect write conflicts, writes are not cached until commit; instead, locks are requested and writes are performed once they are acquired.
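A toy sketch of this query/commit flow with the First-Committer-Wins rule (the class and the shape of the concurrency test are my simplifications; a real system would also handle crashes and locking):

```python
class Participant:
    """Holds committed versions per key: key -> commit timestamp of newest version."""

    def __init__(self):
        self.latest_commit = {}  # key -> commit timestamp of the newest version
        self.prewrites = {}      # key -> pending value from the query phase

    def prepare(self, key, value, start_ts):
        # First-Committer-Wins: vote 'no' if a concurrent transaction
        # (one that committed this key at or after our start) already won.
        if self.latest_commit.get(key, -1) >= start_ts:
            return False
        self.prewrites[key] = value
        return True

    def commit(self, key, commit_ts):
        self.latest_commit[key] = commit_ts
        del self.prewrites[key]

    def rollback(self, key):
        self.prewrites.pop(key, None)

def two_phase_commit(writes, start_ts, commit_ts):
    """writes: list of (participant, key, value). Returns True iff committed."""
    prepared = []
    # Query phase: pre-write everywhere and collect the votes.
    for p, key, value in writes:
        if p.prepare(key, value, start_ts):
            prepared.append((p, key))
        else:
            for q, k in prepared:  # any 'no' rolls the transaction back
                q.rollback(k)
            return False
    # Commit phase: finalize the writes with the commit timestamp.
    for p, key in prepared:
        p.commit(key, commit_ts)
    return True
```

For example, a transaction that started before another transaction's commit on the same key is concurrent with it and is rolled back by the rule, while one that started afterwards commits normally.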

Remark: Although the First-Committer-Wins rule and the First-Updater-Wins rule are considered equivalent, since they both roll back one transaction among two conflicting ones, today's commercial databases usually use the First-Updater-Wins rule to detect write conflicts, because NOT waiting until commit to make the abort decision avoids performing unnecessary operations. Unfortunately, applying type D of the Serializability Theorem to a system employing the First-Updater-Wins rule requires a field level capable locking system. So we'll delay its discussion until the next section, when a field level capable locking system is ready, and focus on the First-Committer-Wins rule in this section.

For such an implementation with a distributed clock, we need to worry about the following situation: when a read request from transaction T1 comes in with start timestamp t1, it's supposed to read the latest version before t1 under Snapshot Isolation. But what if this latest version, from transaction T2 with commit timestamp t2 < t1, hasn't come in yet? With a distributed timestamp algorithm like HLC, it is in general possible that t1 and t2 are assigned by different nodes, and the update by T2 may not have propagated to its destination when T1 tries to read it.

For an implementation with a distributed clock satisfying the Clock Condition, it turns out the possibility of reading a stale version of data can be eliminated if a span of waiting is introduced. Let's assume that we can use a shared semaphore s to control access to all the tuple versions of a tuple. For example, if we design the implementation so that all those versions are placed in one page, we may use the semaphore of that page as s; alternatively, if the versions span multiple pages in, say, a B-tree, we can view the collection of semaphores for all these pages as one logical semaphore and acquire them all before accessing a tuple version in these pages. We prove this by contradiction: suppose the latest write hadn't come in yet when the read in T1 happened. Then we would have t1 < the timestamp at which T1 acquires s (shared mode) for the read < the timestamp at which T2 acquires s (exclusive mode) for the write < t2, which contradicts t2 < t1. Here we require the commit timestamp t2 to be assigned after T2 finishes its write of the updated tuple, for the last inequality to hold. For example, if we use the First-Committer-Wins rule as in the typical implementation, such a write is a pre-write and semaphore s (exclusive mode) is acquired before it. In this case, the requirement can certainly be fulfilled: all we have to do is hold s until the commit timestamp is assigned and attached to the updated tuple. And that is the span of time we have to wait. So now T1 acquires semaphore s after T2 does, which is after T2's commit timestamp is written to the newly generated tuple version; in any reasonable implementation, T2's update is then visible to T1. Notice that all three inequalities in the previous argument follow from the Clock Condition. This mechanism needs to be applied to both item reads and predicate reads so that they are up-to-date.
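The waiting span can be sketched like this, with a plain mutex standing in for semaphore s and the commit timestamp assigned while s is still held (all names are mine):

```python
import threading

class VersionedTuple:
    """All versions of one tuple, guarded by a single lock standing in for s."""

    def __init__(self):
        self.s = threading.Lock()  # semaphore s, simplified to a mutex
        self.versions = []         # (commit_ts, value), ascending by commit_ts

    def prewrite_and_commit(self, value, assign_commit_ts):
        # Hold s from the pre-write until the commit timestamp is assigned
        # and attached to the new version -- the span of waiting.
        with self.s:
            tc = assign_commit_ts()
            self.versions.append((tc, value))
            return tc

    def snapshot_read(self, start_ts):
        # Acquire s before reading; by the Clock Condition, any write with a
        # commit timestamp < start_ts has already attached it and released s.
        with self.s:
            visible = [(tc, v) for tc, v in self.versions if tc < start_ts]
            return visible[-1][1] if visible else None
```

A reader with start timestamp t1 then always sees the latest version committed before t1 and never a stale one.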

This works fine if the tuple-based Snapshot Isolation implementation has only one copy of the data. Contemporary distributed database implementations usually provide high availability through replication. This introduces an extra challenge for implementing Snapshot Isolation: when a read request is directed to a secondary replica, the latest update of the requested data in the primary replica may not have propagated to this specific secondary replica yet. For example, in NDB Cluster's synchronous replication implementation, an update in the primary replica is ready for reading in the commit phase, while the same update in a secondary replica is still locked and won't be ready until the complete phase. If the Snapshot Isolation implementation employs a similar technology, the rule in Snapshot Isolation that a transaction reads the latest update before its start timestamp may be broken when a read request is directed to a secondary replica. There are two ways to overcome this obstacle: direct reads to the primary replica only, as in NDB Cluster, or let the read request wait until the secondary replica is up-to-date, as in Google's Spanner ([Co 13]) and TiDB. We will talk more about the latter approach when we discuss the First-Updater-Wins rule in the next section and in Appendix D when we review TiDB.

So the snapshot read of this distributed tuple-based Snapshot Isolation has been taken care of. Let's see what happens when two concurrent transactions update the same tuple. We assume these two transactions are T1 and T2, which request and acquire semaphore s (exclusive mode) in that order, and that their start and commit timestamps are ts1, tc1 and ts2, tc2 respectively. Two situations arise when T2 requests s: T1 either still holds s or has already released it. In the first situation, T2 is rolled back by the First-Committer-Wins rule. In the second situation, if we also want to apply the First-Committer-Wins rule, s can't be released immediately after T1 commits. That may not be a good idea, since we have no intention of turning s into a long lock. So we take another approach: as long as we discover ts2 <= tc1, we may roll back T2. That works because ts1 < the timestamp at which T1 releases s < the timestamp at which T2 acquires s < tc2 by the Clock Condition. Hence T1 and T2 are concurrent in the second situation if and only if ts2 <= tc1. But since T1 has already committed, we have to roll back T2. This way, one of two concurrent transactions updating the same tuple is rolled back in either situation.

Now that we have a distributed tuple-based Snapshot Isolation implementation, we may derive field versions from the tuple versions so that we have a field-based system. We then capture 'decision set' versions as we've demonstrated in Section 4. And without any surprise, these 'captured' 'decision set' versions are the same as the derived ones. Reads (field reads and predicate reads) behave the same as in the tuple-based implementation. Two concurrent transactions updating non-overlapping write sets, however, should be allowed. This can be accomplished by attaching to each tuple version a bitmap indicating which fields were modified. When the late comer among the two concurrent transactions comes in, it first requests semaphore s (exclusive mode); it then compares its write set with the bitmap of the altered or about-to-be-altered tuple version. If that comparison indicates a non-overlapping situation, the transaction proceeds (or waits for s, respectively); otherwise it is rolled back. This modification of the update logic takes care of both the overlapping and non-overlapping cases for two concurrent transactions updating the same tuple. On top of this field-based Snapshot Isolation implementation, we may apply type D of the Serializability Theorem to achieve serializability.
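The bitmap comparison can be sketched as follows, with the field-to-bit mapping taken from the table schema (the helper names are hypothetical):

```python
def fields_to_bitmap(fields, schema):
    """Encode the set of modified field names as a bitmap over the schema."""
    bitmap = 0
    for f in fields:
        bitmap |= 1 << schema.index(f)
    return bitmap

def write_sets_overlap(bitmap_a, bitmap_b):
    # The late comer is rolled back iff its write-set bitmap shares a bit
    # with the bitmap attached to the concurrent transaction's tuple version.
    return (bitmap_a & bitmap_b) != 0
```

Two concurrent updates to disjoint fields of the same tuple produce disjoint bitmaps and may both commit; any shared field triggers the rollback.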

However, whether such an implementation satisfies the definition of Snapshot Isolation in Section 4 is still debatable. Alice may argue: although you call s a semaphore, it is held until the end of a transaction and covers at least a page, so it is a long page lock; there goes the requirement in the definition of Snapshot Isolation in Section 4 that two transactions writing the same tuple can be concurrent if the intersection of their write sets is empty. Bob, on the other hand, has the following opinion: the span of time s is held by a writing transaction is roughly one network round-trip, which we may call 'transient' in many cases; and even two concurrent transactions writing non-intersecting write sets on a tuple must contend for the semaphore of that tuple's containing page.

If you choose to stand with Alice, you need to wait until the next section, when we implement a field level capable locking system. In this section we'll pretend Bob has a point and see how far we can go.

In this distributed Snapshot Isolation implementation, there are mainly three alterations to the typical standalone implementation employing the First-Committer-Wins rule: a distributed clock satisfying the Clock Condition; a semaphore that covers all tuple versions of each tuple and, when acquired for a write, is held until the end of the transaction; and a bitmap in each tuple version that facilitates concurrent transactions updating non-overlapping write sets. Notice that this semaphore needs to wait for the commit decision from a transaction coordinator before being released, which in turn could wait for another semaphore to be acquired. This could result in a deadlock situation, which may be resolved by a time-out mechanism as usual. Also, in this field-based Snapshot Isolation implementation the tuple concept, which corresponds to a row in a table, still exists. A tuple version doesn't exist logically, but it does exist physically.

5.3.2 Alternative two

In this alternative, we assume a database setup similar to TiDB's as described in Appendix D. In other words, each row in a table is represented as a key-value pair, where both key and value are byte streams of arbitrary size. Then only the exact amount of space is needed from the underlying storage.

MVCC is implemented as follows: for each tuple, a base tuple version is maintained with the smallest timestamp; for each update on this tuple after that timestamp, only the values of updated fields are stored. So the version set for this tuple looks like the following:


        key_version1 → value(tuple version)

        key_version2 → value(field versions)

        key_version3 → value(field versions)

        ...... 
	  

Here a version indicator like 'version1' may well just be the timestamp at which its updating transaction commits, while the 'key' part of each key_version# is just the key of this tuple.

Combining both types of version, we should be able to approach the goal of minimizing the storage footprint of the database. Inside the value part, there is a header storing a bitmap that indicates which fields were updated when this version was generated. When a later transaction queries this tuple in an item read or a predicate read, the MVCC system first tries to service the read from the latest version before the transaction's start timestamp; if the requested field versions or 'decision set' version can't be serviced by this latest version alone (let's call it a miss), a preceding version is fetched, in the hope that the combined versions will be enough to accomplish the task; in the worst case, the base tuple version needs to be fetched for this purpose. Notice that in the item read case, a miss should rarely happen, since a logical unit is usually updated and read as a whole.
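A sketch of this read path, walking backwards through the version set until the requested fields are serviced (the data layout below is a simplification of the key_version# scheme above, and the function name is mine):

```python
def read_fields(versions, requested, start_ts):
    """versions: ascending list of (commit_ts, {field: value}); versions[0] is
    the base tuple version carrying every field, later entries carry only the
    updated fields. Walk backwards from the latest version before start_ts,
    stopping as soon as every requested field has been found; a miss on the
    newest visible version forces fetching the preceding ones."""
    result = {}
    for commit_ts, fields in reversed(versions):
        if commit_ts >= start_ts:
            continue  # not visible to this snapshot
        for f in requested:
            if f in fields and f not in result:
                result[f] = fields[f]
        if len(result) == len(requested):
            break  # serviced without reaching the base version
    return result
```

In the worst case the loop reaches versions[0], the base tuple version, which can always service whatever is still missing.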

Again we use the First-Committer-Wins rule and the semaphore s mechanism from alternative one to guarantee that a read sees the latest versions before the reading transaction's start timestamp. Other aspects, like guaranteeing that a replica serves up-to-date data and using a bitmap to allow concurrent transactions updating non-overlapping write sets, are also similar.

In the process of garbage collection, all the versions that are no longer needed, plus the version right after them, are collapsed into a new base tuple version. The rest of the versions remain the same.

With such an MVCC implementation, we may implement field-based Snapshot Isolation as defined in Section 4. On top of it, we may apply type D of the Serializability Theorem to achieve serializability. Notice this is the first demonstration that we DON'T need to build a field-based Snapshot Isolation system by deriving it from a tuple-based one. Of course, one may still build a tuple-based Snapshot Isolation system by first applying the updated field versions to the base tuple version in this MVCC implementation and then deriving from it a field-based Snapshot Isolation system; we'd still get the same thing. But this approach is at best clumsy. And now we have a more natural field-based Snapshot Isolation system for type D of the Serializability Theorem.

Remark: In the Snapshot Isolation for a field-based system defined in Section 4, we claim that each field version is logically attached with a timestamp. Physically, however, these timestamps can usually be coalesced so that only one is stored for multiple fields. This is demonstrated in both alternatives we've just presented.

In alternative two, tuple versions are almost completely abandoned, except for the base ones. Without type D of the Serializability Theorem, this kind of implementation may not seem natural at all. So by replacing type B of the Serializability Theorem with type D, we may not only improve concurrency but also elicit novel implementations of serializability.

5.4 Primary algorithm for the generalization

The consecutive RW conflicts between concurrent transactions in Theorem 2.1 form the so-called 'dangerous structure' in the nomenclature of Fekete's paper ([Fe 05]). Cahill's team then found a book-keeping way to keep track of these 'dangerous structures' when an application is executed under Snapshot Isolation in [Ca 08]. For every transaction T in the application, two boolean variables T.inConflict and T.outConflict are set respectively when a RW conflict comes into or goes out from T. They are both stored in the transaction record of T. Also, a new SIREAD lock is introduced to capture a RW conflict with a write lock under the First-Updater-Wins rule. No extra blocking is introduced, though, since a SIREAD lock is more like a flag indicating that a read has happened. A SIREAD lock is placed when an item read or a predicate read starts. The following is the primary algorithm in [Ca 08]:


      modified begin(T):

          existing SI code for begin(T)
          set T.inConflict = T.outConflict = false

      modified read(T, x):

          get lock(key=x, owner=T, mode=SIREAD)
          if there is a WRITE lock(wl) on x
              set wl.owner.inConflict = true
              set T.outConflict = true

          existing SI code for read(T, x)

          for each version (xNew) of x
          that is newer than what T reads:
              if xNew.creator is committed
                  and xNew.creator.outConflict:
                      abort(T)
                      return UNSAFE_ERROR
              set xNew.creator.inConflict = true
              set T.outConflict = true

      modified write(T, x, xNew):

          get lock(key=x, locker=T, mode=WRITE)

          if there is a SIREAD lock(rl) on x
              with rl.owner is running or commit(rl.owner) > begin(T):
                  if rl.owner is committed
                      and rl.owner.inConflict:
                          abort(T)
                          return UNSAFE_ERROR
                  set rl.owner.outConflict = true
                  set T.inConflict = true

          existing SI code for write(T, x, xNew)
          # do not get write lock again

      modified commit(T):

          if T.inConflict and T.outConflict:
              abort(T)
              return UNSAFE_ERROR

          existing SI code for commit(T)
          # release WRITE locks held by T
          # but do not release SIREAD locks  
	

Remark: In this section, instead of the First-Updater-Wins rule, we use the First-Committer-Wins rule to resolve write conflicts between concurrent transactions. Such an implementation doesn't require long locks. Instead, we'll develop a flag system, discussed in detail in the next subsection, to capture RW conflicts. In a flag system, we use a read flag in place of a SIREAD lock and a write flag in place of a write lock. But in this subsection, let's still hold on to the concepts of SIREAD lock and write lock so that we don't deviate too much from Cahill's work.

In the 'modified read' part of the algorithm, we can see that whenever there is a version written after the read, it is flagged as causing a RW conflict upon item x. So Cahill's team follows strictly how [Be 87] interprets RW conflicts: a read conflicts with a write on the same data item, even if there are a thousand other writes between them. This is absolutely unnecessary, of course, according to type D of the Serializability Theorem in this article: only the version written immediately after matters. So we can optimize the 'modified read' part as follows:


	  modified read(T, x) for item RW conflicts:

          if a version (xNew) of x right after what T intends to read exists:
              if xNew.creator is committed
                  and xNew.creator.outConflict:
                      abort(T)
                      return UNSAFE_ERROR
              set xNew.creator.inConflict = true
              set T.outConflict = true
          else:
              get lock(key=x, owner=T, mode=SIREAD)
              if there is a WRITE lock(wl) on x
                  set wl.owner.inConflict = true
                  set T.outConflict = true

          existing SI code for read(T, x)
	

So much for item RW conflicts. For predicate RW conflicts, we can't push down the acquisition of a SIREAD lock as in the optimized 'modified read' for item RW conflicts, since a granular lock usually covers a range, and after the identification of an IN or OUT operation that is causing a conflict, there might be other IN or OUT operations causing conflicts on other tuples in that range. The optimized 'modified read' for predicate RW conflicts in general is given as follows:


	  modified read(T, R) for predicate RW conflicts:

          get lock(range=R, owner=T, mode=SIREAD)
          if there is a WRITE lock(wl) on x in R
          generating the first version of match change
              set wl.owner.inConflict = true
              set T.outConflict = true

          existing SI code for read(T, R)

          for each version (xNew) of x in R
          that is the first version of match change after what T reads:
              if xNew.creator is committed
                  and xNew.creator.outConflict:
                      abort(T)
                      return UNSAFE_ERROR
              set xNew.creator.inConflict = true
              set T.outConflict = true
	

Here R is a range that covers the predicate read. In pessimistic technology, (multi-)granularity locking is usually employed to fence off conflicting writes from a predicate read. In MySQL InnoDB, R could be just what a bunch of 'gap locks' cover; in the serializability implementation I've developed for NDB Cluster, R could be just what a regular lock in the higher hierarchy covers if the application is well-structured; in TiDB, R could be the multiple Regions where the predicate read happens; in Cahill's prototype of Serializable Snapshot Isolation deployed in Berkeley DB, R could be just what a bunch of pages span. Optimistic technology may not employ locks, but the concept of the range R remains the same.

In a distributed database system, the components of R may be scattered across different nodes. We should probably replace R in the algorithm with its local component so that decisions can be made locally to boost performance. This could lead to false positives, though, since an update that moves a tuple from one component to another doesn't actually generate a conflict.

In Example 12, we discussed how the new incarnation of a tuple with the same primary key might cause confusion. In our context, say a transaction T1 reads the last visible version of an item, and the tuple is later deleted and re-inserted by transactions T2 and T3 respectively. If the system can't tell the difference between the two incarnations, it might identify a RW conflict between T1 and T3 instead of T1 and T2, and could cause an unnecessary abort of transaction T1 or T3, since the last visible version and the newly inserted version are consecutive in the versioning system (because a deletion generates the 'dead' version logically, but not necessarily physically). The solution we suggested in Example 12 was to attach timestamps of creation to the different incarnations. For the Snapshot Isolation we've built our serializability implementation upon, we have another approach: create and place a 'dead' version between these two versions so that the two version sets are isolated. This works for predicate RW conflicts too. Notice this may not be appropriate for a system that stores only one version in the database, like NDB Cluster: besides breaking the semantics, a garbage collection system that purges these 'dead' versions would also be needed. But in an MVCC system like Snapshot Isolation, such a garbage collector already exists; we just need to add the 'dead' versions to the set to be purged.
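A sketch of the 'dead' version approach: the deletion physically appends a dead marker, so the first version after a read lands on the correct incarnation (the sentinel and helper are my own names):

```python
DEAD = object()  # sentinel marking the 'dead' version a deletion creates

def delete_tuple(versions, commit_ts):
    # Physically materialize the logically 'dead' version so the version
    # sets of the two incarnations are isolated from each other.
    versions.append((commit_ts, DEAD))

def first_version_after(versions, read_ts):
    # Under type D, only the version written immediately after the one a
    # reader saw matters for an item RW conflict.
    for tc, v in versions:
        if tc > read_ts:
            return tc, v
    return None
```

With the dead marker in place, a reader of the old incarnation conflicts with the deleting transaction rather than with the later re-inserting one.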

Now let's look at the following optimized modified write(T, x, xNew) for item RW conflicts:


	  modified write(T, x, xNew) for item RW conflicts:

          get lock(key=x, locker=T, mode=WRITE)

          if there is a SIREAD lock(rl) on x
              with rl.owner is running or commit(rl.owner) >= begin(T):
                  if rl.owner is committed
                      and rl.owner.inConflict:
                          abort(T)
                          return UNSAFE_ERROR
                  set rl.owner.outConflict = true
                  set T.inConflict = true

          existing SI code for write(T, x, xNew)
          # do not get write lock again

          release SIREAD lock(rl) on x if present
	

Remark: The condition 'commit(rl.owner) > begin(T)' has been altered to 'commit(rl.owner) >= begin(T)' to adapt to the new definition of 'concurrent transaction' with a distributed clock as discussed in subsection 5.1.

From the discussion of the optimized modified read(T, x) for item RW conflicts, we know that this write must be the first one after x's read if a SIREAD lock for x exists. So they constitute an item RW conflict. After the write is completed, we may release the SIREAD lock (rl) on x, since we don't need it for a later write according to type D of the Serializability Theorem.

In the following optimized modified write(T, x, xNew) for predicate RW conflicts, we can't release the SIREAD lock after the write is completed, since other writes in the range might need it later. So the pseudo-code below looks similar to the original:


	  modified write(T, x, xNew) for predicate RW conflicts:

          get lock(key=x, locker=T, mode=WRITE)

          if there is a SIREAD lock(rl) on a range that writing of x is the first match change,
              with rl.owner is running or commit(rl.owner) >= begin(T):
                  if rl.owner is committed
                      and rl.owner.inConflict:
                          abort(T)
                          return UNSAFE_ERROR
                  set rl.owner.outConflict = true
                  set T.inConflict = true

          existing SI code for write(T, x, xNew)
          # do not get write lock again
	

Combining both the item and predicate cases, we get the following algorithm for the optimized modified write(T, x, xNew):


	  modified write(T, x, xNew):

          get lock(key=x, locker=T, mode=WRITE)

          if there is a SIREAD lock(rl) on x 
              or there is a SIREAD lock(rl) on a range that writing of x is the first match change,
              with rl.owner is running
              or commit(rl.owner) >= begin(T):
                  if rl.owner is committed
                      and rl.owner.inConflict:
                          abort(T)
                          return UNSAFE_ERROR
                  set rl.owner.outConflict = true
                  set T.inConflict = true

          existing SI code for write(T, x, xNew)
          # do not get write lock again

          if present, release SIREAD lock(rl) on x, 
          but not those on a range that writing of x conflicts with
	

The pseudo-code for modified begin(T) and modified commit(T) remains the same in this optimized algorithm. The arguments for the correctness of this optimized algorithm are the same as those for the modified algorithm in [Ca 08].

Notice that any transaction record must be kept until all the active and future transactions in the system have start timestamps greater than its owning transaction's commit timestamp, so that they can't be concurrent with the transaction in question. In a distributed system, the garbage collector needs to query all the necessary nodes to conclude that all active transactions satisfy this condition before a transaction record can be purged. This happens asynchronously and should not impose a performance hit.

However, this only guarantees all active transactions are fine. What about future ones? A distributed clock usually doesn't issue timestamps in a monotonic manner, so could a future transaction be assigned a starting timestamp older than the commit timestamp of the transaction record's owning transaction? For instance, in the contrived example from our discussion of Lamport's logical clock, a node that has been inactive for a long time could generate such a timestamp. The solution is again to timestamp the garbage collection messages or the heartbeat messages to find out the smallest possible value for the maximum timestamp each node has issued. If all these values are greater than the commit timestamp in concern, then we are fine, since this clock algorithm is monotonic at each node. A similar strategy can be applied to the other distributed clocks listed earlier, except the 'happened before' distributed clock. For the 'happened before' distributed clock, if the Monotonicity Condition is satisfied, it too will generate timestamps monotonically at each node; but the discussion in Appendix B indicates that this is not always true. If you are going to use a distributed clock other than those discussed in this article, this issue must be addressed. For the rest of this section, we'll assume this technique is in place and we don't need to worry about these future transactions once all the active ones have been checked.
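As a concrete illustration, here is a minimal Python sketch of this purge test. The names are hypothetical: active_start_ts holds the start timestamps of all currently active transactions, and node_min_of_max_ts holds, for each node, the smallest possible value of the maximum timestamp that node has issued, as derived from the timestamped garbage-collection or heartbeat messages.

```python
def can_purge(record_commit_ts, active_start_ts, node_min_of_max_ts):
    # A transaction record may be purged only when (1) every active
    # transaction started after the record's owning transaction
    # committed, and (2) every node's clock has already advanced past
    # that commit timestamp, so no future transaction can be handed an
    # older start timestamp (the clock is monotonic at each node).
    if any(ts <= record_commit_ts for ts in active_start_ts):
        return False
    if any(ts <= record_commit_ts for ts in node_min_of_max_ts):
        return False
    return True
```

If either check fails, the garbage collector simply retries later, once the laggard transaction commits or the laggard node's clock advances.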

It turns out this purging rule also applies to the SIREAD locks that haven't been released in the primary algorithm. Let's see why from the following example.

Example 27: Let T1 and T2 be two transactions executed under Snapshot Isolation with a timestamp service satisfying the Clock Condition. And there is a predicate RW conflict from T1 to T2 such that the predicate read in T1 uses a 'decision set' version v1 while T2 established the next 'decision set' version v2 that changes the match. The start and commit timestamps for T1 and T2 are ts1, tc1 and ts2, tc2 respectively such that T1's start “happened before” T2's and T1's commit “happened before” T2's and hence ts1 < ts2 and tc1 < tc2.

If the SIREAD lock for the predicate read in T1 was released when T1 committed at tc1, then when T2 tried to commit its write at tc2, the RW conflict could not be recognized. So it must be kept until after tc1. And it must be kept until all the active or future transactions in the system have greater starting timestamps than tc1. That's because by then a potentially conflicting T2 won't be concurrent with T1 any more and hence won't be part of a 'dangerous structure'.

The item RW conflict case is similar.



Garbage collection of field and 'decision set' versions is similar, but we express it in a different way: a version can be purged as long as the distributed clock has advanced past the timestamp ts of the next version of the concerned version and no transaction in the system has a start timestamp <= ts. For a distributed clock that gives out timestamps monotonically at each node, the first half of this condition implies each node's maximum generated timestamp is greater than ts.

So we use a flag called a SIREAD lock to capture the 'Read' part of a RW conflict. Can we do that with the 'Write' part too and replace the write lock with another flag? After all, one attractive feature of an optimistic concurrency control mechanism is that we don't have to build a locking system. It turns out we can. In the Snapshot Isolation we've defined in the last section, updates to the same field or the same 'decision set' are required NOT to be concurrent, and this renders the write locks unnecessary. A flag that replaces a write lock must also be kept until all the active and future transactions in the system have starting timestamps greater than its owning transaction's commit timestamp. An implementation of this flag system will be given in the next subsection.

With the optimization from type D of the Serializability Theorem, there will be fewer false positives when it comes to aborting transactions to prevent conflict loops from forming. But some will still arise, because a conflict loop implies the existence of a 'dangerous structure' while the converse does not hold.

Various ways to optimize the primary algorithm are discussed in Cahill's paper ([Ca 08]). I include them here for your reference. In Cahill's primary algorithm, they always abort the pivot (T2 in Theorem 2.1 of [Fe 05]) if either end of a RW conflict can be aborted. One may choose a different strategy, like aborting the younger transaction, so that long-running ones are more likely to survive. Also, for simplicity, in the primary algorithm the abort decision is postponed till the transaction's commit; but one can make that decision as soon as a 'dangerous structure' is detected. Likewise, conflicts are not recorded (with T.inConflict and T.outConflict) against transactions that have already been aborted, or that will be because both flags are set.

Finally, as a salute to Cahill's fine work, I will include one optimization of my own here: detect a 'dangerous structure' that is not in a cycle and avoid aborting those three transactions when the primary algorithm executes, if a pre-analysis is possible (as in the case where the transactions in an application are known).

5.5 A flag system

We'll provide a sketch of an implementation for a flag system to capture the RW conflicts as promised in the last subsection with a design that starts with the Observer design pattern as described in the 'gang of four' book([Ga 94]). The diagrams for class inheritance are depicted as follows:

Subject
Attach(Observer)
Detach(Observer)
Notify()
<--------------------------------------------------------------------
ConcreteSubject
GetState()
SetState(Observer)
subjectState
Observer
Update()
<--------------------------------------------------------------------
ConcreteObserver
transaction
timestamp
observerState

For the item RW conflict case, a ConcreteSubject object representing a field version is instantiated whenever a read of that version or the write of next version happens for the first time. Each such read is represented as a ConcreteObserver, which is stored in the list observers inherited from the Subject class. The transaction that performs this read and the time when it happens are recorded in the two variables in ConcreteObserver. The Notify() method is implemented as follows:


      for all o in observers {
        o.Update()
        Detach(o)
      }
	

The subjectState field is used to represent the write of the next version. It's initialized to NULL and updated when that write commits. For simplicity, its type is also assumed to be ConcreteObserver in this sketch. The two fields, transaction and timestamp, in ConcreteObserver then become the transaction that performs this write and the time when it happens, of course. You may add a new type for it if you prefer, but the diagrams need to be altered accordingly. It is updated in the SetState(Observer o) method when the write of the next version happens, as follows:


      if subjectState == NULL {
        subjectState = o
      }
      …
      Notify()
	

The Attach(Observer o) method is implemented as follows:


      if subjectState == NULL {
        observers.Add(o)
      } else {
        o.Update()
      }
	

The Update() method is implemented as usual:


      observerState = subject.getState()
	

Here the subject field is inherited from the Observer class and getState() simply returns subjectState. Inside method Update(), this.transaction's outConflict and observerState.transaction's inConflict are both updated to be true. In a typical operation scenario of the flag system, a bunch of reads of a field version come in first and each registers itself in the observers list; then a write of the next version happens and clears the list; after that, every read of that field version is recognized as causing a RW conflict and doesn't need to be added to the observers list any more.

In a typical implementation, we may arrange the access to subjectState to be atomic, while access to the observers list is protected by a semaphore. But then consider the following scenario: transaction T1 reads a field version (Attach(Observer o) is called in that process), discovers that subjectState is NULL and tries to add an Observer to the list; soon after that, transaction T2 writes the next version of that field (SetState(Observer o) is called in that process) and tries to notify every Observer in the observers list. If for some reason (for example, the thread that executes T1 gets context-switched out after subjectState is accessed), T2 acquires the semaphore of the observers list before T1 does, it clears the observers list before the Observer for T1 is added to it. This violates the semantics of the typical operation scenario described in the last paragraph by adding more Observers to the list after it is cleared. To circumvent this, we may require SetState and Attach to request an exclusive semaphore x before they access subjectState and observers and to relinquish it afterwards. The following is the resulting code for them:

SetState:


      sem_get(x)
      if subjectState == NULL {
        subjectState = o
      }
      …
      Notify()
      sem_give(x)
	

Attach:


      sem_get(x)
      if subjectState == NULL {
        observers.Add(o)
      } else {
        o.Update()
      }
      sem_give(x)
	

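The full item-case flag system, with the exclusive semaphore x guarding both subjectState and observers as just described, can be sketched in Python as follows. This is a minimal model under this sketch's simplifying assumptions: a plain Txn record carries the inConflict/outConflict flags, and the Update() logic is inlined as a helper.

```python
import threading

class Txn:
    def __init__(self, name):
        self.name = name
        self.inConflict = False
        self.outConflict = False

class ConcreteObserver:
    def __init__(self, transaction, timestamp):
        self.transaction = transaction
        self.timestamp = timestamp

class ConcreteSubject:
    """One field version: readers register as observers; the write of
    the next version is recorded in subjectState and notifies them."""
    def __init__(self):
        self.observers = []
        self.subjectState = None   # the write of the next version, once committed
        self.x = threading.Lock()  # exclusive semaphore shared by SetState/Attach

    def _update(self, reader):
        # an item RW conflict: reader --rw--> writer
        reader.transaction.outConflict = True
        self.subjectState.transaction.inConflict = True

    def Attach(self, reader):
        with self.x:
            if self.subjectState is None:
                self.observers.append(reader)
            else:                      # late reader: conflict is immediate
                self._update(reader)

    def SetState(self, writer):
        with self.x:
            if self.subjectState is None:
                self.subjectState = writer
            # Notify(): flag every registered reader, then detach them all
            for o in self.observers:
                self._update(o)
            self.observers.clear()
```

The exclusive lock makes the append in Attach and the notify-and-clear in SetState mutually exclusive, so no Observer can be added to the list after it is cleared.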
Locality is important. We now have a ConcreteSubject object for each field version of a field. To take advantage of temporal locality, we may place all the field versions of that field into one single ConcreteSubject object. This way we will have as many observers and subjectState pairs as the number of versions of that field in it. We may even use one single exclusive semaphore to control access to all these pairs if we want to reduce memory footprint, at the cost of lower concurrency.

Usually all fields in a logical unit are accessed at the same time. We may take advantage of spatial locality if we place the field versions of all these fields in one ConcreteSubject object. We may also consider using one single exclusive semaphore for all of them if it does not impose too much of a performance hit. If nearby rows in an index are often accessed together, we may go the extra mile by placing the relevant field versions in these rows into one ConcreteSubject object. For example, in TiDB, placing the relevant fields in the rows of a Region into a ConcreteSubject object may not be a bad idea (if a Region is too large, try to divide it into smaller blocks); in a system like NDB Cluster, however, this may not necessarily be beneficial since this part of spatial locality is at least partially destroyed by hashing.

I truly hope that future serializability implementations can make all these choices, namely what field versions to place into a ConcreteSubject object and how to use semaphores to control access, configurable by developers. After all, who knows better than its developer whether an application can take advantage of locality? Even better, the implementation could provide a logical-unit-version counterpart to field versions so that a developer can choose it when certain that the application accesses data strictly through logical units. This approach can also reduce memory footprint by reducing the number of Observer objects in the system.

Even so, there still could be too many Observer objects in the system, which might lead to resource exhaustion. PostgreSQL handles this with the following strategy: if too many tuples in a page have acquired a tuple SIREAD lock, aggregate them into one page SIREAD lock; if too many page SIREAD locks are in the system, aggregate them into one relation SIREAD lock. The details can be found here in PostgreSQL's wiki (at the end of "1.5 PostgreSQL Implementation"). Future flag system implementations could provide a similar mechanism, with an additional field level in the picture of course. Such an approach will nullify our effort of resolving conflicts down to field level and lead to false positives that would not otherwise have surfaced, hence it should only be considered a measure of last resort. Before that, the system should provide enough memory for this purpose. Luckily, today's high-end servers can already host up to a few TB of memory. There should be more than enough for us, especially when the data in the database are stored on disk. It is a good idea to make the maximum amount of memory the flag system may allocate for Observer objects configurable, so that database application developers can make their own decision, since once again, who knows the application better than them? Designing the Observer objects to be as compact as possible also helps.
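To make the aggregation idea concrete, here is a hedged Python sketch of one level of PostgreSQL-style promotion (tuple SIREAD locks collapsing into a single page SIREAD lock). The data shapes and threshold are hypothetical, not PostgreSQL's actual structures.

```python
from collections import defaultdict

def promote_tuple_locks(locks, max_per_page):
    """locks: set of (page, tup) SIREAD locks; a tup of None denotes a
    whole-page lock. If a page holds more than max_per_page tuple
    locks, collapse them into one page lock, trading precision
    (more false positives) for bounded memory."""
    per_page = defaultdict(set)
    for page, tup in locks:
        per_page[page].add(tup)
    out = set()
    for page, tups in per_page.items():
        if None in tups or len(tups) > max_per_page:
            out.add((page, None))          # promoted: covers every tuple in page
        else:
            out |= {(page, t) for t in tups}
    return out
```

A second, analogous pass would promote page locks into a relation lock; a field-level flag system would add one more level below the tuple.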

In the predicate RW conflict case, when a predicate read of or a write to range R happens, a ConcreteSubject object is instantiated. The predicate reads are registered in the observers list as in the item RW conflict case, while subjectState becomes a list of objects representing writes in this range. This time we need a new concrete observer class for this kind of object. So we have the following diagrams:

Observer
Update(Observer)
<--------------------------------------------------------------------
ConcreteObserver
transaction
timestamp
Observer
Update(Observer)
<--------------------------------------------------------------------
ConcreteObserverForWrite
transaction
timestamp

The methods to access subjectState are different too.

Subject
Attach(Observer)
Detach(Observer)
Notify(Observer)
<--------------------------------------------------------------------
ConcreteSubject
AttachWrite(Observer)
DetachWrite(Observer)
NotifyWrite(Observer)
subjectState

The implementation of Notify(Observer oForWrite) is as follows:


      for all oForRead in observers {
        oForRead.Update(oForWrite)
      }
	

Here oForRead and oForWrite are of types ConcreteObserver and ConcreteObserverForWrite respectively. The implementation of NotifyWrite(Observer oForRead) is similar:


      for all oForWrite in subjectState {
        oForWrite.Update(oForRead)
      }
	

The implementations of the Update(Observer) methods in the two concrete observer classes are symmetric. In ConcreteObserver, the Observer oForWrite passed in is a ConcreteObserverForWrite object that represents a write; the Update(Observer) method examines whether that write is the first match change for the read represented by its calling Observer oForRead. On the other hand, in ConcreteObserverForWrite, the Observer oForRead passed in is a ConcreteObserver object that represents a read; the Update(Observer) method examines whether the write represented by its calling Observer oForWrite is the first match change for that read. If so, outConflict of the reading transaction and inConflict of the writing transaction are both set.

Both observers and subjectState are guarded with a shared semaphore for access. Let's call these two semaphores s.o and s.s respectively. Then the implementation of method AttachWrite(Observer oForWrite) is as follows:


      sem_get(s.s, exclusive_mode)
      subjectState.Add(oForWrite)
      sem_give(s.s)
      sem_get(s.o, shared_mode)
      Notify(oForWrite)
      sem_give(s.o)
	

Symmetrically, the implementation of method Attach(Observer oForRead) is as follows:


      sem_get(s.o, exclusive_mode)
      observers.Add(oForRead)
      sem_give(s.o)
      sem_get(s.s, shared_mode)
      NotifyWrite(oForRead)
      sem_give(s.s)
	

If we view observers and subjectState as two data fields and the accesses to them in methods AttachWrite and Attach as two transactions, it is possible to find a conflict loop between them when they are executed concurrently, and hence it is NOT serializable by type D of the Serializability Theorem. However, the semantics that each and every predicate RW conflict is captured remains correct, since each possible predicate RW conflict is evaluated by either AttachWrite or Attach. That said, there is a chance that a predicate RW conflict is evaluated twice in this process. But that doesn't change the correctness of the algorithm, since the updates to outConflict and inConflict are idempotent.
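Putting AttachWrite and Attach together, a minimal Python sketch of the predicate-case flag system might look like the following. Python's standard library has no shared/exclusive semaphore, so shared-mode access is simplified to plain mutexes here, the observer objects are flattened to the transactions themselves for brevity, and first_match_change is a hypothetical predicate supplied by the caller.

```python
import threading

class Txn:
    def __init__(self):
        self.inConflict = False
        self.outConflict = False

class PredicateSubject:
    """Range R: predicate reads live in `observers`, writes in
    `subjectState`. Each side adds itself under its own semaphore,
    then scans the other side, so every predicate RW conflict is
    evaluated by at least one of the two methods (possibly both,
    which is harmless since the flag updates are idempotent)."""
    def __init__(self, first_match_change):
        self.observers = []            # predicate reads
        self.subjectState = []         # writes in R
        self.s_o = threading.Lock()    # stands in for semaphore s.o
        self.s_s = threading.Lock()    # stands in for semaphore s.s
        # hypothetical: does `write` change the match of `read` first?
        self.first_match_change = first_match_change

    def _flag(self, read, write):
        if self.first_match_change(read, write):
            read.outConflict = True
            write.inConflict = True

    def AttachWrite(self, write):
        with self.s_s:
            self.subjectState.append(write)
        with self.s_o:                 # Notify(write)
            for read in self.observers:
                self._flag(read, write)

    def Attach(self, read):
        with self.s_o:
            self.observers.append(read)
        with self.s_s:                 # NotifyWrite(read)
            for write in self.subjectState:
                self._flag(read, write)
```

Note the two locks are acquired sequentially, never nested, so this interleaving cannot deadlock.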

In some implementations, ranges can be merged or split. For example, in TiDB, two Regions can be merged if there are too few tuples in them, and a Region can be split if it gets bigger than its default size of 256 MB. The algorithm must be able to cope with this. For the merging case, the two relevant ConcreteSubject objects are merged into one, and so are the observers lists and the subjectState lists in them. For the splitting case, the observers list is copied to both newly generated ConcreteSubject objects, while the subjectState list is split into two so that each only contains the writes for its sub-range.
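A minimal sketch of the merge and split rules just described, using plain dictionaries for the two lists (the shapes are illustrative only):

```python
def merge_subjects(a, b):
    # merging two ranges merges both the observers and subjectState lists
    return {"observers": a["observers"] + b["observers"],
            "subjectState": a["subjectState"] + b["subjectState"]}

def split_subject(s, in_left):
    # observers (predicate reads over the whole old range) are copied
    # to both halves; each write goes only to its own sub-range,
    # decided by the hypothetical in_left() predicate
    left = {"observers": list(s["observers"]),
            "subjectState": [w for w in s["subjectState"] if in_left(w)]}
    right = {"observers": list(s["observers"]),
             "subjectState": [w for w in s["subjectState"] if not in_left(w)]}
    return left, right
```

Copying the observers to both halves is conservative: a predicate read over the old range still covers both sub-ranges, so dropping it from either half could miss a conflict.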

So far, this works fine for a predefined range R that is associated with a known object, like a Region in TiDB, a page in Cahill's prototype of Serializable Snapshot Isolation deployed in Berkeley DB, or a tuple in the added table for predicate RW conflicts in the serializability implementation I've developed for NDB Cluster. But in other implementations where this association is missing, there is an extra issue. For example, in the CockroachDB case, the actual range (or its components) inside a where clause is recorded in a node-local cache to handle predicate RW conflicts. Say, in a statement "… where key>=1 and key<5 or key>100 and key<200", [1, 5) and (100, 200) are inserted into their target node-local caches respectively. Suppose we use a range R for [1, 5) in the primary algorithm and an insert into it "happened before" the predicate read, but they still constitute a predicate RW conflict (for example, the transaction with the predicate read starts with an earlier timestamp and is a long-running one, and the predicate read happens very late in that transaction); then when the insert comes in, we don't know where to put it since the range R for [1, 5) hasn't been determined yet. To get around this, we need to provide an auxiliary data structure, like a list, to hold such writes until an appropriate predicate read shows up, or otherwise they are garbage-collected.

So it's time to talk more about garbage collection. Some objects in observers and all objects in subjectState are never deleted in the primary algorithm. They need to be garbage-collected when they are not needed any more. Those inside observers are just SIREAD locks, and the rule for garbage-collecting them has been discussed. For an object inside subjectState, the rule for garbage collection is the same: it can be purged as long as no active or future transaction in the system has a start timestamp <= its timestamp; this way, all the possibly conflicting SIREAD locks have been processed against it before garbage collection. The Detach and DetachWrite methods are used in this garbage collection process. This rule also applies to the writes in the auxiliary data structure.
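The purge rule for subjectState can be sketched as a simple sweep; min_possible_start_ts is assumed to be a lower bound on the start timestamp of every active and future transaction, computed as discussed earlier in this section.

```python
def gc_sweep(subject_state, min_possible_start_ts):
    # A write object can be detached once its timestamp falls below
    # the lower bound: no active or future transaction can have a
    # start timestamp <= that write's timestamp, so every possibly
    # conflicting SIREAD lock has already been evaluated against it.
    return [w for w in subject_state if w["ts"] >= min_possible_start_ts]
```

The same sweep, applied to the observers list, garbage-collects the retained SIREAD locks.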

We also need a buffer pool to store pre-allocated and initialized ConcreteObserver and ConcreteSubject objects for performance considerations, especially in a system that doesn't support slab allocation.

Notice in this flag system, every read is accompanied by at least one write (of an Observer object). In contrast, in a database with pessimistic technology, only a locking read is accompanied by a write (the lock request in the locking system). This may impact performance, of course. Future implementations should provide tools for pre-analysis so that unnecessary writes to the flag system (for example, for a field that is only read but never written, or the other way around) can be avoided. I understand Serializable Snapshot Isolation poses as a serializability implementation that can handle workloads unknown at the time of development. But sometimes performance trumps, and important workloads usually don't allow arbitrary transactions to be fed into the system for safety reasons. Providing these pre-analysis tools certainly helps with this latter situation. For example, if these tools are good enough, we don't even need to activate the flag system for TPCC since it will be detected that it doesn't contain any 'dangerous structure'.

So far the flag system works if all transactions commit successfully. But what if a transaction is aborted prematurely, or the system crashes? For example, what happens if a transaction T is aborted after a SIREAD lock is added to the observers list and the subjectState list is being notified? We'll try to answer the following two key questions should T be aborted prematurely, or the system crash:

  1. Does the flag system still guarantee correctness by destructing each and every 'dangerous structure'?
  2. Can the flag system maintain a correct state when this happens?

We first look at the case when a transaction is aborted prematurely. In the example we've just described, we may just remove the SIREAD lock (with the Detach method) when its containing transaction is rolled back. Some objects in the subjectState list may be deleted (with the DetachWrite method) when their containing transactions are rolled back because their associated inConflict fields are set. Although these transactions are aborted because of some transaction that never commits, this only generates false positives for the presence of 'dangerous structure's. So the answer to the first question is affirmative. The flag system remains in a correct state after T is rolled back since all the objects for the relevant transactions that can't commit are removed.

In the case where the system crashes, for the first question: if the recovery mechanism is like Write-Ahead-Logging, so that no committed transaction is lost when the system crashes, all we need to worry about are those transactions that hadn't committed when disaster struck. Some of them may have caused other transactions to abort, but that is it. In particular, they don't roll back committed ones and hence don't affect the surviving ones. So the answer to the first question is again affirmative.

If, on the other hand, the recovery mechanism is like neighborhood-Write-Ahead-Logging as in NDB Cluster, so that some committed transactions are lost in the crash, the answer to the first question is still affirmative. The reason is simple: if all the committed transactions together can't generate any 'dangerous structure', how could a subset of them?

The second question doesn't apply in either case if the system crashes, since upon system restart the flag system, which exists in memory only, will be recreated from scratch. We've used an example for demonstration here, but this reasoning actually works for all other situations.

Notice this flag system implementation doesn't require a join to be re-written as equivalent statements as in the serializability implementation I've developed for NDB Cluster. That is because conflict resolution happens when a predicate read is on-going, which is after an execution plan is chosen.

5.6 False Positives

Some steps in this algorithm could generate false positives for conflict loops. We'll identify them and try to provide alleviation if possible.

  1. Having a 'dangerous structure' doesn't mean it is in a conflict loop. An implementation may provide tools for pre-analysis of conflicts so that the obvious false positives can be discovered beforehand. And it should provide means to annotate the relevant conflicts such that the primary algorithm can skip them.
  2. Aggregation of finer granularity SIREAD locks into one coarser granularity SIREAD lock could also lead to false positives.
  3. Range R chosen to cover a predicate read is too large and this could lead to false positives. Making sure this cover is as 'tight' as possible is an alleviation. For example, in TiDB, if a Region is too large, we may use a sub-region of it as R if that is supported.
  4. If range R is scattered into different nodes, an update that moves a tuple from one component to another could lead to a false positive. To eliminate this type of false positives, we need to globally monitor R. This may induce a performance hit.
  5. The last type of false positives are those caused by a transaction that never commits, as we've just demonstrated in the last subsection. One way to eliminate some of them is to set the inConflict and outConflict fields in the primary algorithm right before T commits (delay calling the Update() methods until then), so that transactions aborted early do not cause any false positives.

This last technique of reducing false positives is relevant to the answer of the following question: can we move the flags up and down in a transaction? In the serializability implementation I've developed for NDB Cluster, we can always move the locks for prevention measures up the transaction, and sometimes move them down. What is the situation in a flag system? Are there any implications of doing that?

It turns out we may move them up or down freely. That is because altering their locations in a transaction only changes the time they are inserted into the observers and subjectState lists, not the outcome of the methods those actions trigger. We've just seen the advantage of moving these flags downward: potentially fewer other transactions are rolled back, because they may have already committed when the flags are set. The disadvantage, of course, is that this transaction itself might get aborted when a 'dangerous structure' exists, since the counterpart has already committed and can't be rolled back any more. In this case, some work could have been saved had this transaction set those flags and been rolled back early.

5.7 Summary and discussion

Let's summarize what we've accomplished in generalizing Serializable Snapshot Isolation to a distributed system:

  1. A field-based Snapshot Isolation for a distributed system is defined in Section 4.
  2. On top of it, type D of the Serializability Theorem is proved in Section 4. It can be applied to a distributed system as other types.
  3. Two alternatives are provided to actually implement the underlying field-based database system in this section. A distributed clock satisfying the Clock Condition can be chosen to implement a field-based Snapshot Isolation on top of it. 'Decision set' versions are captured as demonstrated in Section 4.
  4. The Clock Condition guarantees that arguments in [Fe 05] can go through in a distributed system. This time, type D of the Serializability Theorem is used to conclude serializability instead.
  5. A flag system is sketched to realize the book-keeping algorithm in [Ca 08].
  6. Different choices of distributed clocks and underlying database implementations give rise to different systems, and the flag system is an alternative to the hybrid system employed in PostgreSQL. The generalization is hence generic.

So now we have a Serializable Snapshot Isolation implementation without using any long locks. Can we use long locks to replace the flags in the flag system? If we only replace the read flag or the write flag with a long lock, then we need a hybrid system that involves both flags and long locks. That is exactly what PostgreSQL does and the details are described in the wiki here. So can we replace both with long locks? We can, but it turns out there is a catch.

From the implementation of the flag system, we know that a SIREAD lock can't be purged right after its containing transaction T1 commits, since another transaction T2 that started before T1's commit may perform a write after T1's commit that conflicts with this SIREAD lock. If T2 is long-running and we use a long lock as a SIREAD lock, this SIREAD lock will block writes conflicting with it for a very long time. This is certainly undesirable, since Snapshot Isolation based technology always poses as one in which writes don't need to wait for reads on the same data item. PostgreSQL's team knows about this and it is mentioned in its wiki (in the section "4. Innovations"). I don't know if Cahill's team knew about it; it is not mentioned in [Ca 08].

Although this section is based on type D of the Serializability Theorem, you may also apply the method developed in it to type B of the Serializability Theorem as long as the field versions are derived. Specifically, if you already have a distributed Snapshot Isolation implementation based on tuple versions, with one of the distributed clocks discussed in this section, all you have to do is implement the primary algorithm and the flag system on top of it to achieve serializability. That's exactly what we could've done with alternative one. Concurrency control is not as fine-grained as with type D of the Serializability Theorem, but the alteration to your code is minimized.

In general, there could be other specific issues you need to work out when you try to develop a Serializable Snapshot Isolation implementation in a distributed database system. For example, in the second-tier serializability implementation I've developed for NDB Cluster, I also worked on a subsection called 'Durability of consistency' so that more guarantees are provided in case an NDB Cluster suffers a system-wide crash. Such work requires fitting in a lot of details about the system so that it can go through. If you encounter problems while working out such an issue, please don't hesitate to contact me if you think I may be able to help.

From what we've discussed in this section, depending on the specific implementation, it may or may not require long locks other than semaphores for concurrent writes to the same field. But in the case it may, we need a field level capable locking system for finest concurrency control.

6. Field level capable locking system and a hybrid system

TODO

7. History, issues and the possible future

Much of the content of this section about the future is guaranteed to be premature and hence subject to change, so please read it with discretion.

The concepts of consistency and serializability date back to the 70s ([Es 76]). In particular, they predate the first SQL standard.

Back in SQL 101 in school, we were shown all kinds of anomalies that could arise from the interleaved execution of transactions, and eventually 2PL was demonstrated, concluding that all these problems were solved by such a serializability implementation.

But then when we look at commercial databases like MySQL InnoDB, we discover that its default isolation level is Repeatable-Read, not Serializable. If we dig deeper, we may find out the reason is that 2PL actually demonstrates thrashing behavior when the load is high ([To 93]). In many cases, sophisticated database application developers are presented with a 'choose your own poison' scenario: either sacrifice performance for better sleep, or keep up with the beat until corruption of data eventually results. Those who are not satisfied with this situation have never stopped the effort to find a serializability implementation that gives us both consistency and performance for many, if not most, workloads.

In late 2011 or early 2012, I noticed PostgreSQL was calling their product 'the most advanced database system in the world', so I looked into it to see why. One thing led to another, and I discovered Fekete's article ([Fe 05]) and eventually Adya's article ([Ad 00]). I was thrilled by the generality of Adya's framework. In particular, it seemed to me that we might apply it to a distributed system like NDB Cluster to achieve serializability there. Naturally, I tried to verify everything before I did that. This has led to all the concerns about the correctness of PostgreSQL's serializability implementation I've discussed throughout this article, including those discussed in Appendix C. At last I started to realize that maybe Adya's belief that the correctness of their version of the Serializability Theorem was implied by [Be 87] was incorrect after all. I didn't panic though. I thought maybe I could provide a proof myself, since after all this wasn't the first time we gave a proof to a version of the Serializability Theorem.

A lot of struggle lay down that road, primarily for two reasons. First, I discovered Example 0 very late in the process and had been assuming that Fekete's generalization of Adya's framework to field level, including predicate-based conflicts, was right. After trying to prove that for quite a while, I started to realize something was wrong and examined their framework carefully to figure out what. Their way of interpreting predicate-based conflicts caught my attention, and eventually this led to the concept of 'decision set' and its version system. Second, I had to make sure the distributed framework I developed was backward compatible with standalone serializability implementations like MySQL InnoDB's Serializable isolation level and PostgreSQL's Serializable Snapshot Isolation.

Before the summer of 2016, I proved a variant of the current type B of the Serializability Theorem, one in which every update to the 'decision set' of a predicate read causes a conflict. That variant is impractical to apply to a real-world application, but it served as an outer boundary so that my search for a solution could proceed inward. When I compared it with Adya's version of the Serializability Theorem, I proved a variant of the current type A of the Serializability Theorem. Then I discovered Example 0. Example 0 signifies that interpreting multi-field predicate-based conflicts all the way down to field level can be problematic. That is exactly what Fekete's paper does, and Fekete uses examples with multi-field predicate-based conflicts throughout his paper. Specifically, Fekete's paper doesn't have a concept equivalent to 'decision set'. The discovery of Example 0 raised more questions than it answered, with the central one being: which version of the Serializability Theorem does PostgreSQL's Serializable Snapshot Isolation use to guarantee serializability?

After I proved the current type A and type B of the Serializability Theorem, I decided to write them up. In that process, I discovered type C of the Serializability Theorem. At first, I wasn't sure about its correctness, since it seems to imply that whatever versions are chosen to read after a predicate match, the result is always correct. Later I figured out that it means the versions chosen must be such that the conditions for a serial history are satisfied: when a transaction is executed, it always reads the latest committed version of a data object that is available. An example demonstrating its necessity will be given in Appendix A. Eventually, type D of the Serializability Theorem was discovered, and I hope it will serve future implementations of the serializable isolation level, for example, when a field level capable serialization mechanism is in place for 'pessimistic technology'.

The contribution of this article to the literature is three-fold. First, with the help of the 'happened before' partial order, we provide a generic way of reasoning about consistency in a distributed database system. This achievement is crucial since it allows us to solve a bigger problem, namely, one that is not confined to a standalone database system. Besides providing a serializability implementation for NDB Cluster, the proof in Appendix D that Google's Spanner actually operates under serializable isolation level is also a demonstration of this. While Spanner's serializability is achieved with expensive hardware provisioning like GPS time servers and atomic clocks, we achieve it with a simple mathematical partial order.

Second, we've successfully found a way to interpret conflicts down to field level. This will in general reduce conflicts to a minimum, and workloads that require both data consistency and performance will benefit from it. Of course, in a lock based system, besides reducing conflicts, in the cases where conflicts and hence lock waits are unavoidable, we should also try to reduce lock waiting time by mounting the application on an in-memory database like NDB Cluster and minimizing the size of each transaction. Hence I highly recommend applying the method developed for NDB Cluster in this article when both data consistency and performance are important to you. This is the state of the art in consistency.

The third and most important contribution is that we – Adya, Fekete and I together – have set up a generic framework such that a specific serializability implementation can be tailored for each specific database system. The axiomatic approach we've taken will enable us to apply this method to most, if not all, database systems. We've demonstrated that for NDB Cluster in this article. For other 'pessimistic technology' based systems, especially those with a variant of two phase locking, a similar scheme should apply to achieve serializability. For 'optimistic technology' based systems like those implementing Snapshot Isolation, although we may still apply the same method, since Snapshot Isolation is still a special case of the Read-Committed isolation level, it is in general not optimal; Fekete's methodology is better. For a hybrid system like Google's Spanner, we've shown in Appendix D that its isolation level is actually serializable. For timestamp-based technology, we've proved in Appendix D that CockroachDB's serializability implementation is mostly sound, albeit with a few issues. We'll try to extend this list along the way.

For us, the sophisticated database application developers, when the abyss of developing a consistent, large (which usually implies a distributed infrastructure) and performance-sensitive database application stares at us, we may now stare back for the first time, with the method I've developed for NDB Cluster in hand.

Along with these advancements come two important generalizations of the Serializability Theorem: the Generalized Serializability Theorems and the Ramification Theorems. The idea behind the Generalized Serializability Theorems is not new in the literature. It is just the observation that Read-Only transactions can't affect database writes, and hence a conflict-loop-free set of Read-Write transactions will leave a database consistent if it started so. In this article, I've pushed this idea to its extreme and tried to get rid of all the reads that can't affect writes. Some of the results are expressed in the form of the Ramification Theorems. What is scarce in the literature is how to apply the Generalized Serializability Theorems to real world applications. While we are at it, I will take this opportunity to explore that a little bit.

Let's consider two scenarios: you are either a DBA or a client support staff member for a database application in which Generalized Serializability Theorem I or II is applied to achieve consistency.

In general, inconsistencies can be observed in a Read-Only transaction. If you are a DBA, as long as you don't take any action after seeing these inconsistencies, it will be fine. But what if you have to take action, as in Example 17, after you see those potentially inconsistent reads? If it is like the stock market case in Example 17, you may run that specific Read-Only transaction multiple times to make sure. That is because inconsistencies only show up when a sequence of SQL statements is executed in a specific order that forms a conflict loop. Chances are that when you re-run it, such a sequence will no longer exist and you will see no inconsistency.

The client support staff scenario is more complex since there is one factor that is less controllable, namely, your client. For example, you can't guarantee that your client won't take action upon seeing inconsistencies.

If the action taken is controllable, as in the case where a customer of an online shopping site sees inconsistencies in the order she/he placed and the only action she/he can take is to contact customer service, the support staff may re-run that specific Read-Only transaction multiple times and give the client the correct information. If, on the other hand, the action taken is NOT controllable in the support staff case, or we are in the nuclear reactor situation of Example 17 in the DBA case, applying the Serializability Theorem to the application instead would be advisable.

Notice that there is a limitation to the 're-running a Read-Only transaction multiple times to make sure' approach, since each re-run might see changes to the database state and it is sometimes hard to tell which result to use.

In general, to successfully apply the Generalized Serializability Theorems to your application, you need to know your application, the database deployed and the limit of the Generalized Serializability Theorems.

The Ramification Theorems are designed to handle datetime field related inconsistencies and auto-increment columns. We've also discussed another application scenario after Example 24, including proving that the prevention measures put in place as in Example 16 won't introduce new inconsistencies into the original application. We can also use the way inconsistencies ramify through columns to try to restore consistency when consistency related data corruption has already been detected. We'll try to develop more ways to apply the Ramification Theorems down the road.

So this is what we've achieved in this article and the framework can be used to explain the four known serializable isolation level implementations in commercial databases: 2PL, Serializable Snapshot Isolation in PostgreSQL, the one developed by Google's Spanner as described in Appendix D and the one I've developed for NDB Cluster. That is why I'd like to advocate for this framework to replace the part for serializability in the ANSI SQL standard, which only allows 2PL.

Remark: The framework can actually explain five known serializable isolation level implementations in commercial databases, including the one developed by Cockroach Labs as described in Appendix D. I didn't count it here since that implementation is still problematic. Interested readers may refer to Appendix D for more details.

As we can see, isolation levels are useful in the development of a serializability implementation, since such an implementation usually starts from a lower level, like Read-Committed in the NDB Cluster case and Snapshot Isolation in PostgreSQL's Serializable Snapshot Isolation. There is another application of isolation levels in some commercial systems though. For example, MySQL InnoDB allows database application developers to annotate different transactions in an application with different isolation levels, so that it is as if the whole application were executed under a hybrid isolation level. However, I've never found a very good way to apply this kind of hybrid isolation level to an application, nor does the MySQL InnoDB documentation demonstrate one. Maybe the following example can explain why.

Let's say we annotate a transaction with serializable isolation level. This implies we expect it to provide the following guarantee: if this transaction reads consistent data from the database, it must write consistent data to it too. This requirement is a humble one since it is the basic assumption for the whole theory of conflict serializability to work out. But if there are other transactions in the application annotated with lower isolation levels, they may render the database state inconsistent even if it started out consistent. Then the serializable transaction might read these inconsistent data and write out inconsistent data to the database too. In this case, the requirement that 'if this transaction reads consistent data from the database, it must write consistent data to it too' is trivially satisfied. But the real question is: if you were held responsible by your boss for screwing up her/his database, since the serializable transaction you wrote was not supposed to write inconsistent data, is this mathematical nonsense good enough to prevent you from being fired?

As database application developers, we care more about which fields in the database are consistent and which may not be, and about which statements in a transaction or which transactions return consistent reads, than about which transactions are serializable and hence satisfy the requirement that 'if this transaction reads consistent data from the database, it must write consistent data to it too'. The Ramification Theorems deliver exactly that. Therefore I'd like to advocate including them in a future version of the ANSI SQL standard too.

7.1 Questions to be answered

Next let's explore a bit the questions that haven't been answered in this article. I hope the answers to them will help shape the future of the field of consistency.

When multiple transactions try to update the same item, a race condition is created. There are currently two approaches to resolving this race condition: let them race or let them wait. The 'let them race' approach aborts all the transactions that lose the race, leaving only one winner. In a hot spot scenario where a lot of transactions are trying to update the same field simultaneously, the 'let them race' approach leads to cascading abortion, which is as bad as the thrashing behavior demonstrated by 2PL. Hence implementations like Snapshot Isolation 'optimistically' assume a race condition will not happen. This is the so-called 'optimistic technology' which underlies the consistency model in important commercial databases like Oracle, Microsoft SQL Server and PostgreSQL. The 'let them wait' approach assumes a race condition could happen and lets the contenders wait for their turn. This is the so-called 'pessimistic technology'. MySQL InnoDB and NDB Cluster both implement it.

As a second-tier developer, I've never been a big fan of the 'optimistic technology' approach. That is because anyone who assumes the role of a database application developer long enough will encounter the hot spot problem and need to handle it. So if one starts with 'optimistic technology', she/he will eventually have to take a look at 'pessimistic technology'. So why not start with the latter in the first place?

7.1.1 Question one

As described in Appendix D, Google's Spanner tries to retain all the advantages and avoid all the disadvantages of 'optimistic technology' by sending only Read-Only transactions to a Snapshot Isolation component, while sending all the Read-Write transactions to a 2PL component, so that these two components merge to form the consistency implementation. This way, the hot spot problem is handled by 'pessimistic technology'. And the first question that hasn't been answered in this article is:

  1. Since the framework developed in this article allows both 'optimistic technology' and 'pessimistic technology', can we use it to develop an implementation that combines benefits from both, without being just a simple merge like Google's Spanner's?

When I say 'without being just a simple merge', I mean that it should at least be like type H of the Serializability Theorem, so that 'optimistic technology' or 'pessimistic technology' is applied at a finer level, like a per-table or per-transaction basis, rather than at the server level as in Google's Spanner. Maybe we could cook up a version of the Serializability Theorem like this, but would it be practical to implement in engineering? We'll try to find some answers along the way. Meanwhile, I am open to any suggestion to this end.

7.1.2 Question two

The second question is one that this article has been trying to answer.

  1. Can we develop a serializability implementation that works for many, if not most, database workloads, in the sense that it satisfies both consistency and performance requirements of an application?

Of the four serializability implementations discussed so far, 2PL and Google's Spanner (since it has a 2PL component) will suffer the thrashing behavior under the heavy loads this article is trying to address, while Serializable Snapshot Isolation of PostgreSQL has to assume that a race condition will never happen. So it seems these three may not be the best answer after all.

The serializability implementation I've developed for NDB Cluster represents a major advance in 'pessimistic technology' since the invention of 2PL. So is it a satisfying answer to the second question? Well, if the application is as simple as a short blog application like that in Example 12, the answer is certainly 'yes'. For a rather complex application like TPCC, maybe we could find a way to sacrifice some consistency, as in Theorem 4, to achieve the desired performance. But when we can't sacrifice consistency, as in the nuclear reactor case in Example 17, or for even more complex applications, I am not sure.

7.1.3 Free and paid service

I am currently working on a couple of papers in this series. After that I will find a place, like a company, which can offer me an NDB Cluster to play with to do some testing, probably starting with the TPCC application, to see how this implementation works out. But unfortunately I don't know every database application in the world and can't test them one by one. That is why I am promoting this article in the NDB Cluster forum and hope that the sophisticated NDB Cluster developers here can help me out: apply this serializability implementation in your application when necessary and give me some feedback, so that we can improve it together.

For that purpose, I've written a helper program that will assist you in analyzing your application so you can employ the prevention measures necessary to achieve the desired consistency requirement. The program is available here. If you have questions, don't hesitate to contact me, either by replying to my article in the forum or by sending an e-mail to creamyfish@gmail.com. Currently, I provide two kinds of service: a free service and a paid service. The free service is for everyone for whom response time is not an issue, because I'd like to answer the intriguing questions first; that way I can interact with the individual who asked the question while she/he is still around. Every question will be answered eventually though. The paid service is for someone in an urgent situation. For example, if there is a deadline for your application and you probably will not make it without my help in time, you may consider using it. I intentionally set the rates very high (for an individual, not for a corporation) so that 'kids don't try it at home'. If you are interested in using this paid service, the terms of service and rates are available here.

In both services, especially the paid one, please follow the guidelines below for swift service; otherwise we might need at least a few rounds of communication before I can really start helping you.

  1. If your application only needs consistency and it is neither large (so large that it requires a distributed infrastructure) nor performance sensitive, then probably MySQL InnoDB's or PostgreSQL's serializability implementation can already fulfill your requirements (assuming PostgreSQL's Serializable Snapshot Isolation is correct).
  2. Otherwise the systematic way I've developed for MySQL may be helpful. To start with, make sure the schemas in your application are well-structured like those in TPCC. The way to fence off predicate RW conflicts is with granular locking, and choosing a fine-grained granularity for each table involved has major performance implications. In TPCC, splitting a warehouse into ten districts so that the tuples in the warehouse are sorted by district is one such example.
  3. Please describe your application in full detail, as is done for TPCC, since the current optimization process depends heavily on the application's semantics, especially if you need my help altering your application for a performance boost. Specifically, please describe what each statement or each group of statements is trying to accomplish in each and every transaction.
  4. Try to provide a statistical breakdown of the transactions in your application. Taking TPCC as an example, it should be something like: new-order transactions take up around 20% of the workload, delivery transactions take up around 2% of the workload since around 10 orders are delivered together, and so on. If the application has been deployed, these statistics should be easy to collect; otherwise please provide an educated guess or just describe the typical usage scenarios of your application. This will help me prioritize the conflicts in the subsequent analysis, since when it comes to performance, not every conflict is created equal.
  5. I understand that if an application requires both consistency and performance, it is probably a crucial one in your company. So feel free to replace the column names with something like col1, col2, so that only necessary info is provided to me for inspection if discretion is needed. Note that such a cover-up may make my comprehension of your application difficult, so make sure to compensate for it in your detailed description of the semantics.
  6. If you want me to use the Generalized Serializability Theorems or the Ramification Theorems to provide consistency for your application because the consistency of some fields can be sacrificed, please specify the subset of fields that are important for your application (and hence can't contain any inconsistency whatsoever).
  7. If it's okay for me to alter your application for performance improvement, please request that explicitly, since the default is NOT to. I will try to suggest ways of doing so upon request if possible. Whether any of these suggestions can be applied is application specific.

7.1.4 Question three

If, in the unfortunate case, after we implement a field level capable locking system and make sure our application is deployed on a memory-based system like NDB Cluster, we still can't provide a satisfying answer to the second question, what should we do? Let's look at the following example.

Example 28: Let t be the table in Example 0 with only one row (1, 1, 1, 5) in it. Execute the following statements in T1 and T2 in that order in NDB Cluster.


	               T1                                                              T2

          start transaction;

          select value into :value from t where id1=1;

                                                                              start transaction;

                                                                              update t set value=25 where id1=1;

                                                                              commit;

          update t set value=:value^2 where id1=1;

          commit;                                                                                                 
	    

There is clearly an item RW conflict from T1 to T2 and a WW conflict from T2 to T1. So the application of type B of the Serializability Theorem indicates this history H can't be a serializable history. However, if we compare the operations in H with those of the serial history {T1, T2}, all the operations read or write the same values. The only discrepancy is the version of the 'value' field written in the two histories.


	                                                                                                       ## 
	    

This example demonstrates a scenario that resembles the Generalized Serializability Theorems: a non-serializable history leaves an originally consistent database in a consistent state. But we can't use the Generalized Serializability Theorems to interpret it, since the history doesn't have any Read-Only transactions and the reads in a Read-Write transaction do affect writes.
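A minimal sketch in Python makes the point concrete (the SQL is reduced here to reads and writes of one in-memory row; the dict layout is purely illustrative):

```python
# Example 28: the row starts with value=5; T1 squares the value it read,
# T2 sets value=25. Compare interleaved history H with serial {T1, T2}.

def history_h():
    row = {"value": 5}
    v = row["value"]        # T1: select value into :value (reads 5)
    row["value"] = 25       # T2: update t set value=25; commit
    row["value"] = v ** 2   # T1: update t set value=:value^2; commit
    return row["value"]

def serial_t1_t2():
    row = {"value": 5}
    v = row["value"]        # T1 reads 5
    row["value"] = v ** 2   # T1 writes 25 and commits
    row["value"] = 25       # T2 writes 25 and commits
    return row["value"]

print(history_h(), serial_t1_t2())  # both histories end with value=25
```

Both runs end with value=25, which is why H leaves the database consistent even though type B of the Serializability Theorem rejects it.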

Most work in the literature focuses on avoiding conflict loops to provide serializability. Few discuss how, when and where inconsistency would show up when such a loop exists. If we can answer some of these questions, maybe we can find a way to interfere so that inconsistency is suppressed to a bearable level, as the Generalized Serializability Theorems and Example 28 have demonstrated (in both cases, inconsistencies don't show up in the underlying database). So in general, we may ask the third question as follows.

  1. When a conflict loop exists, how, when and where would inconsistency show up?

7.2 Hot spot problem

While we are on the topic, let's explore the hot spot problem a little bit. We start with an example showing how bad it could be when 'optimistic technology' encounters the hot spot problem.

Example 29: Suppose a Super Bowl ticket selling application is deployed on a database implementing Serializable Snapshot Isolation. At its peak, 1000 transactions are started simultaneously, each attempting to purchase a ticket by executing the following statement:


        update t set ticket=ticket-1 where ...;
	  

This creates a hot spot problem for the tuple containing the 'ticket' field. The database will execute exactly one transaction and abort the other 999. Let's say the application will re-try every aborted transaction until all of them have finished. Also assume, for simplification, that no new transactions flood in after these 1000 and that each aborted transaction has done half of its work before its abortion.

The efficiency of a system is defined to be


        e = Real work done by the system / Total work done by the system.                                                                                                   
	  

In this case, e = 1000 / [1000 + (1 + 2 + … + 999) * 0.5] < 0.004, which is extremely low. A lot of CPU time is wasted on those aborted transactions, not to mention the time spent waiting for them to finish. This is a typical cascading abortion scenario for 'optimistic technology' like Serializable Snapshot Isolation. And it understates the problem, since there are usually new transactions flooding in.
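The arithmetic can be checked with a few lines of Python, a sketch of the model described above: the rounds abort 999, 998, ..., 1 transactions respectively, each abortee wasting half a transaction's worth of work.

```python
def efficiency(n, wasted_fraction=0.5):
    """Efficiency when n transactions race for one hot item and the
    application retries every abortee until all have committed."""
    real_work = n                                    # each commits exactly once
    # Successive rounds abort n-1, n-2, ..., 1 transactions respectively.
    wasted_work = sum(range(1, n)) * wasted_fraction
    return real_work / (real_work + wasted_work)

print(efficiency(1000))  # ~0.00399, i.e. below the 0.004 bound in the text
```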


	                                                                                            (To be continued ...)  
	  

In an ideal world, in a database system where 'pessimistic technology' is implemented, each transaction in a race condition would just wait quietly in a lock queue for its turn. This way, in a thrashing system, although the wait for a lock is so long that the system appears stalled, it is actually making progress and the waiting transaction will eventually be served. The reality, of course, is far more complex than the ideal case.

First, the update transactions flooding in might saturate other resources of the system and effectively mount a DoS attack. For instance, if the lock wait queue grows too long, it might consume all the system memory and drive the system into an erroneous state.

Second, 'pessimistic technology' like NDB Cluster usually uses time-outs to detect deadlocks. This mechanism will time out those statements that have been waiting too long for a lock. If we employ a re-try mechanism to restart these deadlock victims, this effectively re-creates the cascading abortion scenario discussed in Example 29.

So to employ 'pessimistic technology' to deal with the hot spot problem, we must use a different deadlock detection/elimination strategy.

In the literature, besides using time-outs to eliminate potential deadlocks, people have also explored lock dependence tracing to detect deadlocks. In this approach, lock dependences between transactions are surveyed regularly to find out whether a dependence loop has formed, and one of the transactions in the loop is aborted if so. This algorithm usually implies a Depth-First Search (DFS) in an inner loop. For a dense graph, DFS is an O(n^2) algorithm, so this lock dependence tracing mechanism may not be very practical. I am currently not aware of any commercial database system employing it.
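The lock dependence tracing idea amounts to cycle detection in a wait-for graph. The sketch below (transaction names are illustrative) shows the DFS inner loop the text refers to:

```python
# Deadlock detection by tracing lock dependences: waits_for maps each
# transaction to the set of transactions it is waiting on; a cycle in
# this wait-for graph means a deadlock.
def find_cycle(waits_for):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in waits_for}

    def dfs(t):
        color[t] = GRAY
        for u in waits_for.get(t, ()):
            if color.get(u) == GRAY:                 # back edge: a loop
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(dfs(t) for t in waits_for if color[t] == WHITE)

print(find_cycle({"T1": {"T2"}, "T2": {"T1"}}))  # True: T1 and T2 deadlock
print(find_cycle({"T1": {"T2"}, "T2": set()}))   # False: no loop
```

The outer "survey regularly" loop repeating this DFS over a dense graph is what makes the approach expensive in practice.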

There is another deadlock detection/elimination strategy in the literature: we may impose a total order on the locks acquired by transactions in an application. For example, if the application consists of five tables, we impose a total order on those tables; when we acquire locks within a specific table, we also follow a total order – the order of the primary key is one such choice; and if we implement a field level capable locking system in the future, we may also need to impose a total order on the fields in a tuple for the same purpose. Notice that if we were to implement it so that type D of the Serializability Theorem can be applied, the 'decision set' locks must also be part of this order; putting them before the field locks of the same row and after those of the previous row should be a good choice. This way, a deadlock can never happen, and so this is a deadlock elimination strategy.
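A sketch of such a total order, with table and field ranks that are purely illustrative (not from any real schema): a lock is identified by (table rank, primary key, field rank), and sorting lock requests by that tuple before acquisition guarantees every transaction acquires locks in the same global order. Note the 'decision set' pseudo-field is ranked before the ordinary fields of the same row, matching the placement suggested above.

```python
# Illustrative total order over tables and over fields within a tuple.
TABLE_ORDER = {"warehouse": 0, "district": 1, "customer": 2,
               "orders": 3, "order_line": 4}
FIELD_ORDER = {"decision_set": 0, "id": 1, "balance": 2, "data": 3}

def acquisition_order(lock_requests):
    """lock_requests: iterable of (table, primary_key, field) tuples.
    Returns them in the global total order in which they must be locked."""
    return sorted(lock_requests,
                  key=lambda r: (TABLE_ORDER[r[0]], r[1], FIELD_ORDER[r[2]]))

reqs = [("customer", 7, "balance"), ("district", 3, "data"),
        ("customer", 7, "decision_set")]
print(acquisition_order(reqs))
```

Because every transaction sorts its requests the same way, no two transactions can each hold a lock the other is waiting for.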

In this strategy, the order of acquiring locks in a transaction may not be the same as the order of executing the statements requiring those locks. This may imply that we need to execute the transaction as in the original 2PL design: we acquire locks in the first phase, then execute the statements, and release those locks in the second phase – this is also called conservative 2PL in the literature. In such a design, we need to access the page containing a data item twice: the first time to lock it (we at least need to see that it is there before we lock it) and the second time when we execute the statements. A disk-based database may suffer a performance hit, since it might need to bring the page into the memory pool twice in some cases. For example, after the page is brought in for the first time, it may become a victim of the memory pool's eviction algorithm and have to be brought in a second time. For a memory-based system like NDB Cluster, this performance hit should be marginal.
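A conservative 2PL skeleton can be sketched as follows (the lock names and transaction body are illustrative assumptions, not any real engine's API): all locks are acquired up front in the total order, the statements execute, and the second phase releases everything.

```python
import threading

class ConservativeTxn:
    """Sketch of conservative 2PL: acquire all locks first (in a total
    order, so no deadlock is possible), run the body, then release."""
    def __init__(self, lock_table):
        self.lock_table = lock_table        # name -> threading.Lock

    def run(self, lock_names, body):
        held = []
        try:
            for name in sorted(lock_names):  # phase 1: total-order acquire
                self.lock_table[name].acquire()
                held.append(name)
            return body()                    # execute the statements
        finally:
            for name in reversed(held):      # phase 2: release everything
                self.lock_table[name].release()

locks = {n: threading.Lock() for n in ("t.row1", "t.row2")}
txn = ConservativeTxn(locks)
print(txn.run({"t.row2", "t.row1"}, lambda: "committed"))  # prints "committed"
```

The double access the text mentions shows up here as well: the lock names must be known (the items located) before the body runs.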

Care must be taken when we implement such a design: besides the statements in a transaction, the so-called DML statements, there are also DDL statements in a database system. If we only impose the total order on the locks of DML statements and not on those of DDL statements, a deadlock could still result; and because we don't have a deadlock detection mechanism in this scenario, the system will stall when it happens.

So this is one possible future, a future in which the hot spot problem can be handled by 'pessimistic technology'. But right now, if we are asked whether we could handle the hot spot problem while providing serializability, the answer is probably negative. That is because Serializable Snapshot Isolation is born with the assumption that it doesn't have to deal with it, 2PL and Google's Spanner will demonstrate thrashing behavior, and the method I've developed for NDB Cluster will time out transactions that should have been waiting quietly for their turn. So it seems we don't have a very good solution with only third-tier technology (if you agree that the method I've developed for NDB Cluster is mainly third-tier technology).

Here enters the sophisticated second-tier database application developer.

In my opinion, the hot spot problem is really a resource problem, in which people compete to gain control of precious resources, like the Super Bowl tickets in Example 29. So the real solution is to increase the supply of resources. Unfortunately that is not always possible; the stadium for the Super Bowl, for instance, has only a limited number of seats. When it is impossible, it is the database application developer who will catch the ball. Since it is not a computer science problem, we can't guarantee solving it completely by applying computer science methods, and sometimes providing an alleviation is already the optimal solution. One approach to this end is to distribute the load to a few sites so that at each site the race becomes less intense.

Example 29 (Continuation...): Suppose the load is distributed evenly to five sites; then we have a better efficiency, e = 200 / [200 + (1 + 2 + … + 199) * 0.5] < 0.02. If, on the other hand, the load is distributed evenly to ten sites, the efficiency e = 100 / [100 + (1 + 2 + … + 99) * 0.5] < 0.04 is about 10 times better than in the original design.

Deploying the application to more than one site imposes extra costs for hardware provisioning, administration and coordination between sites. NDB Cluster, however, has the potential to accommodate this kind of alleviation scheme without introducing some of those costs. NDB Cluster is a database designed for telecom applications. Suppose a popular music event is hosted in a small town this year. When it happens, music fans flood into that small town while their phones are registered in the local service provider's database. In a more traditional design, this small town would probably be served by one single server, and this event would very likely bring that server to its knees. NDB Cluster instead hashes the incoming requests to all of its data nodes and hence has a larger probability of surviving it. This wisdom may help it survive the hot spot problem too.

Suppose that instead of setting up five sites to service the peak load, we split the 'ticket' into five fields so that each incoming customer will execute one of the following statements in her/his transaction:


        update t set ticket1=ticket1-1;

        update t set ticket2=ticket2-1;

        update t set ticket3=ticket3-1;

        update t set ticket4=ticket4-1;

        update t set ticket5=ticket5-1;
	  

This way, if the hash function employed by NDB Cluster happens to hash these five fields to five data nodes, it will simulate the 'setting up five sites to service the peak load' approach. I truly hope this capability can be built into NDB Cluster in the future to make sure (instead of depending on the hash function's statistical characteristics) that these five fields will be located on five different data nodes, so that NDB Cluster will survive a race condition better.
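The client side of this scheme can be sketched as follows. This is a hypothetical helper, not NDB Cluster code; `crc32` merely stands in for whatever hash the application would use:

```python
import zlib

TICKET_FIELDS = ["ticket1", "ticket2", "ticket3", "ticket4", "ticket5"]

def ticket_update_sql(customer_id):
    """Map a customer deterministically to one of the five split fields,
    spreading the decrements (and, with the hoped-for placement support,
    the data-node load) across the fields."""
    field = TICKET_FIELDS[zlib.crc32(str(customer_id).encode()) % 5]
    return f"update t set {field}={field}-1;"

print(ticket_update_sql("alice"))   # one of the five update statements
```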


	                                                                                                       ## 
	  

However, this simple 'distributing the load to a few sites' approach breaks down as soon as the situation becomes slightly more complex. Suppose that in Example 29, a restaurant boss in the stadium wants to run a promotion by providing 100 free meals for the first 100 tickets sold for the 5000 VIP seats. So the following statement is added to the transaction:


      update t1 set free_meal=free_meal-1 where ...;
	

Here free_meal starts with an initial value of 100. Of course, you may distribute the 'free_meal' field to the five sites like the 'ticket' field, so that the initial value at each site is 20. But then it is possible that at some sites not all 20 free meals will be given out. For example, suppose a site was initially set up NOT to sell VIP tickets, but that was changed right before the sale and the message wasn't communicated to the potential VIP seat buyers well enough. Then the total number of free meals given out could be less than 100. Now what if the restaurant boss is a very stringent one, considers the promotion unsuccessful and refuses to pay for it? In general, if we have some sort of global requirement on a data item in a race condition, it is hard to distribute it to different sites. (In this case, the global requirement on 'free_meal' is that it must reach the value 0 at the end; we don't have a similar requirement for the 'ticket' field.)

So if the 'distributing the load to a few sites' approach doesn't work, what do we do? The sophisticated second-tier database application developer might again have an answer: we may implement a control mechanism, a queue for example, at the second tier so that only a bounded number of the transactions causing the race are fed to the third tier per unit of time. When the queue is saturated, we may print out a message like 'The server is busy, please try again later.' and reject the request. Transactions that don't cause a race pass through without blocking. Designing this control system requires a detailed analysis, often a pressure test on your application, to find out what would stress your system and what would not. In general, handling the hot spot problem successfully requires collaboration between the second tier and the third tier. Since this is not the primary subject of this article, I won't expand in this direction any more. You are very much welcome to discuss it and share your experience with me though. This is an area I find intriguing.
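A minimal sketch of such a control mechanism follows. All names here are mine, not from any real middleware: a bounded queue admits hot transactions one at a time and rejects the overflow, while non-contending transactions bypass it.

```python
import queue
import threading

class HotSpotGate:
    """Second-tier admission control: at most `capacity` transactions
    that contend on the hot spot may be queued; excess ones are
    rejected so the client can be told 'The server is busy'."""

    def __init__(self, capacity, workers=1):
        self.q = queue.Queue(maxsize=capacity)
        for _ in range(workers):
            threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            txn = self.q.get()
            txn()                   # feed to the third tier, serially
            self.q.task_done()

    def submit(self, txn, is_hot):
        if not is_hot:
            txn()                   # no race: pass straight through
            return True
        try:
            self.q.put_nowait(txn)  # bounded: reject when saturated
            return True
        except queue.Full:
            return False            # 'please try again later'
```

Exactly which transactions count as hot, and how large `capacity` should be, is what the pressure test mentioned above has to determine.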

7.3 Security implications

To round up this section, I'd like to discuss the security implications of consistency a little bit. At the beginning of this article, I mentioned the Flexcoin incident. Now let's see how to mount such an attack under a Read-Committed isolation level like NDB Cluster's. According to [Si 14], the funds-transfer transaction from Flexcoin contains something like the following SQL snippet. (For our purpose, we've put back a check on newbalance before the write to it that is skipped in [Si 14]; the semantics will be slightly different from those in [Si 14].)


      mybalance = database.read("account-number")
      newbalance = mybalance - amount
      if newbalance >= 0 {
        database.write("account-number", newbalance)
        dispense_cash(amount)   // or send bitcoins to customer
      }
	

Now suppose two transactions are executed by the same customer on the same account concurrently and interleaved like this:


	               T1                                                              T2

      mybalance = database.read("account-number")

      newbalance = mybalance - amount

                                                                        mybalance = database.read("account-number")

                                                                        newbalance = mybalance - amount

      if newbalance >= 0 {

        database.write("account-number", newbalance)

        dispense_cash(amount)

      }

                                                                        if newbalance >= 0 {

                                                                          database.write("account-number", newbalance)

                                                                          dispense_cash(amount)

                                                                        }                                                                                                   
	

A conflict loop forms, consisting of a RW conflict from T2 to T1 and a WW conflict from T1 to T2 on the implicit 'balance' field, and it is a 'lost update' scenario. For example, assume the original account balance is 2 bitcoins and amount = 1 bitcoin. After both transactions commit, the account balance is 1 bitcoin since one of the deductions is 'lost'. But a total of 2 bitcoins are sent out since both transactions are successful.
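The interleaving above can be replayed deterministically with plain variables standing in for the database. This is a simulation of the schedule, not a real client:

```python
balance = 2        # bitcoins in the account
dispensed = 0
amount = 1

# Both transactions read and compute before either writes.
my1 = balance; new1 = my1 - amount    # T1's read and computation
my2 = balance; new2 = my2 - amount    # T2's read and computation

if new1 >= 0:
    balance = new1                    # T1 writes 1
    dispensed += amount
if new2 >= 0:
    balance = new2                    # T2 also writes 1: T1's update is lost
    dispensed += amount

print(balance, dispensed)             # 1 2 -- one deduction was lost
```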

In the attack that actually happened, the threat actor first created an account with a small amount of bitcoins in it. She/he then magnified this 'lost update' effect by sending thousands of concurrent transactions against this account. By the time the account balance reached zero and the transactions stopped sending out bitcoins, an amount much larger than the original small balance had already been sent out. The threat actor then created more accounts and repeated this process until Flexcoin's hot wallet was drained.

We may replace the fourth line of the snippet with an update like the following to see what happens, taking 'balance' to be the field in the database that is updated:


      update t set balance = balance - amount
         where account = account-number;
	

Although the original RW and WW conflicts still exist and form a loop, the corresponding history is no longer a 'lost update' scenario, since the second update actually sees the first update through the implicit read of balance and acts accordingly. After both transactions commit, the account balance will be 0 if it started with 2 bitcoins. So this modification seems to make the application correct, right?

It turns out this only makes it harder, NOT impossible, to mount an attack. Suppose a third transaction is executed and we manage to interleave the statements like this: the three database.reads and test conditions are executed first, then the three updates. All three transactions will commit successfully and the 'balance' field will be left with the value -1.
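Replaying that interleaving the same way, with the update now re-reading balance as `set balance = balance - amount` does:

```python
balance = 2
dispensed = 0
amount = 1

# The three reads and tests all run first, against the same balance of 2.
passed = [(balance - amount) >= 0 for _ in range(3)]   # [True, True, True]

# Then the three updates run; each re-reads balance implicitly,
# so none of them is lost -- the field is simply driven below zero.
for ok in passed:
    if ok:
        balance -= amount
        dispensed += amount

print(balance, dispensed)    # -1 3
```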

Now let's see how to manage this interleaving and mount an attack in the real world. If the client end provided by the bitcoin exchange is open source and the transfer transaction code is in it, all you have to do is to insert a statement like


      sleep(10000)                                                                                                    
	

after


      newbalance = mybalance - amount
	

and issue, say, 1000 such transactions to the server. Most of the transactions will line up in the exact pattern we need. If the client end is not open source but the transfer transaction is still issued there, the threat actor might still manage to achieve this, since the client end resides in an environment controlled by the threat actor. If the transaction is issued from an application server instead, the threat actor needs to hack it first, which makes the attack much harder, although not impossible.

Some database application developers try to tackle consistency related issues with naive approaches. For example, for an important field like 'balance', they may impose rules in the application to make sure it is always non-negative, or use a trigger on it to take action as soon as it becomes negative, such as shutting down the application. The 'lost update' scenario shows that the first approach doesn't work at all in that case. The second approach is better, but in the scenario not involving a 'lost update', the threat actor can still, in the best case, double her/his bet.

Therefore I highly recommend using the systematic way developed in this article to fence off this kind of threat. To that end, any of the four serializability implementations we've explored will certainly do the job, since a conflict loop will not be possible at all. It turns out even Snapshot Isolation is good enough for the example in this sub-section, since the update of the 'balance' field already makes any two such transactions non-concurrent. Notice that we can't apply the Generalized Serializability Theorems here, since all the transactions are Read-Write and all the reads in them affect writes.

I will stay in this field for at least five more years and see if we can solve the serializability problem completely in this span of time. You are very much welcome to join me in this effort.

References

[Ad 00] Adya, A., Liskov, B., O'Neil, P. [2000]. “Generalized Isolation Level Definitions”, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073), Publication Year: 2000, Page(s): 67–78.
[Be 87] Bernstein, P., Hadzilacos, V., Goodman, N. [1987]. “Concurrency Control and Recovery in Database Systems”, ISBN 0-201-10715-5. https://www.microsoft.com/en-us/research/people/philbe/book/
[Be 95] Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P. [1995]. “A Critique of ANSI SQL Isolation Levels”, SIGMOD '95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data, May 1995, Pages 1–10.
[Ca 08] Cahill, M., Rohm, U., Fekete, A. [2008]. “Serializable Isolation for Snapshot Databases”, ACM Transactions on Database Systems, December 2009, Article No.: 20.
[Co 13] Corbett, J. et al. [2013]. “Spanner: Google's Globally Distributed Database”, ACM Transactions on Computer Systems, Vol. 31(2013), Issue 3, Article No.: 8.
[Es 76] Eswaran, K., Gray, J., Lorie, R., Traiger, I. [1976]. “The notions of consistency and predicate locks in a database system”, Commun. ACM, 19(11): 624–633, Nov. 1976.
[Fe 05] Fekete, A., Liarokapis, D., O'Neil, E., O'Neil, P., Shasha, D. [2005]. “Making Snapshot Isolation Serializable”, ACM Transactions on Database Systems, June 2005.
[Ga 94] Gamma, E., Helm, R., Johnson, R., Vlissides, J. [1994]. “Design Patterns: Elements of Reusable Object-Oriented Software”, ISBN 0-201-63361-2.
[Gr 93] Gray, J., Reuter, A. [1993]. “Transaction Processing: concepts and techniques”, ISBN 1-55860-190-2.
[Ki 13] Kingsbury, K. [2013]. “The trouble with timestamps”.
[Ku 14] Kulkarni, S., Demirbas, M., Madeppa, D., Avva, B., Leone, M. [2014]. “Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases”, the 18th International Conference on Principles of Distributed Systems (OPODIS-14) , 2014.
[La 78] Lamport, L. [1978]. “Time, clocks, and the ordering of events in a distributed system”, Communications of the ACM, 21(7):558–565, July 1978.
[Ma 83] Marzullo, K., Owicki, S. [1983]. “Maintaining the time in a distributed system”, Proceedings of the second annual ACM symposium on Principles of distributed computing, PODC 1983, pp. 295-305.
[Mi 81] Mills, D. L. [1981]. “Time synchronization in DCNET hosts”, Internet Project Report IEN-173, COMSAT Laboratories, Feb. 1981.
[Si 02] Silberschatz A., Galvin P., Gagne G.. [2002] “Operating system concepts”, Sixth Edition, ISBN 9971-51-388-9.
[Si 14] Sirer, E. [2014]. “Nosql meets bitcoin and brings down two exchanges: The story of flexcoin and poloniex”.
[To 93] Thomasian, A. [1993]. “Two phase locking performance and its thrashing behavior”, ACM Transactions on Database Systems, December 1993.
[Tr 16] Tracy, N. [2016]. “Serializable, Lockless, Distributed: Isolation in CockroachDB”.
[Wa 17] Warszawski, T., Bailis, P. [2017]. “ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications”, SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, May 2017.

Appendix A: The theoretical framework, etc.

In this appendix, let's first briefly recap the process of applying the major theorems to an NDB Cluster application; then we summarize the theoretical framework of this article.

If only consistency, not performance, matters, the helper program available here will do the job for you. Based on its output, set up the prevention measures as needed, and it will guarantee consistency by type B of the Serializability Theorem if timestamp fields and auto-increment fields are not present. Read locks on some of the fields can be skipped if they are accompanied by a field write or the conflicting writes are insertions, since these are not real conflicts. For the granular locking necessary to fence off predicate RW conflicts, make sure to modify your application so that the rows for tables like t_s_n, t_o_n and t_d1_n in Example 16 are inserted and deleted in the appropriate transactions.

The approach just described is purely syntactic. In the predicate RW conflict case, the analysis is based completely on Condition A; in the item RW conflict case, the fact that the column read by a read and the column written by a write are the same doesn't necessarily mean the accesses happen on the same tuple. In many cases, the semantics of the application suggest that many possible conflicts in the syntactic approach's output are just false positives. This is demonstrated in Example 16, where many potential conflicts with 'not a conflict' in the comment are such cases. Identifying these bogus conflicts will certainly improve the performance of your system since it reduces memory footprint, CPU time and, most importantly, lock waiting time. If you need my help on this in the free or paid service I am providing, make sure you describe your application's semantics as precisely as possible. The TPCC specification provides a very good example to that end.

In case timestamp fields and/or auto-increment fields are present, inconsistencies may show up in the ramification set of these fields. We might want to apply the Ramification Theorem to it, since some of the prevention measures on the ramification set can then be removed and performance will be enhanced. The details are described in Example 21.

In the whole process, if consistency of some of the reads (like those in the Read-Only transactions) can be sacrificed, we may apply the Generalized Serializability Theorems or their corresponding Ramification Theorems to remove further prevention measures and give performance a boost. Whether this is applicable is certainly application specific.

The last method to enhance performance is, of course, to modify the application so that some of the conflicts go away. This might sound daunting at first glance, but the following examples we've seen throughout this article are both demonstrations of it: constraining the delivery transaction so that only one per warehouse id and district id is executing in the system, so that conflict d2->d2 goes away; and isolating the time for execution of the delayed delivery transactions from that for conflicting new-order transactions, so that conflicts d1->n and d2->n go away. Exactly how this can be done is, again, application specific.

A.1 The theoretical framework

In this article, I built a theoretical framework for the Serializability Theorems on top of Adya's. The process combines the original framework with new concepts and definitions, which might seem messy to some of the audience. I'd like to clear things up in this sub-section by restating it for your reference.

A transaction is like a program in a programming language like C. It may have input parameters, statements of a regular programming language like branching statements and loops, deterministic math calculations, non-deterministic math calculations (like calling the Rand() function in MySQL) and finally SQL statements. SQL statements are of two types: those that access the database and those that don't. For those that don't access the database, their behavior depends only on the environment, just like regular programming language statements; for those that do access the database, their behavior also depends on the database state. What is more, no matter which branches the execution might follow, the SQL statements in a transaction are executed in a sequential order.

A database consists of data objects that can be read or written by transactions. The data objects dealt with in this article are tuples, fields and, most importantly, 'decision set's. Tuples and fields correspond to rows and fields in a table respectively. When we constrain the predicate in a SQL statement to be one that involves only one table, we define the 'decision set' to be the set of columns in that table that a predicate P uses to decide whether a tuple is selected for later access. More complex predicates involving more than one table, like joins, sub-queries and unions, are handled separately.

A data object has one or more versions. However, transactions interact with the database only in terms of data objects; the database system maps each operation on a data object to a specific version of that object. A transaction may read versions created by committed, uncommitted, or even aborted transactions; constraints imposed by some isolation levels prevent certain types of reads, for example, reads of versions created by aborted transactions.

When a transaction writes a data object x, it creates a new version of x. A transaction Ti, i>=0, can modify a data object multiple times; its first modification of data object x is denoted by xi:1, the second by xi:2, and so on and they are called internal versions. Version xi denotes the final modification of x performed by Ti before it commits or aborts. A transaction’s last operation, commit or abort, indicates whether its execution was successful or not; there is at most one commit or abort operation for each transaction. The committed state reflects the modifications of committed transactions. When transaction Ti commits, each version xi created by Ti becomes a part of the committed state and we say that Ti installs xi; the system determines the ordering of xi relative to other committed versions of x. If Ti aborts, xi does not become part of the committed state.

Conceptually, the initial committed state comes into existence as a result of running a special initialization transaction, Tinit. Transaction Tinit creates all data objects that will ever exist in the database; at this point, each data object x has an initial version, called the unborn version. When an application transaction creates a data object x (e.g., by inserting a tuple in a relation), we model it as the creation of a visible version for x. Thus, a transaction that loads the database creates the initial visible versions of the objects being inserted. When a transaction Ti deletes a data object x (e.g., by deleting a tuple from some relation), we model it as the creation of a special dead version, i.e., in this case, xi is a dead version. Thus, data object versions can be of three kinds: unborn, visible, and dead.

If a data object x is deleted from the committed database state and inserted later, we consider the two incarnations of x to be distinct objects. When a transaction Ti performs an insert operation, the system selects a unique data object x that has never been selected for insertion before and Ti creates a visible version of x if it commits.

We assume data object versions exist forever in the committed state to simplify the handling of inserts and deletes, i.e., we simply treat inserts/deletes as write (update) operations. An implementation only needs to maintain visible versions of data objects, and a single-version implementation can maintain just one visible version at a time. Furthermore, application transactions in a real system access only visible versions.

A history H over a set of transactions consists of two parts: a partial order of events E that reflects the order of the operations (e.g., read, write, abort, commit) of those transactions, and a version order, <<, which is a total order on committed versions of each data object.

Finding a way to impose such a total order on committed versions of each data object should be easy. For instance, in a system with a field level capable locking system, we may use the order of locking events on a field to introduce such an order; for a system with Snapshot Isolation that is lock free, the complete isolation in time of two transactions updating the same field should give rise to such an order. In the case tuple versions are present, the situation is similar if a total order of tuple versions is used; otherwise an order of field versions is derived from the tuple versions. For 'decision set' versions, they are either derived as in the type B case or defined as in the type D case.

Besides the committed versions, we also need to define an order for the internal versions of a data object inside a transaction, in case it is modified more than once. We've explained earlier, before Example 1, that the execution of a transaction is a sequential process, namely, on a statement by statement basis. So, assuming each statement doesn't update a data object twice, the order of these versions is obvious, for field, tuple and 'decision set' versions alike.

The partial order 'happened before' is defined in Example 6. In general, we may impose a stronger condition on the write events of the same data object by requiring that they happen in a serial order so that it coincides with the 'happened before' partial order. This implies the version order of the committed versions coincides with the 'happened before' partial order too, namely, one version is before another if and only if it 'happened before' another.

A write operation on data object x by transaction Ti is denoted by wi(xi) (or wi(xi:m)); if it is useful to indicate the value v being written into xi, we use the notation wi(xi, v). When a transaction Tj reads a version of x that was created by Ti, we denote this as rj(xi) (or rj(xi:m)). If it is useful to indicate the value v being read, we use the notation rj(xi, v). In a committed transaction, wi(xi) writes the committed version xi.

Definition 1: Version set of a predicate read.

When a transaction executes a read or write based on a predicate P, the system selects a version for each tuple in P’s relations. The set of selected versions is called the version set of this predicate read and is denoted by Vset(P).

Notice the versions in Vset(P) could be either committed or internal ones, and they could be of field, tuple or 'decision set' granularity. Predicate P is then evaluated against version set Vset(P); the tuples with a version in the version set that satisfies the predicate are chosen for later access. This evaluation process is the realization of a predicate read. A predicate read performed by transaction Ti is denoted by ri(P: Vset(P)), and let's call the subset of Vset(P) that satisfies the predicate the matching set. A write can change the matching set in two ways: it changes a non-matching version outside the set into a matching version inside the set (the IN operation), or it changes a matching version inside the set into a non-matching version outside the set (the OUT operation).
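In code, the matching set and the IN/OUT classification of a write look like this. This is a toy model of Definition 1; the predicate and the data are made up:

```python
def matching_set(P, vset):
    """Keys in Vset(P) whose selected version satisfies predicate P."""
    return {k for k, version in vset.items() if P(version)}

def classify_write(P, old_version, new_version):
    """Return 'IN', 'OUT', or None for a write's effect on P's matches."""
    before, after = P(old_version), P(new_version)
    if after and not before:
        return "IN"    # non-matching -> matching
    if before and not after:
        return "OUT"   # matching -> non-matching
    return None        # the match is unchanged

P = lambda row: row["qty"] < 10                     # hypothetical predicate
vset = {"x": {"qty": 5}, "y": {"qty": 12}}
print(matching_set(P, vset))                        # {'x'}
print(classify_write(P, {"qty": 12}, {"qty": 3}))   # IN
```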

Now let's start listing the conditions for the Serializability Theorems.

Condition 1: The 'happened before' partial order preserves the order of all events within a transaction including the commit and abort events.

Predicate reads are also included in E; the isolation of them is Adya's major contribution to this framework.

Condition 2': If an event rj(xi:m) exists in E, it is preceded by wi(xi:m) in E, i.e., a transaction Tj cannot read version xi:m of object x before it has been produced by Ti; if an event rj(P: Vset(P)) exists in E, xi belongs to Vset(P) and xi:m is used for x's match, then wi(xi:m) exists and precedes the matching read. Notice that the version read by Tj is not necessarily the most recently installed version in the committed database state. Also, Ti may be uncommitted when rj(xi:m) or rj(P: Vset(P)) occurs, for example, when i equals j.

In Condition 2' we didn't claim wi(xi:m) 'happened before' rj(P: Vset(P)). A predicate read could be a lengthy process, for example a full table scan in some cases, and wi(xi:m) could overlap with rj(P: Vset(P)).

Condition 3': If an event wi(xi:m) is followed by ri(xj) without an intervening event wi(xi:n) in E, xj must be xi:m; if an event wi(xi:m) is followed by ri(P: Vset(P)) without an intervening event wi(xi:n) in E and x belongs to Vset(P), then the version of x in Vset(P) is xi:m. This condition ensures that if a transaction modifies object x and later reads x, in an item read or a predicate read, it will observe its last update to x.

In some Snapshot Isolation implementations, Condition 3' could be violated since every read is supposed to read from the snapshot when the transaction starts. We've demonstrated after the proof of the Serializability Theorem that the proof will go through even so.

Here in Conditions 2' & 3', a version of x could refer to a field, a tuple or a 'decision set' one.

Condition 4: The history must be complete. Namely, if E contains a read or write event that mentions a transaction Ti, E must contain a commit or abort event for Ti.

From now on, we'll consider only complete histories. Notice that Condition 4 allows us to consider only committed transactions, since atomicity guarantees that the aborted ones have no effect on the system. So we'll only consider systems that can provide atomicity from this point on.

Definition 2: Change the matches of a predicate read.

We say that a transaction Ti changes the matches of a predicate read rj(P: Vset(P)) of Tj if Ti installs xi, xh immediately precedes xi in the version order, and xh matches P whereas xi does not or vice-versa. In this case, we also say that xi changes the matches of the predicate read.

The definition above identifies Ti as a transaction where a change occurs in the matching set of rj(P: Vset(P)). Note that i can be equal to j. Versions xi and xh here could each represent a field, a tuple or a 'decision set' version.

Definition: Conflict with

For i not equal to j, a write in transaction Ti which establishes a version xi is said to conflict with a predicate read rj(P: Vset(P)) for predicate P in transaction Tj if the following two conditions are satisfied:

Condition A: The intersection of the 'write set' of the write operation that establishes xi and the 'decision set' of P is not empty.
Condition B: The write operation in Ti changes the match of predicate P, and its generated version xi is the closest, in the version order, to the version used in the predicate read, either immediately before or immediately after it.

A version like xi in the last definition could represent a field, a tuple or a 'decision set' version.

Definition 3': Directly predicate-read-depends.(predicate WR conflict).

Transaction Tj directly predicate-read-depends on Ti if Tj performs an operation rj(P: Vset(P)), xk belongs to Vset(P), i = k or xi << xk, and the write in Ti that establishes xi conflicts with rj(P: Vset(P)).

Definition 4': Directly predicate-anti-depends.(predicate RW conflict).

Transaction Tj directly predicate-anti-depends on Ti if Tj performs an operation rj(P: Vset(P)), xk belongs to Vset(P), xi >> xk, and the write in Ti that establishes xi conflicts with rj(P: Vset(P)). Also, Tj doesn't generate the version after xk itself.

Again, a version like xi in these two definitions represents a field, a tuple or a 'decision set' version. Notice there is an extra requirement in Definition 4' compared with Definition 4 in Adya's work: Tj doesn't generate the version after xk itself. If Tj does generate the version after xk itself, the dependence between Tj and Ti is captured by a sequence of WW conflicts on x.

Definition 5: Directly item-read-depends(item WR conflict).

We say that Tj directly item-read-depends on Ti if Ti installs some object version xi and Tj reads xi.

Definition 6: Directly item-anti-depends(item RW conflict).

We say that Tj directly item-anti-depends on transaction Ti if Ti reads some object version xk and Tj installs x’s next version (after xk) in the version order.

Definition 7: Directly Write-Depends(WW conflict).

A transaction Tj directly write-depends on Ti if Ti installs a version xi and Tj installs x’s next version (after xi) in the version order.

Again, a version like xi in these three definitions represents a field, a tuple or a 'decision set' version.

Definition 8: We define the direct serialization graph arising from a history H, denoted by DSG(H), as follows. Each node in the graph corresponds to a committed transaction and directed edges correspond to different types of direct conflict. There is a read/write/anti-dependency edge from transaction Ti to transaction Tj if Tj directly read/write/anti-depends on Ti.

Here we combine Definitions 3' & 5 and Definitions 4' & 6, and call them Directly Read-Depends (WR conflict) and Directly Anti-Depends (RW conflict) respectively, for convenience.

Condition 5: Aborted Reads.

A history H shows phenomenon 'Aborted Reads' if it contains an aborted transaction T1 and a committed transaction T2 such that T2 has read some object (maybe via a predicate read) modified by T1. This phenomenon can be represented using the following history fragments:


        w1(x1:i) ... r2(x1:i) ... (a1 and c2 in any order)

        w1(x1:i) ... r2(P: x1:i, ...) ... (a1 and c2 in any order)
                                                                                        
	  

Here a1 and c2 refer to the abort of T1 and the commit of T2 respectively.

Condition 6: Intermediate Reads.

A history H shows phenomenon 'Intermediate Reads' if it contains a committed transaction T2 that has read a version of object x (maybe via a predicate read) written by transaction T1 that was not T1’s final modification of x. The following history fragments represent this phenomenon:


        w1(x1.i) ... r2(x1.i) ... w1(x1.j) ... c2

        w1(x1.i) ... r2(P: x1.i, ...) ... w1(x1.j) ... c2

Condition 7: Cycle of Conflicts.

A history H shows phenomenon 'Cycle of Conflicts' if DSG(H) contains a directed cycle with each edge being one of the five types of conflicts we defined earlier.
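
Since the edge labels don't matter for detecting the phenomenon itself, checking for a 'Cycle of Conflicts' reduces to ordinary directed-cycle detection on DSG(H). A minimal sketch, assuming a hypothetical adjacency-set representation of the graph:

```python
# Hypothetical sketch: detecting the 'Cycle of Conflicts' phenomenon
# (Condition 7) in a DSG given as an adjacency mapping. The kind of each
# conflict edge (WR/RW/WW, item- or predicate-based) is irrelevant here.

def has_conflict_cycle(dsg):
    """dsg: dict mapping a committed transaction id to the set of
    transaction ids it has a direct conflict edge to."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on stack / finished
    color = {t: WHITE for t in dsg}

    def visit(t):
        color[t] = GRAY
        for u in dsg.get(t, ()):
            if color.get(u, WHITE) == GRAY:      # back edge: directed cycle
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(dsg))
```

For example, three transactions whose conflicts chain T1 → T2 → T3 → T1 exhibit the phenomenon, while the same chain without the closing edge does not.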

Definition 9: Let t be the real clock time of the reference frame in which the system is located and upon which the 'happened before' partial order is defined. Given an ordered sequence of transactions T1, T2, …, Tn, pick an API node as a coordinator and execute the transactions in the given order so that the next transaction is only started after the previous one reports its state as committed to the coordinator. When a transaction is executed, it always reads the latest committed version of data object that is available. Such an execution is called the serial history of T1, T2, …, Tn.

From this definition, it is clear that the order of transactions coincides with the order of write events of the same objects in the 'happened before' partial order, hence coincides with the total order of the object versions.

We assume that when a transaction reports back to the API coordinator its state as committed, all of its modifications are available for reading.

With this assumption, it's easy to see that a later transaction in the execution sequence sees the latest modifications in transactions before it: the time for the modifications of a previous transaction to be available is before the time for the commit message to reach the coordinator, which is before the time to start a later transaction, which is once again before the time the reading happens; and since the transaction reads the latest available modification, it reads the latest modifications in transactions before it.

On the other hand, an earlier transaction in the execution sequence sees no modification of a later transaction. Suppose the opposite were true and an earlier transaction could read a modification of a later transaction. The write operation 'happened before' the read operation for any reasonable implementation such that Condition 2' is satisfied with the 'happened before' partial order. However, the read operation 'happened before' the commit of the earlier transaction, which 'happened before' the start of the later transaction, which again 'happened before' the write operation. This contradicts the fact that 'happened before' is a strict partial order.
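
The coordinator protocol of Definition 9 can be sketched as follows. This is a toy, single-process model; the store interface and transaction shapes are my assumptions, not part of the definition:

```python
# Sketch of Definition 9 (names hypothetical): the coordinator starts the
# next transaction only after the previous one has reported 'committed',
# and every read returns the latest committed version available.

class VersionStore:
    def __init__(self):
        self.committed = {}              # object -> committed versions, in order

    def read_latest(self, obj):
        versions = self.committed.get(obj)
        return versions[-1] if versions else None

    def install(self, obj, value):
        self.committed.setdefault(obj, []).append(value)

def run_serial_history(store, transactions):
    """transactions: ordered list of callables that each return only once
    the transaction's state is 'committed'."""
    for txn in transactions:
        txn(store)                       # strictly sequential execution

store = VersionStore()
def t1(s): s.install("x", 1)
def t2(s): s.install("x", s.read_latest("x") + 1)   # sees T1's modification
run_serial_history(store, [t1, t2])
```

Because t2 starts only after t1 commits, it necessarily reads t1's value, matching the argument above that a later transaction sees the latest modifications of all transactions before it.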

The assumptions for the four types of the Serializability Theorem are as follows.

Type A: All the conflicts are based on tuple versions. It is used to demonstrate that Adya's Serializability Theorem can actually be proved, without resorting to the results in [Be 87].

For this type, the first internal tuple version generated by a write operation needs extra attention. Besides the fields that are actually written by the write operation, there might be fields that are not. Since we are generating a tuple version, we have to fill in a value for each of those. Where do these values come from?

There are three cases. In the first case, all fields in the tuple are modified and we need to do nothing. In the second case, a tuple read 'happened before' the write in the same statement, as in 'update...set col=col+1'; in this case, the values naturally come from the tuple read. In the third case, it is a blind write like 'update grades set grade="A"'. In this article, we assume the following:

The third case works the same as the second one except that it reads a tuple version implicitly before the write. In other words, there is no real blind write in our framework.
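
Under this assumption, every write of a partial set of fields can be modeled as a read-modify-write on the tuple. A minimal sketch (field names hypothetical):

```python
# Hypothetical sketch of the assumption above: a blind write is treated as
# an implicit tuple read followed by the write, so the unwritten fields of
# the first internal tuple version come from the version just read.

def write_tuple(latest_committed, written_fields):
    """latest_committed: field -> value mapping returned by the implicit
    tuple read; written_fields: the fields the statement actually writes."""
    internal = dict(latest_committed)    # implicit read fills every field
    internal.update(written_fields)      # the write overlays its own fields
    return internal                      # first internal tuple version

old = {"name": "alice", "grade": "C"}
new = write_tuple(old, {"grade": "A"})   # blind write of 'grade' only
```

The unwritten field 'name' is carried over from the implicitly read version, so the internal tuple version is fully populated.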

For predicate-based conflicts, Definitions 1 & 2 remain the same as in Adya's paper. The digression starts with the added Definition for the 'conflict with' concept. The item-based conflicts remain the same as in Adya's paper, of course.

Type B: Updates in transactions still generate tuple versions, both internal and committed ones. Field versions are derived from these tuple versions as usual. All the regular item-based conflicts are interpreted at field level, namely, based on the derived field versions.

Like the derived field versions, we also derive 'decision set' versions from these tuple versions. Namely, a derived 'decision set' version is generated whenever fields in the 'decision set' are modified in a tuple write and these derived 'decision set' versions are ordered the same way as their associated tuple versions.

For predicate reads, we use derived 'decision set' versions to capture the conflicts. This is true for Definitions 1, 2, 3' and 4' and the added Definition for the 'conflict with' concept.

There is, however, a new kind of conflict in this type of the Serializability Theorem: the WR, RW and WW conflicts between reads and writes of the committed derived 'decision set' versions, defined just like those for tuple or field versions. It is easy to see that the serial order of write events on the 'decision set' coincides with the 'happened before' partial order, since that of the tuple write events does. After the introduction of this new set of conflicts, we also include them among the item-based conflicts, as opposed to the predicate-based ones. In particular, Definitions 5, 6 & 7 for type B now include committed 'decision set' versions, not just committed field versions.
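
The derivation of 'decision set' versions from the tuple version order can be sketched as follows. The representation is hypothetical, and for simplicity this sketch treats a value change in a decision-set field as a modification:

```python
# Sketch: deriving 'decision set' versions from an ordered sequence of
# committed tuple versions, as in type B. A derived version appears
# whenever a decision-set field is modified, and the derived versions
# inherit the order of their associated tuple versions.

def derive_decision_set_versions(tuple_versions, decision_set):
    derived, last = [], None
    for tv in tuple_versions:                   # already in version order
        projection = {f: tv[f] for f in decision_set}
        if projection != last:                  # a decision-set field changed
            derived.append(projection)
            last = projection
    return derived

versions = [
    {"a": 1, "b": 10, "c": 0},
    {"a": 1, "b": 20, "c": 0},   # only 'b' changed: no new derived version
    {"a": 2, "b": 20, "c": 0},   # 'a' changed: new derived version
]
ds_versions = derive_decision_set_versions(versions, {"a", "c"})
```

Here writes that touch only fields outside the decision set produce no derived version, so reads and writes of the 'decision set' conflict only through its own fields.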

As in the tuple write case in type A, we may require an implicit 'decision set' version to be read before a 'decision set' version write if not all the fields in the 'decision set' are written. That is because in a system that still generates tuple versions, a 'decision set' version write where not all the fields in the 'decision set' are written implies a tuple version write where not all the fields in it are written. Hence we may require that a tuple read must precede such a tuple write as in the type A case and the derived 'decision set' read will serve our purpose.

For possible item reads following a predicate read, we will use the 'derived' field versions from the same tuple version associated with the matching 'decision set' version.

Type C: This type of the Serializability Theorem still requires the tuple concept which corresponds to the row concept in a table, but it doesn't require the concept of tuple versions. It does require the existence of field versions and a total order generated by serial write operations on that field. All conflicts are interpreted at the field level and this implies that the 'decision set' of a predicate read only consists of one field. It also requires a way to identify which field version to read after a predicate read if item reads follow.

Type C of the Serializability Theorem is designed for the future, for example, when a field level capable locking system is available. Notice that the item-based conflicts between reads and writes of 'decision set' versions are just those between field versions.

In type C, the requirement on the way to specify which field versions to read after a predicate read is matched is implicit: any way works, as long as it makes a serial history satisfy its key property, namely, when a transaction is executed, it always reads the latest committed version of data object that is available.

Type D: Type D of the Serializability Theorem assumes a field version system as in Type C of the Serializability Theorem.

On top of it, we need a systematic way to define a 'decision set' version system so that the following requirements are met for type D of the Serializability Theorem.

  1. A way to define the system.
  2. A way to read the 'decision set' versions in a predicate read without mistakenly using an arbitrary set of field versions in the 'decision set' instead, so that predicate-based conflicts can be defined.
  3. A way to read the 'decision set' versions for the initial internal version without mistakenly using an arbitrary set of field versions in the 'decision set' instead, so that successive internal versions and eventually the committed 'decision set' version can be defined.
  4. And also a way to introduce item-based conflicts between reads and writes of 'decision set' versions so that we can reason about conflicts with them.

Of course, it also requires a way to identify which field versions to read after a predicate read as in type C if item reads follow.

Definition 10: Let H and H' be histories generated by different executions of the same set of transactions, H is equivalent to H' if

  1. H and H' give rise to the same set of database operations: corresponding predicate reads return the same set of tuples, corresponding item reads read the same version of data with same value, while corresponding writes write the same value to the same version of data, for both internal and committed versions. Notice that for type B, the item-based operations also refer to those for 'decision set' versions.
  2. DSG(H) = DSG(H')

H is serializable if it is equivalent to some history H' that is a serial history.
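
As an illustration of the two requirements in Definition 10, a minimal sketch might compare operation multisets and serialization graphs. The history representation here is entirely hypothetical:

```python
# Sketch of Definition 10 (representation hypothetical): two histories are
# equivalent when corresponding operations read/write the same versions
# with the same values, and their DSGs coincide.

def equivalent(h1, h2):
    """A history here is a dict with an 'ops' list of normalized operation
    records and a 'dsg' adjacency mapping."""
    same_ops = sorted(map(repr, h1["ops"])) == sorted(map(repr, h2["ops"]))
    same_dsg = h1["dsg"] == h2["dsg"]
    return same_ops and same_dsg

# Two interleavings of the same operations with the same conflict graph:
h = {"ops": [("r", "T2", "x", 1), ("w", "T1", "x", 1)],
     "dsg": {"T1": {"T2"}}}
h_prime = {"ops": [("w", "T1", "x", 1), ("r", "T2", "x", 1)],
           "dsg": {"T1": {"T2"}}}
```

Note that requirement 1 compares the operations themselves, not their interleaving order; the order-sensitive information is captured by the DSG in requirement 2.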

So this is the framework we've developed in this article. On top of this, we may present the Serializability Theorems and the Ramification Theorems, which I will not repeat here. If you wonder why we make all these design decisions, juicy details and discussions are readily available in the article.

Next, let's demonstrate a weird Read-Committed isolation level to which we can't apply these theorems in the following example.

Example 30: Consider a system implementing an isolation level the same as MySQL InnoDB's Repeatable-Read isolation level, except that when an item read is requested, the system doesn't just serve the latest version before a transaction's starting timestamp; instead, a random function is used to choose a version before that timestamp to fulfill the request.

Such an isolation level is indeed a Read-Committed one since Condition 6 is satisfied. Consider the following pair of transactions T1 and T2, where T1 modifies a field, T2 reads it, and this is the only possible conflict between them. Then execute {T1, T2} in that order and repeat this multiple times. The results are usually NOT identical if the random function is good enough, since T2 might read that field from the original database state or from the version established by T1. But whatever the outcome history is, it is guaranteed NOT to contain a conflict loop. Should the Serializability Theorem be applicable here, the outcome history would be equivalent to a serial one. However, in the case where T2 reads the update by T1, the equivalent serial history would have to be {T1, T2}. But {T1, T2} can't be a serial history since the following condition in Definition 9, namely the definition of a serial history, is not satisfied: when a transaction is executed, it always reads the latest committed version of data object that is available.
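
A sketch of the pathological read rule in Example 30 makes the nondeterminism concrete. All names and the version representation are hypothetical:

```python
# Sketch of Example 30's weird Read-Committed level: an item read serves a
# *random* committed version written before the transaction's starting
# timestamp, instead of the latest one.

import random

def item_read(versions, start_ts, rng=random):
    """versions: list of (commit_ts, value) in commit order. Any version
    committed before start_ts may be served, so repeated executions of
    {T1, T2} need not produce identical results."""
    visible = [v for ts, v in versions if ts < start_ts]
    return rng.choice(visible)

versions = [(1, "original"), (5, "written by T1")]
value = item_read(versions, start_ts=10)   # either version may come back
```

With start_ts between the two commits only the original value is visible, but once T1's version is also visible the choice is random, which is exactly why no run of {T1, T2} can be equated with a serial history.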


	                                                                                                       ## 
	  

In type C and type D of the Serializability Theorem, we claimed that the requirement on the way to specify which field versions to read after a row is matched in a predicate read is implicit: any way works, as long as it makes a serial history satisfy its key property, namely, when a transaction is executed, it always reads the latest committed version of data object that is available. From this example, we see why it is necessary; from the proofs of type C and type D of the Serializability Theorem, we see why it is sufficient.

Appendix B: The 'happened before' partial order

The 'happened before' partial order is inspired by the one with the name “happened before” in [La 78]. In [La 78], an event is represented as a point in time and a node in a distributed system is considered to be a process consisting of a sequence of events in a total order, since in the good old days processes were still single-threaded. Between nodes, events are also connected with a type II edge as in the 'happened before' partial order. Notice a 'node' here could be the CPU, the memory unit, or an input-output channel if a single computer is viewed as a distributed system, as long as each of these is a process.

In this article we instead model an event as a time interval so that events in a process can overlap, since nowadays processes are all multi-threaded and there is usually more than one process per node in the distributed system(for example, in NDB Cluster, there is an 'angel' process which watches over the primary process and restarts it if it fails). This way, two events from two different threads in a node don't have to come one after another in the order of time, which makes more sense. Also, when two concurrent transactions – executed by two threads – release and acquire the same lock in that order, this pair of events can be represented by a type I edge. If we were to simply generalize the original “happened before” partial order in [La 78] to the thread case and interpret inter-thread messaging with type II edges, we would see type II edges inside a node and the model would become more complex. However, this may be a better approach in some cases; we will see why in Appendix D when we examine CockroachDB.

In the time interval representing an event, we use readings from a physical clock in the node. It would be perfectly OK if we used the real clock time of the reference frame instead. In other words, had we used the real clock time of the reference frame, all the relevant arguments in the proofs would remain correct, since this only affects type I edges and I've chosen the events in this type of edge very carefully so that End(e) < Begin(e') is still true even if End(e) and Begin(e') are both real clock times of the reference frame, assuming the edge in question to be (e, e'). The reason I chose to mask up the real clock time of the reference frame is that it comes from a theoretical clock in Special Relativity, which might be daunting for an audience without much physics background, and using physical clock time could be more user-friendly.

This mask-up comes with a catch. The key to this mask-up is the Monotonicity Condition, which synchronizes the physical clock with Albert Einstein's clock. However, [Ki 13] indicates that in some POSIX systems the Monotonicity Condition might not hold. But the explanation in the last paragraph says even this doesn't matter, as long as we carefully choose the events in an event chain.

The application of the 'happened before' partial order to give rise to a serializability implementation for NDB Cluster is inspired by the proof for 2PL(in both standalone and distributed systems): suppose T1 → T2 to be an edge in the conflict loop; the event of releasing a conflicting lock in T1 'happened before' the event of acquiring it in T2(property of conflicting locks), which 'happened before' the event of releasing any conflicting(with next transaction in the loop) lock in T2(property of 2PL); hence the existence of a conflict loop would introduce an event loop in the 'happened before' partial order, which provides the contradiction.

This proof can be generalized to associating one event to each transaction such that a conflict loop implies an event loop in the 'happened before' partial order. In the serializability implementation for NDB Cluster, we chose the 'commit' event in each transaction for it to work out, but actually any event that 'happened after' the execution of SQL statements in a transaction(hence 'happened after' any lock acquiring event) and 'happened before' any lock releasing event could also do the job. In particular, in any system that implements two-phase locking, such an event exists.

In the proof that Google's Spanner implements a serializable isolation level in Appendix D, we further generalize this method to associate a timestamp to each transaction such that a conflict loop implies a timestamp loop. In other words, although the Serializability Theorem we apply in Google's Spanner's case still uses the 'happened before' partial order, we don't necessarily have to use it to prove that a conflict loop doesn't exist. This means the method can be further generalized to the following: associate a characteristic to each transaction such that a conflict loop implies a loop of this characteristic which is not possible.

Appendix C: More info about conflict serializability in [Be 87]

[Be 87] mainly discusses concurrency control and recovery. Since this article is about the former, we will only talk about issues in the chapters concerning concurrency control. Notice these issues don't imply incorrectness of the content of the book in general, but rather that it can't be applied to today's database applications, in which secondary key access is heavily featured. On the other hand, if a database application is based on primary key access, like those before 1987, the content remains mostly correct. So it is really NOT the book itself, but rather our applying it to a contemporary database application, that gives rise to these issues. That is exactly what Adya's paper did, and it poses a problem for any work that depends on Adya's paper, in particular Fekete's paper and the correctness of PostgreSQL's Serializable Snapshot Isolation.

The content about concurrency control in [Be 87] is discussed mainly in Chapters 2, 3, 4 & 5. We'll list the issues in that order.

Chapter 2 is about conflict serializability and its corresponding Serializability Theorem. The problem with the proof was already described earlier in the discussion of Example 0: the claim that two histories proved to be equivalent give rise to the same set of operations is assumed as a fact, without proof. This is hardly convincing if transactions have branches in their code, since which branch is taken usually depends on the database state those transactions read. In a busy OLTP system where database state is a transient concept, it would be nice to give a proof so that the claim sounds conceivable.

Also, Chapter 2 doesn't include any discussion of predicate reads or their equivalence. It defines two item operations as conflicting if they operate on the same data item and one of them is a write. So the Serializability Theorem proved there can at best deal with a database that never grows or shrinks. The book admits this at the beginning of section 3.7 when it discusses the phantom problem. It then suggests as solutions index locking, which is a special case of the granular locking I've used in this article, and predicate locking, which is not practical at all since it involves an NP-complete problem. Notice that these solutions are discussed in the context of 2PL; they help to guarantee that a history in 2PL is free of conflict loops. That is the only place the phantom problem is discussed, and the definition of conflict remains the same as at the beginning of this paragraph. In particular, no Serializability Theorem based on conflicts involving predicate reads is ever proved, not to mention that the conflict definition is not precise even for a plain data item like a tuple since, for example, WW conflicts only exist between two consecutive writes on the same item, not between two writes with a thousand writes in between. How this Serializability Theorem can give rise to a proof of the one in Adya's generic(in particular, not restricted to 2PL) framework remains a complete mystery to me.

Chapter 3 and Chapter 4 are about applying the Serializability Theorem developed in Chapter 2 to locking and non-locking implementation of a scheduler, respectively.

Chapter 5 is about another serializability flavor called view serializability. View serializability is very different from conflict serializability, so is the proof of the Serializability Theorem for it. So it doesn't help Adya's case either.

Appendix D: Survey of consistency implementations in distributed database systems

In this appendix, we will survey some interesting implementations of distributed databases so that we may apply the theory and techniques developed in this article to them. We prefer memory-based databases, but we will also examine those that are disk-based. Along the way, more interesting implementations might be added to this appendix as they show up.

D.1 Google's Spanner

This analysis is based on [Co 13]. Google's Spanner builds on top of a couple of Google's software artifacts like Colossus and Tablet. Upon those, a Paxos system is implemented to achieve eventual consistency. The leader of each Paxos group participates in a variant of Two Phase Locking(2PL) for the Read-Write transactions, while other replicas of that Paxos group provide Snapshot Isolation reads to Read-Only transactions.

In this process, a distributed timing system is necessary to provide timestamp service for Snapshot Isolation reads and other purposes. A variant of Marzullo's algorithm for distributed time described in [Ma 83] is used to synchronize clocks on different time servers, which have been set to initial values by polling a list of accurate atomic clocks and GPS clocks with the algorithm described in [Mi 81]. This time service is an interesting one since it returns an interval containing the correct time instead of just one single value when it is queried. In other words, when a node requests time service by calling TT.now(), it returns an interval [earliest, latest], where both 'earliest' and 'latest' are timestamps and the absolute time when this call happens is represented by a timestamp inside this interval. Based on this time service, external consistency is preserved: if the actual starting time of a Read-Write transaction T2 happens after the actual commit time of Read-Write transaction T1, T2's commit timestamp is greater than that of T1's. Such an implementation turns out to be serializable and the proof goes as follows.

Proof: As usual, we will try to apply type B of the Serializability Theorem and we'll prove it by contradiction. Assume a conflict loop were present in the history. For each transaction inside this loop, choose its start timestamp if it is a Read-Only transaction and its commit timestamp if it is a Read-Write one. Then for all three types of conflicts, we claim that the order of timestamps is consistent with the order of transactions and hence a loop of timestamps would arise. That is a contradiction, so a conflict loop doesn't exist.

To justify this claim, we consider two cases: a conflict in type B of the Serializability Theorem between two Read-Write transactions, and one between a Read-Write transaction and a Read-Only one. Let's call these two transactions T1 and T2 and assume that the conflict is represented as T1 → T2.

In the first case, both transactions must be executed in Paxos leaders and 2PL guarantees the following chain:


        Commit timestamp s1 for T1 

        < Absolute time of commit event(the start of second phase of two-phase commit) of T1

        < Absolute time of releasing the conflicting lock for T1

        < Absolute time of acquiring the conflicting lock for T2

        < Absolute time of requesting TT.now() when T2 receives the commit request

        <  Commit timestamp s2 for T2                                                                                                  
	  

The first inequality is implied by the 'commit wait' condition of Spanner as described in section 4.1.2 of [Co 13]. The last inequality is true because s2 is chosen to be greater than the right end point of the interval returned by TT.now() as described in section 4.2.1 of [Co 13].

In the second case, if T1 → T2 is a WR conflict, s1 < s2 is clear since the Read-Only transaction T2's reads are served as in Snapshot Isolation. If, on the other hand, T1 → T2 is a RW conflict, the fact that s1 < s2 follows from the following monotonicity invariant of Spanner as described in section 4.1.2 of [Co 13]: within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. To see that, assume s1 >= s2 were true. Then when T1 performed its reads, timestamp s2 could not yet have been written by the Paxos system, since otherwise T1 would've seen the updates by T2, which would contradict the fact that the conflict is a RW one. The only possibility is that s2 would be written later by the Paxos system. But that is not possible either: for T1 to perform its reads, s1 must be <= a safe timestamp s0 so that the replica is up-to-date for serving the read, as described in section 4.1.3 of [Co 13]; on the other hand, the monotonicity invariant of Spanner implies s0 < s2 since s0 had been written and s2 hadn't. So we have s1 <= s0 < s2, which contradicts the assumption s1 >= s2.
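
The first inequality in the chain above rests on Spanner's 'commit wait' rule. A toy sketch follows; the TT implementation here is purely an assumption (real TrueTime is backed by GPS and atomic clocks), but the interval interface and the waiting rule match [Co 13]:

```python
# Hedged sketch of TrueTime and 'commit wait'. TT.now() returns an interval
# [earliest, latest] containing the absolute time of the call; a commit
# becomes visible only after TT.now().earliest has passed the commit
# timestamp s, so s < absolute time of the commit event.

import time

EPSILON = 0.005   # assumed clock-uncertainty bound, in seconds

def tt_now():
    """Toy TT.now(): the absolute time of the call lies in the interval."""
    pt = time.time()
    return pt - EPSILON, pt + EPSILON

def commit_with_wait():
    _, latest = tt_now()
    s = latest                       # commit timestamp chosen >= TT.now().latest
    while tt_now()[0] <= s:          # commit wait: until earliest > s
        time.sleep(EPSILON / 4)
    return s                         # now s < absolute time of the commit event

s = commit_with_wait()
```

After commit_with_wait() returns, every subsequent TT.now() interval lies entirely after s, which is what justifies 'commit timestamp s1 < absolute time of commit event of T1' in the chain above.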


	                                                                                                       ## 
	  

The so-called absolute time in [Co 13] is just a synonym of the real clock time of the reference frame used throughout this article. I started looking for conflict loops in histories in Google's Spanner and couldn't find any. After a while I realized that this hybrid isolation level might be a serializable one and tried to give a proof. It was still a surprise to me after I proved it since I thought a merge of 2PL and Snapshot Isolation would only give us an isolation level at least as good as Snapshot Isolation, not necessarily the highest one. I don't know if the developers of Spanner know that their database actually implements the serializable isolation level. It is not mentioned in [Co 13].

If an application in Spanner contains datetime related fields, we might have to apply a version of the Ramification Theorem to conclude Serializability on Ram(B)', B being the set of datetime fields, since I am not sure the timestamp system they use can get rid of all the timestamp-based inconsistency.

A row in a Spanner table consists of three parts: a primary key, the commit timestamp of the row as a version number, and the data. So it is practically a key-value data-store with versions. Chances are that a variant of the proof in Example 12 can be used to show that Spanner is serializable. But I still like this proof better since it is more generic. It is a pity that Spanner didn't implement a full-blown relational database, otherwise we might see this serializability implementation's full power in action. I wish we could see that in the future.

D.2 OceanBase

OceanBase is a product of Alibaba. It seems to me it borrows a lot of wisdom from Oracle, in that it implements Snapshot Isolation a lot like Oracle's, except that it is a distributed system; and some wisdom from Google's Spanner, in that it uses a variant of the Paxos system called Raft to replicate data too. OceanBase's implementation of Snapshot Isolation differs from Google's Spanner's in that it uses a machine-generated increasing sequence, instead of a clock, to timestamp the transactions. For a standalone system, the generation of such a sequence is straightforward. But OceanBase is a distributed system, so the increasing sequence must be generalized to a global version. OceanBase deploys specific servers, the so-called Global Timestamp Service(GTS) in its nomenclature, for that purpose. GTS provides an increasing sequence for timestamp purposes for each Tenant(a synonym of database instance in OceanBase's nomenclature; it roughly corresponds to a store in their Amazon.com-like online store application called Tmall, and usually consists of one leader and two replicas located on three machines) and there are as many of these sequences as there are Tenants. Such a generalization offers about two million timestamps per second for each Tenant, which is usually much more than a store's need. Then, for transactions accessing the same Tenant only, Snapshot Isolation is used to provide consistency. In other words, although OceanBase's architecture is a distributed one, the unit that implements Snapshot Isolation is really standalone, namely, the leader of a tenant. One advantage of using an increasing sequence to provide the timestamp service is that realizing 'external consistency' becomes trivial.
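
The per-Tenant increasing sequence can be sketched as follows. The class and method names are my own, and OceanBase's actual GTS is of course replicated and fault-tolerant rather than a single in-process counter:

```python
# Toy sketch of a per-Tenant timestamp service built on a machine-generated
# increasing sequence. Because a later request always receives a larger
# timestamp, 'external consistency' within a Tenant is trivial.

import itertools
import threading

class GlobalTimestampService:
    def __init__(self):
        self._lock = threading.Lock()
        self._sequences = {}             # tenant -> independent counter

    def next_timestamp(self, tenant):
        with self._lock:                 # serialize concurrent requests
            seq = self._sequences.setdefault(tenant, itertools.count(1))
            return next(seq)

gts = GlobalTimestampService()
a = gts.next_timestamp("tenant-1")
b = gts.next_timestamp("tenant-1")       # strictly greater than a
c = gts.next_timestamp("tenant-2")       # separate sequence per tenant
```

Note the counters are independent per tenant, which mirrors the limitation discussed below: timestamps from different Tenants are incomparable, so no cross-Tenant consistency follows from them.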

This partition-like implementation of Snapshot Isolation, however, imposes a major limitation on OceanBase: there is NO consistency support whatsoever for an application that consumes more than two million timestamps per second, unless it can be partitioned into a lot of Tenants like an Amazon.com application. In other words, if consistency is important for such an application, you have to implement it at the second tier. Believe me, that usually implies heavy lifting! Google's Spanner uses a bunch of servers that are synchronized regularly to provide timestamp service and you can query any of them, so the limit should be that of the whole bunch and the danger of them becoming saturated is less severe. And the method I've developed for NDB Cluster doesn't exhibit this kind of limitation at all, since it doesn't require timestamps to achieve consistency. It seems to me OceanBase is designed for an application like Amazon.com and that is it; any other application type has to fit into a single Tenant. Such a design of course is suitable for a cloud environment since a tenant need not occupy a whole machine.

Besides this limitation of the timestamp service, OceanBase also suffers from the same problems Snapshot Isolation does, such as write-skew anomalies and cascading aborts. That said, OceanBase has survived Alibaba's annual sale on Nov. 11th(an online version of the Black Friday Sale). It also occupies first place in the TPCC benchmark list for clusters. This is impressive, especially if you are building an application like Amazon.com. Of course, if in your application SQL statements are executed upon data that span multiple data nodes(say, the data can't fit in a single data node), this benchmark result may not be meaningful for you, since OceanBase doesn't support this kind of statement naturally due to its limitation. Its timestamp service, however, doesn't require hardware provision like the GPS clocks and atomic clocks that Google's Spanner does, and hence its deployment could be more cost-efficient when it applies.

We've talked about external consistency for Spanner and OceanBase. Let's take a look at NDB Cluster. External consistency as demonstrated in Spanner and OceanBase doesn't apply to NDB Cluster since it is not a timestamp-based system. There does exist the following abnormal behavior: of two sequentially started Read-Only transactions, the first one returns a new value written by a Read-Write transaction because of a delay(for example, the thread handling this transaction got context-switched out) while the second one gets an old value. External consistency is about whether a client of the application feels right. What we emphasize in this article is making the database right. And we've built a serializable system such that consistency-based data corruption is managed, even if a client feels odd sometimes.

D.3 CockroachDB

(To BE UPDATED. PLEASE CHECK BACK IN A FEW DAYS.)

CockroachDB implements a serializable isolation level and the following analysis is based on [Tr 16].

CockroachDB is a distributed database system that uses a global clock called the Hybrid Logical Clock(HLC) as described in [Ku 14] for its serializability implementation. HLC is a combination of the logical clock provided by Lamport in [La 78] and the physical clock provided by NTP. Each timestamp is of the form (l, c), where l tracks the largest physical time(as returned by NTP) the node has seen and c is a logical counter. The primary algorithm for maintaining such a clock is given as follows:


      Initially l.j := 0; c.j := 0

      Send or local event
        l'.j := l.j;
        l.j := max(l'.j, pt.j);
        If (l.j=l'.j) then c.j := c.j + 1
            Else c.j := 0;
        Timestamp with l.j, c.j

      Receive event of message m
        l'.j := l.j;
        l.j := max(l'.j, l.m, pt.j);
        If (l.j=l'.j=l.m) then c.j := max(c.j, c.m)+1
            Elseif (l.j=l'.j) then c.j := c.j + 1
            Elseif (l.j=l.m) then c.j := c.m + 1
            Else c.j := 0
        Timestamp with l.j, c.j                                                                                                     
	

Here subscript j represents an arbitrary server participating in this clock protocol, subscript m represents the server from which the message (also represented as m) is sent, and pt represents the physical time returned by the NTP protocol. As we can see, this clock protocol stores the latest physical clock it can sense in l, which is usually enough to describe the clock's advancement; when it is not, c is there for that purpose. Notice that such a clock deviates from a traditional one, which usually assigns a single number to each event. The order between the timestamps is the lexicographic order, which is a total order. And in particular, we have
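For concreteness, the send/receive rules above can be transcribed into a short Python sketch. The class name `HLC` and the physical-clock callable `pt` are illustrative names of my own, not identifiers from [Ku 14] or CockroachDB's code; `pt` stands in for the NTP-disciplined physical clock.

```python
class HLC:
    """Minimal sketch of the primary HLC algorithm from [Ku 14]."""

    def __init__(self, pt):
        self.pt = pt   # callable returning the current physical time
        self.l = 0     # latest physical time sensed
        self.c = 0     # logical counter, breaks ties within one l

    def send_or_local(self):
        # Send or local event: advance l to max(old l, physical time);
        # bump c only if l did not move.
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, l_m, c_m):
        # Receive event of a message timestamped (l_m, c_m).
        l_old = self.l
        self.l = max(l_old, l_m, self.pt())
        if self.l == l_old == l_m:
            self.c = max(self.c, c_m) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_m:
            self.c = c_m + 1
        else:
            self.c = 0
        return (self.l, self.c)
```

Note that Python tuples compare lexicographically, so the `<` of the comparison rule below is just tuple comparison on `(l, c)`.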


      (l1, c1) < (l2, c2) iff l1 < l2 or l1 = l2 and c1 < c2.                                                                                                    
	

With e and f representing events in the system, this clock demonstrates the following nice properties:

  1. e 'happened before' f => (l.e, c.e) < (l.f, c.f).
  2. l.e is close to pt.e, i.e., |l.e – pt.e| is bounded by Epsilon, where the real clock time lies within the interval [pt.e – Epsilon/2, pt.e + Epsilon/2].
  3. (l, c) has O(1) computational complexity and space requirement.

Remark: For the “happened before” partial order in [Ku 14], we can't just use the one described in [La 78] since [La 78]'s is for a system with single-threaded processes and today's systems are almost always multi-threaded. We should use the variant where events inside a thread are represented as points in time and two events connected by an inter-thread message are interpreted as a type II edge, as mentioned in Appendix B. If we were to use the variant 'happened before' as in the rest of this article, some of the results in [Ku 14], like Theorem 4, may not be easily proved. Notice that using this “happened before” variant doesn't affect the correctness of the serializability theorems since it is only used in the timestamp service of the system; for the serializability theorems, we still use 'happened before' as in the rest of this article. This is a perfect example of how we may interpret the temporal order in a distributed system in different ways mathematically.

Property 1, the so-called 'Clock Condition' in [La 78], is extremely important, for example, if we were to use it to replace the expensive distributed timing service in Google's Spanner to generate a consistent snapshot for the 'optimistic' component.

Property 2 suggests that this clock heavily depends on the physical clock provided by NTP. On the one hand, this means things like violations of external consistency become less visible to clients if NTP time is synced to be close to real clock time; on the other hand, it also implies that if NTP synchronization is seriously off, it might impact this clock protocol.

For example, if the topology of the network allows a server to be disconnected from an NTP source but remain connected to other participants of the clock protocol, this server's physical time will drift away from NTP time if the clock onboard this server is skewed. [Ku 14] claims that it can handle unstable situations like this with two supplementary rules to the primary algorithm: reset the participant's l to pt (the time provided by the physical clock onboard the server, not NTP time) and start again, and ignore out-of-bounds messages. With the first rule, [Ku 14] claims that in the case of extreme clock errors by NTP or transient memory corruption, HLC can be corrected once the physical/NTP clock stabilizes; with the second rule, [Ku 14] not only ignores out-of-bounds messages, but also increases c by 1 when that happens. Both rules deviate from the primary algorithm and the question is whether the properties still hold with these modifications. [Ku 14] doesn't contain a justification for this.

What is more, in property 3, the c value is actually bounded from above by a function of Epsilon. It would be desirable for Epsilon to remain constant while the clock protocol is operating. For example, we would probably allocate a large enough integer variable for c according to this upper bound; if Epsilon fluctuates, the variable might overflow. An erroneous state of the NTP protocol might give us exactly that.

The developers of CockroachDB seem to agree with me on this. They deal with it with the following rule: when a node detects that its clock is out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset allowed, it crashes immediately. This approach adheres to the primary algorithm at all times, although it might imply the algorithm operates with a larger Epsilon sometimes (it really depends on the relation between Epsilon and the maximum offset). This implies that any property that doesn't involve Epsilon, in particular property 1, remains true. So if we were to use HLC for an implementation of Serializable Snapshot Isolation or Spanner's consistent snapshot, both would remain correct under unstable situations like drifted-away servers, since the situation is equivalent to a clock protocol operating with a larger Epsilon. And the issue described in the last paragraph will also go away if we choose the variable for c based on the maximum offset instead.
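The crash rule just described amounts to a simple majority test over per-peer clock offsets. The sketch below is my own illustration of that test, not CockroachDB's actual code; the function name and the offset list are hypothetical.

```python
def should_crash(peer_offsets, max_offset):
    """Return True if this node is out of sync with at least half of its
    peers by 80% of the maximum allowed clock offset.

    peer_offsets: measured clock offsets (seconds) to each other node.
    max_offset:   the cluster's configured maximum clock offset.
    """
    threshold = 0.8 * max_offset
    out_of_sync = sum(1 for off in peer_offsets if abs(off) > threshold)
    return out_of_sync >= len(peer_offsets) / 2
```

With a 1-second maximum offset, a node seeing offsets of 0.9 s to two of its three peers would crash, while a node with only one such peer would stay up.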

There are two consequences of CockroachDB's approach. One is that anomalies like violations of external consistency will show up more often when the clock protocol operates with a larger Epsilon. The other is that if HLC co-locates with data, bringing down the server could render that data unavailable (unless there is a replica of the data, as in NDB Cluster). This could further compromise the semantics of the deployed application (for example, if the average of a field is computed system-wide in the application, the result might be biased if part of the data is unavailable). Notice that Google's Spanner is free from this issue since it uses dedicated servers for its TrueTime service.

Now let's examine CockroachDB's serializability implementation. In the arena of concurrency control, it turns out that besides 'optimistic technology' and 'pessimistic technology', there is a third species called 'timestamp-based technology'. CockroachDB's serializability implementation is a typical example of it, and of course the timestamps in it are provided by HLC.

When a transaction starts, it is assigned a timestamp. Every operation in that transaction is considered to take place at that timestamp. The following three rules handle WR, RW and WW conflicts. Let's assume the conflict in question is from transaction T1 to T2, or T1 → T2, as usual, and the associated timestamps are t1 and t2 respectively. And when we say a read, we refer to either an item read or a predicate read.

WR conflicts: CockroachDB implements a multi-version database. When a read happens in T2, the system returns the most recent version with a timestamp t < t2. Hence whenever there is a WR conflict between T1 and T2, t1 < t2.

RW conflicts: Upon any read operation, the timestamp is recorded in a node-local timestamp cache. When queried, this cache returns the most recent timestamp at which the key was read. All write operations query the timestamp cache for the key they are writing. If the returned timestamp is greater than or equal to the timestamp of the write operation, the writing transaction is aborted and will be restarted with a later timestamp. Now suppose t1 >= t2. Then when T1 performed its read, T2 hadn't performed its write yet, since otherwise T1's read would see a version greater than or equal to that established by T2, contradicting the fact that the conflict is a RW one. When T2 performed its write after T1's read, the cache would return at least t1. This would cause T2 to abort and be restarted with a timestamp larger than t1, a contradiction. Hence if the conflict between T1 and T2 is a RW one, we also have t1 < t2. Notice that in a RW conflict, the read doesn't necessarily 'happen before' the write; one can find examples like this easily in Snapshot Isolation.

WW conflicts: If a write operation attempts to write to a key, but that key already has a version with a timestamp greater than or equal to that of the operation itself, the transaction of this operation is aborted and will be restarted with a later timestamp. Again if the conflict between T1 and T2 is a WW one, we have t1 < t2.

Notice that in the WR case, the read refers to both item reads and predicate reads. In the RW case, the timestamp cache is an interval cache and the key range involved in a read is recorded, hence both item reads and predicate reads are taken care of. We've just demonstrated that a history in this system is free of conflict loops since < is a strict partial order. So as long as CockroachDB interprets predicate-based conflicts correctly, as in the Serializability Theorems, this should provide a serializability implementation.
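The three rules can be condensed into a toy single-node, single-key store. This is only a sketch of the rules as stated above; the class and exception names are mine, and real CockroachDB keeps an interval-based cache, runs distributed, and restarts rather than merely aborting.

```python
class Abort(Exception):
    """Transaction must restart with a later timestamp."""

class Store:
    def __init__(self):
        self.versions = {}    # key -> list of (ts, value), appended in ts order
        self.read_cache = {}  # key -> most recent read timestamp

    def read(self, key, ts):
        # WR rule: return the most recent version with timestamp < ts,
        # and record this read in the timestamp cache.
        self.read_cache[key] = max(self.read_cache.get(key, -1), ts)
        older = [(t, v) for t, v in self.versions.get(key, []) if t < ts]
        return older[-1][1] if older else None

    def write(self, key, value, ts):
        # RW rule: abort if the key was already read at a timestamp >= ts.
        if self.read_cache.get(key, -1) >= ts:
            raise Abort
        # WW rule: abort if a version with timestamp >= ts already exists.
        vs = self.versions.setdefault(key, [])
        if vs and vs[-1][0] >= ts:
            raise Abort
        vs.append((ts, value))
```

A write at timestamp 15 after a read at timestamp 20 raises `Abort`, mirroring the starvation scenario discussed below: every fresh read pushes the cache forward and dooms any slower writer.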

Unfortunately CockroachDB, like PostgreSQL, also resorts to [Be 87] for its serializability proof, which has been demonstrated to be unsuitable for modern workloads where secondary key accesses are ubiquitous. Of course, it is up to CockroachDB to clarify this.

There is another issue with this serializability implementation, related to the rule about RW conflicts. It could actually result in a starvation situation for a write operation if too many reads are intended for the same data item. Consider the following situation: a Read-Write transaction is computationally intensive (hence long-running) and the result is updated to a field (say, we are computing the Nasdaq index and we need to query a lot of fields and do some serious computation on them), and a lot of short-lived Read-Only transactions constantly flood in to read this field since it is critical for business. This way, chances are that when the Read-Write transaction finishes, the cache already records a lot of reads of that field with timestamps larger than the Read-Write transaction's, and it has to be rolled back and restarted. The semantics of this application is certainly for this field to be updated as soon as it is available, and this starvation situation does exactly the opposite. Notice this starvation situation is worse than the case where a write lock waits for a bunch of read locks, since in the latter case the write will eventually be served. This is not a surprise, since the rule for RW conflicts basically says that a read with a later timestamp has higher priority than a write on the same item with an earlier timestamp. Usually we want an update to be reflected in the database ASAP. CockroachDB sacrifices this for serializability, but it turns out there is a price.

Remark: The first time I encountered a timestamp-based serializability implementation was in [Si 02], when I explored technical paths to achieving serializability more than a decade ago. Compared with that, there are two advancements in CockroachDB's implementation. The first one, of course, is the adoption of the distributed timestamp service HLC so that serializability can be discussed in a distributed database system. The second one is the use of a multi-version system, so that a transaction with an earlier timestamp than that of the field it's trying to read doesn't have to be aborted. The starvation issue was also present in [Si 02]'s implementation. I couldn't resolve it, and that was one major reason I turned to 'optimistic technology' and 'pessimistic technology' for a solution. Today I still cannot. I am open to any suggestion that resolves it to make this implementation more sensible.

D.4 TiDB

I'd like to explicitly thank 'Hazel' and her colleague from TiDB for answering my questions about TiDB's timestamp service to begin this subsection.

TiDB is a distributed, disk-based database system. TiDB claims it implements two isolation levels: Repeatable-Read and Read-Committed. It actually implements three: with the isolation level set to Repeatable-Read and the system configuration variable tidb_txn_mode set to 'optimistic', we get a Snapshot Isolation implementation; with the isolation level set to Repeatable-Read and tidb_txn_mode set to 'pessimistic', we get a Repeatable-Read isolation level that is similar to MySQL InnoDB's; with the isolation level set to Read-Committed and tidb_txn_mode set to 'pessimistic', we get a Read-Committed isolation level that is similar to MySQL InnoDB's.

For the audience of this article, Snapshot Isolation may not be interesting; but the distributed clock used to implement it is. We'll explore it in detail here. Let's start with TiDB's architecture.

The architecture of TiDB is described here. There are three kinds of servers in a TiDB cluster: TiDB servers, Placement Driver (PD) servers and TiKV servers. A TiDB server is like a MySQL server in an NDB Cluster, serving as an API server to interface with the clients. The PD servers serve as the heart of the cluster and the collection of them is like the NDB kernel in an NDB Cluster; among other things, the timestamp service of the distributed clock is provided by them. The TiKV servers are responsible for storing data, just like the data nodes in an NDB Cluster.

A TiKV server actually uses RocksDB to store its data to local storage. RocksDB is a key-value store, hence the name TiKV for such a server. Although RocksDB is a key-value store, TiDB is a full-blown relational database. Every row in a TiDB table is transformed into a key-value pair before it is stored in the underlying RocksDB. For example, in the primary index of a table with four columns col1, col2, col3 and col4, a row is represented as


      key:    tablePrefix{TableID}_recordPrefixSep{RowID}
      value: [col1, col2, col3, col4]                                                                                               
	

in RocksDB. A similar scheme applies to secondary indices. Details can be found here.

TiDB also provides high availability with a variant of Paxos system called Raft, just like OceanBase does. The timestamp service, for example, is located in the leader of a Raft group formed by PD servers. This timestamp service is used to provide two timestamps for each transaction to signify the start and commit of it so that Snapshot Isolation can be implemented. This timestamp service is the most interesting part of this Snapshot Isolation implementation. Let's see how it overcomes OceanBase's issue of NOT being able to provide sufficient number of timestamps to the whole cluster with an increasing sequence.

The timestamp service in TiDB is called Timestamp Oracle(TSO). The object for defining it is as follows:


      type tsoObject struct {
        physical time.Time
        logical int64
        …
      }                                                                                          
	

Here 'physical' represents a Unix timestamp in milliseconds, which is 46 bits long, and 'logical' is an 18-bit number trying to provide as many as 2^18 timestamps per millisecond. When a client requests a timestamp, what it gets is a pair of integers of the form (physical, logical). The component responsible for responding to such a request is called an allocator and it resides in the leader of the Raft group. When the physical field is NOT updated, the allocator simply returns timestamps with that fixed physical field and increasing logical fields upon requests. The update of the physical field also generates an increasing sequence on its own, and it is the most intricate part of the algorithm. The operation of this part of the timestamp service is explained here.

We need four variables for this part of the algorithm:


      Tlast: The physical field stored, on disk or in the TSO object.

      Tnext: Next value for the physical field.

      Tnow: Time returned by the Wall clock.

      Tx: Duration of each round of timestamp service, defaults to 3 seconds.                                                                                  
	

A Raft group leader for the PD servers is elected at the cluster's initial start or re-elected after the previous leader crashes. This new PD Leader then starts the calibration of the physical time. It first retrieves the Tlast persisted by the previous PD Leader from disk. This Tlast value represents the upper bound of the previous round of timestamp service for the physical field, and hence every timestamp serviced before the crash got a smaller physical field than Tlast. Then the new PD Leader compares Tlast with Tnow:


      If Tnow - Tlast < 1 ms, then Tnext = Tlast + 1; otherwise Tnext = Tnow.                                                                                 
	

Then Tnext is stored in the physical field of the TSO object and this finishes the calibration of the time service. Before the allocator starts to service timestamps, Tnext + Tx is stored to disk through Raft, so that it can survive the next crash and again be retrieved as Tlast upon restart. Also, all timestamps in the time period [Tnext, Tnext + Tx) (3000 x 2^18 of them) are generated in memory. Then the allocator starts shoveling out timestamps upon request.

Every 50 ms, the algorithm tries to advance the physical field of the TSO object so that the service won't run out of the 2^18 allotment for the logical field. This process is called a Fast-Forward and it works like this:

It first calculates JetLag := Tnow – Tlast, where Tlast is the value retrieved from the TSO object. If JetLag > 1 ms, the physical time in the hybrid logical clock is slower than the current physical time and needs to be fast-forwarded so that Tnext = Tnow.

In addition, when the logical field reaches its threshold value (2^18) during timestamp service, the service stops and waits for the next Fast-Forward. To prevent this situation from happening, the Fast-Forward process also checks whether the current logical field exceeds half of its threshold value; if it does, the physical field in the TSO is advanced as well. In this case, Tnext = Tlast + 1.

Then Tnext is stored in the physical field of the TSO, its logical field is set to 0, and the timestamp service resumes after the Fast-Forward.

In the Fast-Forward process, the logic also checks whether Tnext is too close to Tlast, the value stored on disk. If it is, a new Tlast = Tnext + Tx is generated and stored on disk so that the timestamp service can continue. The source code for the allocator and TSO is here and here (mainly in the syncTimestamp() and updateTimestamp() functions).
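The calibration and Fast-Forward steps described above can be sketched compactly. Variable names mirror the text, not the actual pd source; times are milliseconds as integers, and the 1 ms comparisons follow the rules quoted earlier.

```python
LOGICAL_BITS = 18
MAX_LOGICAL = 1 << LOGICAL_BITS   # 2^18 logical timestamps per millisecond

def calibrate(t_last, t_now):
    """New-leader calibration: the physical field must move past the
    persisted bound Tlast even if the wall clock hasn't."""
    return t_last + 1 if t_now - t_last < 1 else t_now

def fast_forward(physical, logical, t_now):
    """One Fast-Forward pass; returns the new (physical, logical) pair."""
    jet_lag = t_now - physical
    if jet_lag > 1:
        # Physical field lags the wall clock: jump to Tnow.
        return t_now, 0
    if logical > MAX_LOGICAL // 2:
        # Logical field more than half-exhausted: advance physical by 1 ms.
        return physical + 1, 0
    return physical, logical      # nothing to do
```

For instance, a leader restarting at the same millisecond recorded on disk still advances: `calibrate(100, 100)` yields 101, so no timestamp repeats across the crash.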

From this description of the timestamp service algorithm, we can see that an allocator hands out timestamps in an increasing order. So naturally we would like to compare it with OceanBase's timestamp service, which provides an increasing sequence via a thread for each tenant in the cluster. Each second, TiDB's timestamp service can generate 1000/50 x 2^18 = 5.2M timestamps, which is more than twice that of OceanBase's. Besides pre-allocating all the timestamps in memory, TiDB also services timestamps in batches. These are two reasons why TiDB's timestamp service comes out ahead.

The 50 ms duration between two consecutive Fast-Forwards is the default value of the 'tso-update-physical-interval' configuration parameter, which can be set to a minimum value of 1 ms for more timestamps at the cost of 10% more CPU usage. So each second, the timestamp service can theoretically generate a maximum of 10^3 x 2^18 = 2.6 x 10^8 timestamps. If this theoretical limit can actually be reached, the timestamps generated are two orders of magnitude more than OceanBase's. However, the default value could be an empirical sweet spot, so please make sure you test it out if you want to alter it.
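The arithmetic in the last two paragraphs is worth writing out once:

```python
per_window = 1 << 18                    # logical timestamps per Fast-Forward window

rate_default = (1000 // 50) * per_window  # 50 ms interval: 20 windows/second
rate_min     = 1000 * per_window          # 1 ms interval: 1000 windows/second

max_tps = rate_default // 2               # each transaction needs two timestamps
```

This gives 5,242,880 timestamps per second at the default interval (about 5.2M), 262,144,000 at the 1 ms minimum (about 2.6 x 10^8), and hence roughly 2.6 x 10^6 tps with the default setting.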

With the default setting of 'tso-update-physical-interval', the maximum throughput a TiDB cluster can support is about 2.6 x 10^6 tps, since each transaction requires two timestamps. If this is not enough for your deployment, TiDB supports a service mode that is similar to OceanBase's: it may provide a key space, which is like a tenant in OceanBase, for each application; a TSO microservice (with one allocator) is split from the PD services to service each key space group; in the extreme case, there could be only one key space in such a group and this replicates the OceanBase case – one increasing sequence of timestamps for each tenant. This is a new feature that is available starting with version 8.0.0.

With this distributed clock, TiDB provides a Snapshot Isolation implementation for its OLTP workloads. However, this distributed clock doesn't satisfy the Clock Condition and hence we can't apply the work in Section 5 to it directly to achieve serializability, although we've come to the conclusion that we may also apply type B of the Serializability Theorem to generalize Serializable Snapshot Isolation to a distributed Snapshot Isolation implementation with tuple versions there.

The fact that it doesn't satisfy the Clock Condition is easy to observe. Suppose in node A, transaction T1 updated a tuple and sent a request to node B to perform a read, then requested the commit timestamp s1; meanwhile, T2 in node B requested the commit timestamp s2 after node B had serviced T1's read. If the Clock Condition were honored, we should have s1 < s2. But since TiDB uses batching to request timestamps, if the batch in node A was empty when s1 was requested while node B's batch was almost full, then s1 > s2 is possible, since node A's batch could be sent out much later than node B's. Even if we could turn off timestamp request batching, a similar situation could still arise because TiDB supports geo-distributed clusters and the WAN introduces a lot of uncertainty, including node A's request arriving late at PD.

So if we would still like to apply the work in Section 5 to conclude serializability in TiDB, the proof of Theorem 2.1 in [Fe 05] in subsection 5.1 must be done again with TiDB's timestamp service. Fortunately, there are only two places we need to modify in that proof.

The first one is that we used the Clock Condition to conclude that T's start timestamp < T's commit timestamp for every transaction T in the proof. In TiDB this can be seen from its optimistic transaction model here: both timestamps are requested by the TiDB server that is coordinating the transaction, and the commit timestamp is requested after the start timestamp's arrival; hence the returned start timestamp will be smaller than the commit one, since TSO returns timestamps in an increasing order.

The other one concerns the WW conflict case in the proof of Lemma 2.2 in [Fe 05] and I am quoting the arguments in subsection 5.1 here:

'If the conflict between T1 and T2 is a WW one, by First-Committer-Wins rule or First-Updater-Wins rule, we have timestamp of T1's start < timestamp of T1's commit(Clock Condition) < timestamp of T2's start < timestamp of T2's commit(Clock Condition). Notice the second inequality can't be the other way around(timestamp of T2's commit < timestamp of T1's start) since then we would have the following timestamp loop: timestamp of T2's commit < timestamp of T1's start < timestamp of T1's conflicting write < timestamp of T2's conflicting write < timestamp of T2's commit by the Clock Condition.'

In the TiDB case, the inequality timestamp of T1's commit < timestamp of T2's start can be shown by the logic of write conflict resolution here: when transaction T2 tries to commit, it checks whether the condition data.commit_ts > T2.start_ts is true; if it is, some other transaction has committed a version with a larger timestamp and T2 is rolled back. So if T2 is not rolled back, we have data.commit_ts < T2.start_ts (equality is ruled out since TSO returns timestamps in a strictly increasing order); and of course, data.commit_ts belongs to transaction T1.

Notice that although TSO also uses physical time in its design, it seems the caveats about using physical time discussed thoroughly in Appendix B and in the subsections about OceanBase and CockroachDB in Appendix D are masked by the logical part, at least for the key property that TSO returns timestamps in an increasing order, which is crucial for the previous arguments to go through.

Now that we've proved Theorem 2.1 in [Fe 05] with the timestamp service of TiDB, we may apply it to conclude that if the DSG of an application is free of 'dangerous structure', it is serializable. In the case of TPCC, from the analysis in Example 16, we may sketch the DSG as in figure 13 of [Fe 05] and see that there are two possible 'dangerous structure's.

The first one comes from the following RW conflicts: the first or fourth RW conflict in the s → n case, and the second RW conflict in the n → n case. However, this 'dangerous structure' is not real. That is because for the RW conflict in the n → n case to be possible, the two new-order transactions must also update the s_quantity field of the same tuple. This would push them apart and prevent them from being concurrent under Snapshot Isolation, and the possible RW conflict would disappear too. Notice that in a Read-Committed implementation like NDB Cluster's, it is possible for these two transactions to be concurrent.

The second one comes from the following RW conflicts: one of the three RW conflicts in the o → d2 case, and one of the first three RW conflicts in the d2 → d2 case. But from the discussion of the d2 → d2 case, we know that the latter RW conflict doesn't exist under Snapshot Isolation, in particular TiDB's. Hence this 'dangerous structure' is not real either. Again, in a Read-Committed implementation like NDB Cluster's, it is possible for these two transactions to be concurrent. So now we can conclude serializability for TPCC under TiDB's Snapshot Isolation.

In case 'dangerous structure's do exist, we need to use techniques like materialization or promotion to eliminate them. Please refer to [Fe 05] for an exposition of these techniques. Notice that joins need to be rewritten as equivalent statements, as in the serializability implementation I've developed for NDB Cluster, before conflict analysis.

We may also apply the method I've developed in Section 1, with materialization and promotion, to achieve serializability for this specific Snapshot Isolation implementation, since it can be viewed as a generic Read-Committed isolation level. But it is apparently inferior compared with Fekete's method, so I won't elaborate on it any more.

And of course you may apply the Generalized Serializability Theorems to enhance performance while guaranteeing consistency of the underlying database, if inconsistencies in some reads are OK; in that case, please use the conflict_parser helper program to get you started analyzing the conflicts. Or you may use the Ramification Theorems to constrain inconsistencies to Ram(B). Please refer to Example 21 for a demonstration of the calculation of Ram(B).

That a transaction under TiDB's Snapshot Isolation reads the most recently committed tuple version is guaranteed by the following argument. Consider a transaction T1 reading at timestamp s1 and a transaction T2 that committed at timestamp s2 < s1; we will show that T1 sees T2's writes. Since s2 < s1, we know that the TSO gave out s2 before or in the same batch as s1; hence, T2 requested s2 before T1 received s1. We know that T1 can't do reads before receiving its start timestamp s1 and that T2 wrote its locks before requesting its commit timestamp s2. Therefore, the above property guarantees that T2 must have written all its locks before T1 did any reads; T1's Get() will see either the fully committed write record or the lock, in which case T1 will block until the lock is released. Either way, T2's write is visible to T1's Get(). These arguments can be found here. Notice that TiDB provides this guarantee with the following key properties of the system: monotonicity of the timestamp service, replication of locks for optimistic transactions by the Raft system, and reads waiting for locks to be released.

Durability of consistency is guaranteed by Theorem 6 since durability is provided by Raft.

Unfortunately, we can't mount a serializability implementation at the 2nd tier onto the two pessimistic isolation level implementations, because a tiny piece of the puzzle is missing. In the successful mounting of a serializability implementation onto an NDB Cluster, the key part is the proof of Theorems 1 & 2. If we choose the commit event to be one that is after the execution of all SQL statements, but before the 2PC described here, we can see that the proof of Theorem 1 and the 'If' part of Theorem 2 go through in TiDB's pessimistic isolation level implementations, but not the 'Only If' part. The 'Only If' part depends on Example 14, and in the code snippet similar to the following one there, the locking read in T1 must block on T2's insert in NDB Cluster for the arguments to be correct.


                              T1                                                      T2

                                                                                start transaction;

                                                                                insert into t values (2,2,2,2);

      start transaction;

      select * from t where id1>0 and id1<4 for update;                                                                           
	

But in TiDB, T1 doesn't block on T2 and the arguments in the 'Only If' part of Theorem 2 can't go through. Let's see a possible scenario in TiDB where this behavior would render the 'Only If' part incorrect. Suppose T1 and T2 are defined as follows:


                  T1

      start transaction;

      acquire l1;
    
      select * from t where id1>0 and id1<4 for update;

      commit;
    
                  T2                                 

      start transaction;

      insert into t values (2,2,2,2);

      acquire l2;
                                                             
      commit;                                                                            
	

Here l1 and l2 are conflicting locks as in Theorem 2.

Consider the situation when T2 is committing and 2PC starts to release the locks acquired. In this releasing process, there is a chance that l2 has already been released but the write lock for the insert hasn't. If this state lasts long enough, T1 may acquire l1 and start the execution of the locking read. Since the insert doesn't block this locking read, it will read the database state before the insertion. So we have a predicate RW conflict between T1 and T2 and yet l2 → l1. This refutes the 'Only If' part of Theorem 2.

Finally, let's round out this subsection by comparing TiDB with NDB Cluster in the following table:

                                              TiDB                                    NDB Cluster
      Architecture                            Share-nothing                           Share-nothing
      In-memory?                              No                                      Yes, but some columns can be placed on disk
      SQL dialect                             MySQL                                   MySQL
      Consistency guarantee                   Snapshot Isolation, Read-Committed,     Read-Committed
                                              Repeatable-Read
      High availability                       Raft                                    Synchronous replication
      Hotspot alleviation                     Region balancing*                       Hashing
      Gaps destroyed?                         Partially**                             Totally by hashing
      Primary-secondary replication?          Yes                                     Yes
      Bidirectional or circular replication?  Yes                                     Yes
      Geo-replication                         Intra-cluster                           Inter-cluster
      Durability?                             Yes, with Raft                          No, if system-wide crash strikes

* Region balancing is explained here.

** TiDB currently doesn't implement gap locking. But gaps within the same region are not destroyed.