Dynamic Finite Versioning: An Effective Versioning Approach to Concurrent Transaction and Query Processing

Kun-Lung Wu, Philip S. Yu and Ming-Syan Chen

IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598

Abstract

Severe data contention may exist between short update transactions and long read-only queries if both are to access the same database through a conventional concurrency control method. In this paper, we present dynamic finite versioning (DFV) schemes to effectively support concurrent processing of transactions and queries. Without acquiring locks, queries read from a small, fixed number of dynamically derived, transaction-consistent, but possibly slightly obsolete, logical snapshots of the database. Transactions, on the other hand, access the most up-to-date data in the database without data contention from queries. Intermediate versions created between snapshots are automatically discarded. Furthermore, dirty pages updated by active transactions are allowed to be written back into the database before commitment (i.e., the STEAL policy), and at the same time, consistent logical snapshots can be advanced automatically without quiescing the ongoing transactions or queries.

1 Introduction

Concurrent processing of transactions and queries has become increasingly important as more and more decision-makers depend on up-to-the-minute information from their database systems [1]. In this paper, we examine the problem of effectively supporting concurrent transaction and query processing. We assume that only transactions can update the database, while queries are read-only actions. In general, queries tend to be ad hoc and take significantly longer to execute than update transactions. Moreover, queries may not need to access the most up-to-date data in the database, as long as they access a transaction-consistent version that is not substantially obsolete. Severe data contention may exist between transactions and queries if both are to access the same database concurrently through a conventional concurrency control mechanism, such as two-phase locking [2, 3]. To address this issue, there are typically three approaches.

The first approach compromises serializability for queries; an example is the cursor stability used in IBM's DB2, where the read locks acquired by queries are given up as the cursor moves from one page to the next. Queries always read committed data, but they may not be serializable with update transactions (degree 2 consistency [4]). Another example is the notion of epsilon serializability [5, 6]. Unlike cursor stability, epsilon serializability allows queries to read uncommitted data (degree 1 consistency [4]). However, the amount of inconsistency is bounded and automatically controlled by the system [5]. The other two alternatives provide serializable execution for queries. The second approach physically maintains two separate databases, one for transactions and the other for queries. Storage requirements are doubled, and extra costs are incurred to periodically copy data from the transaction database to the query database. The high cost associated with copying may force queries to read a substantially obsolete database. The third approach maintains multiple versions of data items, and is generally called multiversioning [7] (also called transient versioning [1]). In multiversioning, data contention between transactions and queries can be reduced or eliminated: transactions create a new version of a data item upon each update while queries access an old, but transaction-consistent, version [8, 9, 7, 10, 11, 12, 13, 14]. Depending on the total number of versions maintained for each data item, most existing multiversioning schemes can be classified into two categories: those with two versions and those with an unrestricted number of versions. Two-version approaches [15, 16] reduce, but do not eliminate, data contention between transactions and queries because only one old version is maintained. On the other hand, at the expense of high storage overhead and complex version management, approaches using an unrestricted number of versions can eliminate data contention. Recently, various versioning schemes maintaining only a fixed number of versions have been independently proposed to address the problems of storage overhead and version management complexity

associated with an unrestricted number of versions [10, 17, 18, 19, 20]. In this paper, we present a general class of such fixed versioning schemes, referred to as dynamic finite versioning (DFV), to effectively support concurrent transaction and query processing [20]. (A detailed comparison and contrast among these different schemes is provided in Section 5.3.) DFV eliminates data contention by maintaining only a small, fixed number of versions. Transactions access the most up-to-date data in the database; queries, on the other hand, may read a slightly out-of-date, but transaction-consistent, logical database snapshot. A small number of transaction-consistent logical snapshots are dynamically derived, not physically copied, to provide timely access to data for queries. Intermediate versions of database items created between snapshots can be automatically discarded, thus reducing the storage overhead. Moreover, there is no forced abort of extremely long-running queries due to early garbage collection to make room for transaction updates. The maximum number of versions maintained for each data item in DFV is M + 2, where M = 2, 3, ... is the number of logical snapshots used. M can be fixed and under the direct control of the database administrator. Notice that only when a data item is updated frequently are M + 2 versions of it needed. If a data item is not updated for some period of time, as few as one single version of it needs to be maintained. Therefore, the average storage cost is quite small for a general workload environment where transaction accesses are skewed, i.e., a small portion of the database is updated by a large number of transactions. In general, increasing the number of snapshots M presents a trade-off between storage overhead and query obsolescence [17, 21]. For details on the trade-off as M increases and a performance analysis of DFV, readers are referred to [17]. In this paper, we focus mainly on the design issues of DFV. One major contribution of DFV is its unique design. In DFV, dirty pages are allowed to be written back to the database before commitment (called the STEAL policy [22]), and at the same time, consistent logical snapshots can be advanced without quiescing the ongoing transactions or queries. (Most commercial database systems, such as IBM's DB2 [23] and IMS [24], support the STEAL policy.) This is achieved mainly by maintaining separate data structures for the time-invariant "footprints" of transactions on the data items that they have created and for time-varying global state information in a small number of variables in memory. Efficient algorithms are developed using these separate data structures to dynamically identify the appropriate version when a data item is referenced, enabling automatic snapshot advancement and simultaneously reducing the complexity of version retrieval and garbage collection. One distinct feature of this implementation of DFV is that any of the snapshots, not necessarily the oldest one, can be advanced when there

is no query reading it. This feature can substantially reduce both the storage cost and obsolescence, especially for a large M [17]. The paper is organized as follows. Section 2 outlines the basic ideas of a DFV scheme using two snapshots. Section 3 presents the design details of DFV. Section 4 discusses other design trade-offs. Before summarizing the paper, we discuss related work in Section 5.

2 The basic ideas of DFV

We assume in this paper that the transaction/query manager, or scheduler, knows whether a request is from an update transaction or a read-only query. A concurrency control mechanism, such as two-phase locking, is assumed to maintain a serializable order for different update transactions. However, there is no synchronization between queries and transactions, or among different queries; i.e., queries acquire no locks at all. We assume that a page is the granularity of locking by transactions and also the granularity of version management. (The locking granularity in IBM's DB2 can be either a page or a table space, which contains many pages [25].) However, DFV schemes are also applicable when the granularities of locking and version management are both a record. (If the DFV scheme is applied at the record level, the storage overhead can be further reduced. On the other hand, more complex bookkeeping is required and sequential scan becomes less efficient, as queries have to find the appropriate version of each record one by one.)

In DFV, M ≥ 2 logical snapshots can be maintained, and at most M + 2 versions may be needed for each page. Unless otherwise mentioned, M = 2 is assumed for the description of DFV. Since physically copying the before-image when a page is updated may incur a significant overhead [3], DFV does not overwrite the most recently committed version of a page, avoiding such an overhead. Thus, the most recently committed version of each page always exists at any moment. Furthermore, because we allow the STEAL policy for transaction updates, a working version, representing an uncommitted update, may also exist for some pages. Therefore, together with the two query snapshot versions, at most four versions of a page can be maintained in DFV. However, the two snapshot versions are only logical and may not require physical space at all.

With the possible existence of some working versions in the database, it can be difficult to efficiently establish a transaction-consistent logical snapshot at any moment. One possible approach is to quiesce the transactions: first stop initiating any new transaction and then wait until all the ongoing transactions are committed, including forcing all the updates from buffer memory into the database. Alternatively, we can mark the working versions created since the last snapshot, and then change their marks when they become committed versions.

Since these working versions may scatter around the database, changing their marks involves expensive re-accesses to the database. Both approaches incur a high cost that may be undesirable or even intolerable for most high-performance database systems.

We present a unique DFV design so that a logical snapshot can be established efficiently at any time. Associated with each physical version, we maintain the time-invariant footprint of a transaction. Unlike the marking approach mentioned above, the footprint never requires any change once created. Additionally, we keep a list of the IDs of currently active transactions; an ID is removed from the list once the transaction is committed. Removing an ID from the active list is equivalent to implicitly turning the working versions created by that transaction into the most recently committed versions. Since IDs can be removed at any time, this list can change at any moment. Establishing a consistent logical snapshot now becomes as simple as making a copy of the active list of IDs and recording the current time. Neither a quiescence of transactions nor a costly re-access to the database is required.

In order to simplify version management and garbage collection, one of the existing physical versions, if there are multiple ones, may need to be overwritten when a page is updated. By comparing the footprint information associated with each physical version with the global state information, we can dynamically identify the appropriate version for a transaction overwrite. As a result, intermediate versions created between snapshots can be automatically discarded. Additionally, we dynamically derive two transaction-consistent logical snapshots, denoted as QS0, the old one (called the previous query snapshot), and QS1, the recent one (called the current query snapshot). Arriving queries access the recent snapshot QS1. When the last query accessing the old snapshot QS0 completes, a query snapshot advance (QS-advance) is triggered as follows:

QS0 ← QS1; QS1 ← a new snapshot.

The active queries that arrived before this QS-advance and were accessing QS1 continue to access the same snapshot, which is now QS0. The next QS-advance will take place when the last of these completes. As an example, Figure 1 shows a scenario for a query snapshot advance and the versions that various queries access. A QS-advance is automatically initiated at the instant when query Q0 completes. Compared with traditional multiversioning schemes, in which the entire version history may have to be maintained for potential query access, DFV only keeps three versions (those marked with a circle in Figure 1), and allows query snapshots to advance automatically. Note that we only show the version history of a single page and do not include the working version in Figure 1. Queries Q1 and Q2 both read the slightly out-of-date version A1 in DFV, instead of versions A2 and A4, respectively, as in traditional multiversioning approaches.

[Figure 1: A QS-advance scenario and version access (↑: query arrival, ↓: query completion). The version history A0 through A6 of a single page is plotted against time, with snapshot times t0 and t1 and the arrivals and completions of queries Q0 through Q3; a table beneath contrasts the versions accessed by Q0-Q3 under traditional multiversioning (A0, A2, A4, A6) with those accessed under dynamic finite versioning (A0, A1, A1, A5).]
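The footprint and active-list mechanism described above is small enough to express directly. The following C sketch (all type and variable names are ours, not the paper's) illustrates how establishing a consistent logical snapshot reduces to copying the active transaction list and recording the current time:

    /* A minimal sketch of snapshot establishment: a consistent logical
       snapshot is just a copy of the active transaction list plus the
       current time.  No quiescing and no disk access are involved. */
    #include <string.h>
    #include <time.h>

    #define MAX_ACTIVE 1024

    typedef struct {
        int  ids[MAX_ACTIVE];  /* transactions active at snapshot time */
        int  n;
        long taken_at;         /* the snapshot time (t0 or t1) */
    } Snapshot;

    static int actl_c[MAX_ACTIVE];  /* AcTLc: the current active list */
    static int actl_c_n;

    void take_snapshot(Snapshot *qs)
    {
        memcpy(qs->ids, actl_c, actl_c_n * sizeof actl_c[0]);
        qs->n        = actl_c_n;
        qs->taken_at = (long)time(NULL);
    }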

3 The design of DFV

Section 3.1 presents the storage management; Section 3.2 shows the implicit version transitions; Section 3.3 demonstrates the automatic QS-advance; Section 3.4 presents the dynamic version identification algorithms; and Section 3.5 shows the on-the-fly garbage collection.

3.1 Storage management

We first consider disk storage management. Due to the dynamic allocation and deallocation of physical pages, the sequentiality of the database may become an issue; i.e., logically contiguous pages may not be physically adjacent. One approach to approximately preserving the sequentiality of the database is to allocate the four versions of a database page on the same disk track or on a nearby track, so that the seek time can be largely reduced. This can be achieved by not fully allocating the entire track for the original database initially, leaving a certain amount of free space for version maintenance. When an update to a page is written back to disk, if only a single physical version is maintained for the page, three physically contiguous free pages on the same track or on a nearby track can be allocated. A pointer can be created on the original database page pointing to the first of the three additional pages. The update is written into one of the three newly allocated pages and the original copy remains intact. These three additional pages together can be subsequently deallocated if the logical page has not been updated for some period of time. The three-at-a-time allocation and deallocation strategies are designed to simplify storage management and better preserve the sequentiality of the database. Alternatively, a single free page can be allocated whenever one is required.

Two variables are associated with each physical version: the time stamp variable tt and the transaction-identifier variable tid. Variable tt records the time at which a physical page is updated, while tid records the identifier of the transaction updating the page. Together, tt and tid represent the "footprints" that a transaction leaves on a page, which are fixed and never require any change once created. Both variables can be maintained together with the data or separately. If maintained separately, the need to have all versions of a page in memory to compare the tt's and tid's can be avoided when the page is referenced.

Notice that the multiple versions of a page can be maintained in the database buffer in main memory (or in expanded storage, such as the one employed between main storage and disk storage by IBM's 3090 systems [26]). All four versions can be kept in the buffer memory. In this case, data structures can be maintained in memory to track the update times and transaction identifiers, and to provide the memory addresses of the multiple versions. Only when a page is to be replaced from the buffer due to buffer shortage is additional disk space allocated, if needed, to store the multiple versions of the page. As memory becomes less expensive, more memory can be used to increase the buffer size, making it possible to implement DFV schemes mainly in the buffer with little additional requirement in disk storage. Conceptually, with a large buffer, frequently updated pages can be kept in the buffer. Less frequently updated pages will be merged back to one version by the time they are pushed out of the buffer (see Section 3.5). Hence, only a few pages on disk are expected to be stored with multiple versions.
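To make the bookkeeping concrete, here is one possible layout of the per-version footprints as a hedged C sketch; the type and field names, and the 4 KB page size, are our assumptions rather than the paper's:

    /* Each physical page copy carries only its two time-invariant
       footprints, recorded once when the copy is written and never
       changed afterwards. */
    typedef struct {
        long tt;          /* time stamp of the update that wrote this copy */
        int  tid;         /* identifier of the updating transaction */
        char data[4096];  /* page contents (assumed 4 KB pages) */
    } PageCopy;

    /* With M = 2 snapshots, a logical page needs at most M + 2 = 4 copies;
       a cold page shrinks back to a single copy (Section 3.5). */
    typedef struct {
        PageCopy copies[4];  /* copies[0] = most recently updated */
        int      ncopies;    /* between 1 and 4 */
    } LogicalPage;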

3.2 Implicit version transitions

Due to QS-advances and transaction updates, the logical version(s) that a page copy represents may change continually. Even though at most four physical copies are maintained for each database page, there are five possible logical states that a page copy may represent: W, NQ, CQ, PQ and R. W is the working version; i.e., the page has been updated by an uncommitted transaction. CQ is the current-query version; it is the version updated by the latest transaction committed before the current query snapshot time, denoted as t1. NQ is the next-query version; it is the most recently committed version. When a transaction is committed, all the W-version pages that it has created become NQs. PQ is the previous-query version; it is the version updated by the latest transaction committed before the previous query snapshot time, denoted as t0. R is reusable; it can be overwritten by a new transaction update.

However, version transitions occur implicitly: W, NQ, CQ, PQ, and R are invisible labels. The only auxiliary information maintained about a page copy is a time stamp tt and a transaction identifier tid, which are recorded when the page copy is created. After its creation, there is no need to re-label the page copy; version transitions are recognized later, when the correct versions for transaction and query accesses are identified. At any instant, for any database page, there always exist at least three logical versions: NQ, CQ and PQ. However, a single physical version may simultaneously represent multiple logical versions, such as (PQ, CQ), (CQ, NQ), or (PQ, CQ, NQ), where (PQ, CQ, NQ) means that a single physical version simultaneously represents the logical NQ, CQ and PQ versions of that page. The exact logical version(s) that a page copy represents change implicitly when (a) a transaction updates the page, (b) the updating transaction is committed, and (c) query snapshots advance. The rules for version transitions are as follows.

VT-rule 1: On a transaction write (Tx-write), the new data are written into the W version, if there already exists a W; otherwise, the new data are written into the most recently updated R. In both cases, the updated version becomes the new W version.

VT-rule 2: On a transaction commit (Tx-commit), W becomes NQ. Furthermore, the original NQ becomes R, if it is singly represented by a page copy. In other words, the original NQ is now neither the most recently committed version nor a query version, and thus can be discarded.

As an example, Figure 2 shows some version transitions for a logical page from the initial state. (For space limitations, we do not show the entire set of version transitions for the page.) Each arc is labelled with the cause of the transition and a number for reference purposes.

[Figure 2: Some version transitions for a page from the initial state. Each state shows which logical versions (W, NQ, CQ, PQ, R) the page copies P1 through P4 represent; arcs are labelled with their cause (Tx-write, Tx-commit, or QS-advance) and numbered 1 through 15 for reference. In the initial state, a single copy P1 represents (PQ, CQ, NQ) and the remaining copies are R.]

Transitions 3, 5 and 12 in Figure 2 demonstrate the scenarios for version transitions on a Tx-commit.

Transition 3 shows page copy P2 becoming NQ from W, while page copy P1 changes from (PQ, CQ, NQ) to (PQ, CQ). Transition 5 shows page copy P2 changing from NQ to R, because P2 originally represented only NQ.

VT-rule 3: On a QS-advance, triggered because the last query accessing QS0 has completed, (i) the original PQ becomes R, if it is singly represented by a page copy (see transitions 16-19 in Figure 3); (ii) the original CQ becomes PQ (see transitions 18 and 19 in Figure 3); and (iii) the original NQ becomes both CQ and NQ (see transitions 9 and 10 in Figure 2 and transitions 18 and 19 in Figure 3). Transitions 1 and 8 in Figure 2 demonstrate that nothing happens when PQ, CQ, and NQ are all represented by a single page copy.

[Figure 3: More version transitions. The figure continues Figure 2 with transitions 11 through 19, showing how the labels PQ, CQ, NQ, W and R migrate among the page copies P1 through P4 under further Tx-write, Tx-commit, and QS-advance events.]
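Although DFV never materializes the labels W, NQ, CQ, PQ and R, the three VT-rules can be checked with a toy simulation that does materialize them. The C sketch below is our own construction, for illustration only; it encodes each copy's label set as a bitmask, with the empty set standing for R:

    #include <stdio.h>

    enum { W = 1, NQ = 2, CQ = 4, PQ = 8 };  /* label set bits; empty set = R */

    #define NCOPIES 4
    static int  label[NCOPIES] = { NQ | CQ | PQ, 0, 0, 0 };  /* initial state */
    static long stamp[NCOPIES] = { 1, 0, 0, 0 };             /* update recency */
    static long clk = 1;

    static int find(int l) {
        for (int i = 0; i < NCOPIES; i++)
            if (label[i] & l) return i;
        return -1;
    }

    /* VT-rule 1: write into W if one exists, otherwise into the most
       recently updated R; the written copy becomes the new W. */
    static void tx_write(void) {
        int i = find(W);
        if (i < 0)
            for (int j = 0; j < NCOPIES; j++)
                if (label[j] == 0 && (i < 0 || stamp[j] > stamp[i])) i = j;
        label[i] = W;
        stamp[i] = ++clk;
    }

    /* VT-rule 2: W becomes NQ; the old NQ drops its NQ label and, if it
       represented nothing else, is left as R. */
    static void tx_commit(void) {
        int n = find(NQ), w = find(W);
        if (n >= 0) label[n] &= ~NQ;
        if (w >= 0) label[w] = NQ;
    }

    /* VT-rule 3: a lone PQ becomes R, CQ becomes PQ, and NQ becomes
       (CQ, NQ); a copy representing (PQ, CQ, NQ) is left unchanged. */
    static void qs_advance(void) {
        for (int i = 0; i < NCOPIES; i++) {
            int nw = label[i] & W;
            if (label[i] & CQ) nw |= PQ;
            if (label[i] & NQ) nw |= CQ | NQ;
            label[i] = nw;
        }
    }

    int main(void) {  /* replays transitions 2 and 3 of Figure 2 */
        tx_write();   /* P2 becomes W */
        tx_commit();  /* P2 -> NQ, P1 -> (PQ, CQ) */
        for (int i = 0; i < NCOPIES; i++)
            printf("P%d label mask: %d\n", i + 1, label[i]);
        return 0;
    }

Running the driver reproduces transition 3: copy P2 ends up as NQ and copy P1 as (PQ, CQ), exactly as in the figure.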

3.3 Automatic QS-advance

When a transaction is started, its identifier is recorded in the current active transaction list, AcTLc, and is removed from AcTLc when the transaction commits. To simplify the presentation, we assume that all the updates by a transaction are written back to the database when it is committed. (As indicated in Section 3.1, the multiple versions can be maintained in the database buffer. Therefore, it is also possible to allow for deferred writes, as is allowed by typical database systems, such as IBM's DB2 [23].) When an updated page is written to disk, the time of the update and the transaction ID are recorded. Notice that, unlike traditional time stamping approaches, it is not the transaction initiation time or the transaction commit time, but the page update time, that is recorded. Obviously, the initiation time does not reveal the commit order. If the commit time were used instead, then either additional I/Os may be required to re-read the pages already on disk when a transaction is to be committed, or uncommitted updates cannot be written back to the database, as in the case to be discussed in Section 4. With the use of update times and active transaction lists, the order of commit times relative to query snapshot times can be derived dynamically.

We maintain two transaction-consistent logical snapshots: the current query snapshot QS1 and the previous query snapshot QS0. A query snapshot is identified by the snapshot time and the active transaction list at the snapshot time. QS1 is the consistent logical snapshot taken at time t1, and is identified by t1 and AcTL1. Similarly, QS0 is the consistent logical snapshot taken at time t0, and is identified by t0 and AcTL0. Also associated with each query snapshot is an active query list: QL1, the list of active queries accessing QS1, and QL0, the list of active queries accessing QS0. When a query is started, it is assigned a time stamp of its initiation time and its ID is added to QL1. When a query completes, its ID is deleted from QL1, or from QL0 if a QS-advance has been issued since the query started. Queries initiated after t1 read QS1; queries initiated between t0 and t1 read QS0. (Note that, since newly arriving queries are only added to QL1, QL0 will eventually become empty.)

A query snapshot can be advanced in real time, without quiescing the transaction or query processing, when the associated active query list becomes empty. In other words, when all active queries accessing a query snapshot have finished, the query snapshot is no longer needed and can be advanced immediately, and the page copies representing the no-longer-needed snapshot can be overwritten by new updates. We first present the QS-advance due to QL0 becoming empty. In the discussion of QS-advance, we assume that t0 ≠ t1; therefore, a query snapshot is advanced when no active queries are accessing it. In fact, there are occasions when t0 and t1 are identical, such as at time zero and when there have been no queries active in the system for some period of time. In such a case where t0 = t1, the system can make t1 different from t0 by advancing QS1 after a fixed amount of time has elapsed since the arrival of the first subsequent query. The following updates to the pertinent data structures, which are maintained in memory, accomplish a QS-advance due to QL0 becoming empty:

QL0 ← QL1; AcTL0 ← AcTL1; t0 ← t1;
QL1 ← empty; AcTL1 ← AcTLc; t1 ← current time.

Notice that the manipulation of these data structures is the only action required for advancing a query snapshot, and it is done without any disk access; no quiescing of either the transaction or query processing is required. Moreover, when a query is first started, it always reads the current query snapshot. During the course of query processing, query snapshots may advance, and the "then" QS1 may become the "new" QS0.

Algorithm Locate-CQ:
  i = 0;
  while (tt[Pi] > t1) i = i + 1;
  if (tid[Pi] ∈ AcTL1) return version Pi+1;
  else return version Pi;

Algorithm Locate-PQ:
  i = 0;
  while (tt[Pi] > t0) i = i + 1;
  if (tid[Pi] ∈ AcTL0) return version Pi+1;
  else return version Pi;

Figure 4: Algorithms for identifying CQ and PQ.

However, it is only the name of the snapshot that has changed. The queries reading the "new" QS0 are still reading the same snapshot that they were reading before the QS-advance.

QS1 can also be independently advanced if QL1 becomes empty. As a result, upcoming queries can access a more up-to-date version of the database even though there are still active queries accessing QS0. When a new query arrives and there are no other active queries accessing QS1, a new QS1 can be taken immediately, and the newly arrived query will access the database from the immediately taken snapshot. To advance such a query snapshot:

QL1 ← empty; AcTL1 ← AcTLc; t1 ← current time.

Since a query snapshot is allowed to advance only when the associated query list is empty, an exceptionally long-running query, such as one that involves reading the entire database, may cause upcoming queries to read a very obsolete version of the database. To improve the up-to-dateness of upcoming query accesses, the exceptionally long-running query can be deliberately assigned a time stamp that is less than t1, instead of its initiation time, and is therefore forced to access QS0. QS1, as a result, can be advanced more frequently. Such an approach requires the transaction/query manager to be informed in advance that a query is exceptionally long-running. (However, if a long-running query immediately follows a short-running query and both queries are logically related and need to access the same query snapshot, then the long-running query must so inform the scheduler to prevent itself from being forced to access a different, older snapshot.)
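Both advance paths manipulate only a handful of in-memory variables. A hedged C sketch of the two cases follows; the type and variable names are our own stand-ins for AcTLc, AcTL1, AcTL0, QL1, QL0, t1 and t0:

    #include <time.h>

    typedef struct { int ids[1024]; int n; } TxList, QueryList;

    static TxList    actl_c, actl_1, actl_0;  /* AcTLc, AcTL1, AcTL0 */
    static QueryList ql_1, ql_0;              /* QL1, QL0 */
    static long      t_1, t_0;                /* snapshot times t1, t0 */

    /* QS-advance when QL0 becomes empty: QS0 <- QS1, QS1 <- new snapshot. */
    void advance_when_ql0_empty(void)
    {
        ql_0 = ql_1;  actl_0 = actl_1;  t_0 = t_1;
        ql_1.n = 0;   actl_1 = actl_c;  t_1 = (long)time(NULL);
    }

    /* Independent advance of QS1 when QL1 becomes empty. */
    void advance_when_ql1_empty(void)
    {
        ql_1.n = 0;   actl_1 = actl_c;  t_1 = (long)time(NULL);
    }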

3.4 Dynamic version identification

To simplify the presentation, we introduce some notation. Assume in the following discussion that there are four page copies of a page L, denoted Pj for j = 0, 1, 2, 3, such that P0 is the most recently updated version and P3 is the least recently updated version. This update order can be derived from the time stamps when a page is referenced; alternatively, for versions in memory, a directory indicating the update order of the versions of each page can be maintained. (If there are fewer than four page copies, pseudo copies Pj can be made up with a negative tt[Pj].) In addition, tt[Pj] denotes the time stamp variable tt associated with page copy Pj, and tid[Pj] denotes the transaction-identifier variable tid associated with Pj.

Transaction Read

A transaction read (Tx-read) gets the most recently updated version. Notice that the most recently updated version may not be the most recently committed version; it may still be a working version.

Query Read

The CQ or PQ version is returned on a query read (Q-read), according to the initiation time stamp of the query. CQ is returned when the time stamp of the requesting query is greater than t1; PQ is returned when the time stamp of the requesting query is greater than t0 but less than t1. Figure 4 shows the algorithms for identifying CQ and PQ. Algorithm Locate-CQ returns the page copy representing CQ, and algorithm Locate-PQ, likewise, returns the page copy representing PQ. If page copy Pi represents CQ, its time stamp is the largest one that is less than t1 and its transaction identifier is not in AcTL1 (the active transaction list at t1). If page copy Pi represents PQ, its time stamp is the largest one that is less than t0 and its transaction identifier is not in AcTL0 (the active transaction list at t0). Both algorithms are simple and do not require lengthy computation, making rapid version identification possible.

Figure 5 shows an example of using time stamps and transaction identifiers for dynamic version identification by Locate-CQ and Locate-PQ. Two queries are active: Q1 and Q2. Q1, started between t0 and t1, will access PQ; Q2, started after t1, will access CQ. Notice that the only data structures maintained with each page copy are a transaction identifier, tid, and a time stamp, tt. When page A is referenced by query Q2, page copy P1, the CQ version, is returned by Locate-CQ because its time stamp is the largest one that is less than t1 and its tid is not in AcTL1. When page B is referenced by queries Q1 and Q2, page copy P0 is returned by both Locate-CQ and Locate-PQ because it simultaneously represents the CQ and PQ versions of page B.

[Figure 5: A version identification example. The snapshot times are t0 = 5 and t1 = 8; query Q1 arrived between t0 and t1 and query Q2 arrived after t1; AcTLc = {TX4}, AcTL1 = {TX4}, AcTL0 = {}, QL1 = {Q2}, and QL0 = {Q1}. For each of the pages A, B, and C, the figure lists the four page copies P3 through P0 with their tid and tt footprints and the logical states (W, NQ, CQ, PQ, R) that they represent.]

Note that time stamps alone are not sufficient to decide the exact version that a page copy represents. Consider the case where a page was updated by a transaction before t1 but the transaction committed after t1. When the transaction committed, its identifier was deleted only from AcTLc. As a result, at a subsequent time the page copy is an NQ version with respect to the current database, but it should not be included in QS1, even though its time stamp is less than t1 and it is now a committed page copy. With the maintenance of AcTL1 and AcTL0, a version that was a working version at the time a snapshot was taken can be correctly recognized. As an example, in Figure 5, suppose TX4, started at time 6, commits at time 10, and its identifier is deleted from AcTLc at time 10. When query Q2 accesses page A at time 11, page copy P0 is not included in the current query snapshot, even though P0 has been NQ since time 10.
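To make Figure 4 concrete, the following self-contained C program runs the common skeleton of Locate-CQ and Locate-PQ against a four-copy page modeled loosely on page A of Figure 5 (t0 = 5, t1 = 8, AcTL1 = {TX4}, AcTL0 = {}); the exact tt values below are our own reconstruction, chosen to be consistent with the behavior described in the text:

    #include <stdio.h>

    typedef struct { long tt; int tid; } Copy;

    /* P[0] is the most recently updated copy, P[3] the least recent.
       P[0] was written at time 7 by TX4, which was still active at t1. */
    static Copy P[4] = { {7, 4}, {6, 3}, {3, 2}, {1, 1} };

    static long t_1 = 8, t_0 = 5;
    static int  actl1[] = { 4 };  /* AcTL1 = {TX4} */
    static int  n_actl1 = 1;
    static int *actl0   = 0;      /* AcTL0 = {}    */
    static int  n_actl0 = 0;

    static int in(int tid, const int *list, int n) {
        for (int i = 0; i < n; i++)
            if (list[i] == tid) return 1;
        return 0;
    }

    /* Skip copies newer than the snapshot time, then skip one more if the
       writer of the first survivor was still active at the snapshot. */
    static int locate(long t, const int *actl, int n) {
        int i = 0;
        while (P[i].tt > t) i = i + 1;
        return in(P[i].tid, actl, n) ? i + 1 : i;
    }

    int main(void) {
        printf("CQ is P%d\n", locate(t_1, actl1, n_actl1));  /* prints P1 */
        printf("PQ is P%d\n", locate(t_0, actl0, n_actl0));  /* prints P2 */
        return 0;
    }

A query with a time stamp greater than t1 would be routed to locate(t_1, ...), and one with a time stamp between t0 and t1 to locate(t_0, ...).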

Transaction Write

The policy for a transaction write (Tx-write) is to write into the W version, if there is a W; otherwise, to write into the most recently updated R version. A W version is the page copy with the largest time stamp whose transaction ID is in AcTLc (the current active transaction list). The CQ, PQ, and NQ versions are kept intact, making noninterfering, concurrent transaction and query processing possible. To avoid overwriting a CQ, NQ or PQ version, these versions have to be identified correctly and dynamically. If page copy Pi represents an NQ version, it is the most recently committed version. An NQ version is recognized by the fact that it is the most recently updated committed copy; i.e., it has the largest time stamp among the copies whose transaction IDs are not in AcTLc. Figure 6 shows the algorithms for locating W and the most recently updated R. Algorithm Locate-WR0 locates the appropriate version for a Tx-write, assuming that a QS-advance can only occur when QL0 becomes empty.

Algorithm Locate-WR0:
  if (tid[P0] ∈ AcTLc) return version P0;
  else {
    Pm = Locate-CQ; Pn = Locate-PQ;
    if (m = 2 or (n − m = 2)) return version P1;
    else return version Pn+1;
  }

Algorithm Locate-WR:
  if (tid[P0] ∈ AcTLc) return version P0;
  else {
    Pm = Locate-CQ; Pn = Locate-PQ;
    if (n = 3 and m = 1) return version P2;
    else if (m = 2 or (n − m ≥ 2)) return version P1;
    else return version Pn+1;
  }

Figure 6: Algorithms for identifying W and the most recently updated R.

Algorithm Locate-WR locates the appropriate version for a Tx-write when a QS-advance can be initiated because either QL0 or QL1 becomes empty. In the following, we only outline the basic ideas of both algorithms; detailed correctness proofs can be found in [20]. Assume that Pm is CQ and Pn is PQ. Consider first the simple case where a QS-advance can only occur when QL0 becomes empty. In this case, the most recently updated R is either P1 or Pn+1. When the most recently updated R is P1, then NQ is P0, and m = 2 or n − m = 2. However, if QS-advances can occur because either QL0 or QL1 becomes empty, the most recently updated R can be P1, P2, or Pn+1. When the most recently updated R is P2, then P0 is NQ, m = 1, and n = 3. When the most recently updated R is P1, then P0 is NQ, and m = 2 or n − m ≥ 2.
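Continuing the sketch from Section 3.4 (the Copy array P, locate(), in(), t_1, t_0, and the actl1/actl0 lists are assumed to be in scope as defined there), Locate-WR can be rendered as follows; the AcTLc contents are hypothetical:

    static int actlc[] = { 5 };  /* AcTLc: hypothetical contents */
    static int n_actlc = 1;

    /* Locate-WR: reuse W if the newest copy is uncommitted; otherwise pick
       the most recently updated R, identified from the positions m (CQ)
       and n (PQ) exactly as in Figure 6. */
    int locate_wr(void)
    {
        if (in(P[0].tid, actlc, n_actlc))     /* newest copy is a W */
            return 0;
        int m = locate(t_1, actl1, n_actl1);  /* index of CQ */
        int n = locate(t_0, actl0, n_actl0);  /* index of PQ */
        if (n == 3 && m == 1)     return 2;
        if (m == 2 || n - m >= 2) return 1;
        return n + 1;
    }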

3.5 On-the-fly garbage collection

When the most recently updated version of a page is the NQ version, and it was created before t0, then the PQ, CQ and NQ versions are all represented by a single page copy, and the other three pages can be garbage collected. To preserve the sequentiality of the database, the combined PQ, CQ and NQ version can be copied back to the initially allocated page, if needed, and the three contiguous physical pages are set free. Garbage collection can be done whenever a read operation by a transaction or a query detects such a condition, or whenever a Tx-write needs three additional pages. Notice that garbage collection never prematurely reclaims any storage representing the CQ or PQ version. As a result, there is no forced abort of long-running queries due to early garbage collection to make room for new transaction updates.
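The collapse condition itself is a two-line test on the newest copy's footprints. A sketch in the same style as the earlier examples (P, in(), t_0, and the hypothetical actlc list are as defined there):

    /* A page can shrink to one copy when its newest version is committed
       (so it is the NQ) and was created before t0 (so it also represents
       CQ and PQ).  The remaining copies are then reclaimable. */
    int can_collapse(void)
    {
        return !in(P[0].tid, actlc, n_actlc)  /* committed: P[0] is NQ */
            &&  P[0].tt < t_0;                /* older than both snapshots */
    }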

4 Other design trade-offs

4.1 More than four versions

The design of the DFV scheme described in Section 3 generalizes to more than four versions, where more than two query versions can be maintained. The very same data structures are maintained for each snapshot, and similar dynamic version identification schemes can be applied. Automatic QS-advance can also be applied when any of the snapshots has no query accessing it. This is a unique feature of this DFV scheme that helps substantially reduce the storage overhead and obsolescence, especially for a large M [17], as compared with a scheme that only allows a QS-advance when there is no query accessing the oldest snapshot. With more query versions, the frequency of query snapshot advances increases on average, and therefore queries will read more up-to-date consistent data. This is especially useful for long-running queries, because we can force them to access the oldest logical snapshot while the other snapshots continue to be advanced regularly.

4.2 Three versions

When the maximum number of versions maintained for each page is three, the DFV approach described in Section 3 is still applicable, but with more restricted functionality. There are two alternatives for implementing a dynamic three-versioning scheme. The first alternative is to use only one query version; namely, there is no PQ version, and only the CQ, NQ, W and R versions exist. As a result, a query snapshot cannot be advanced while there are still active queries in the system. (This approach can work only if the query arrival rate is very low.) Updates by uncommitted transactions are still allowed to be written into the database at any instant.

The second alternative is to prohibit the updates of uncommitted transactions from being written into the database; only the updates of committed transactions are allowed to be written into the database. In other words, the STEAL policy is not allowed, and the W version does not exist. Since there is no W version, the time stamp of a version represents both the version's creation time and its commit time. Therefore, only the time stamps are required for dynamic version identification; the active transaction lists are not. Without the active transaction lists, algorithms Locate-PQ and Locate-CQ are greatly simplified: the PQ version is the version with the largest time stamp that is less than t0, and the CQ version is the one with the largest time stamp that is less than t1. Without the W version, when a new version is created, the NQ version may have to be overwritten. Overwriting the NQ occurs when it is singly represented by a page copy, which means the original NQ was created after t1 and would never become the CQ version once the new version is created. In order to guarantee the consistency of a query snapshot in the dynamic three-versioning scheme, no transaction commit is allowed during the period when a query snapshot is being advanced. (Note that, with the use of the W version, committing a transaction in the original four-version scheme reduces to removing the transaction identifier from AcTLc. As long as the action of removing the identifier is atomic, transaction commits are allowed during the period when a query snapshot is taken.) Because no active transaction list is maintained, advancing a query snapshot involves even simpler operations: only t1, t0, QL1, and QL0 need to be changed. As a result, the impact of a query-snapshot advancement on delaying transaction commits is negligible.
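Under this second (no-STEAL) alternative, version identification degenerates to a pure time-stamp search. A minimal sketch, again reusing the Copy array P from the earlier examples:

    /* With no W versions, every copy is committed at creation, so the
       active transaction lists are unnecessary: CQ is simply the copy
       with the largest tt below t1, and PQ the one below t0. */
    static int locate_by_time(long t) {
        int i = 0;
        while (P[i].tt > t) i = i + 1;  /* copies are ordered, newest first */
        return i;                       /* CQ for t = t1, PQ for t = t0 */
    }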

5 Related work

5.1 Two versions

Bayer et al. and Stearns and Rosenkrantz have proposed multiversion concurrency protocols using two versions [15, 16]. Data contention is reduced, but not eliminated. The increase in the level of concurrency is limited because only one single old version is maintained, and queries may still compete with transactions through special locking protocols.

5.2 Unrestricted number of versions

To serialize a query before all concurrently active transactions whenever it arrives, every old version of a data item may need to be kept, so that any incoming query can access the consistent snapshot as of its arrival time. As a result, a large and unrestricted number of versions may be maintained for a given data item. Reed has proposed a scheme which conceptually keeps forever every version of a data item ever created [13]. Garbage collection was not fully addressed; read-only queries may have to be aborted if certain old versions that they need are no longer available. Chan and Gray have developed a version management technique using a ring buffer as the version pool to store old versions of data items [11, 12]. When the ring buffer overflows, some old versions may have to be discarded to make room for versions to be created by future transaction updates. As a result, queries may also have to be aborted, as in Reed's approach.


The possibility of aborting a query due to early garbage collection can be eliminated by a scheme developed by Weihl [14]. However, this is achieved at the cost of imposing a complex initiation phase on query execution: before a query can start accessing any database item, it has to ensure that the appropriate versions of all the data items that it needs are available, and they are registered to avoid early garbage collection. Agrawal and Sengupta have proposed a modular-synchronization design to separate version control from concurrency control in multiversion databases [8]. Read-only transactions undergo no concurrency control, but the version control mechanism may maintain an unrestricted number of versions for a data item.


5.3 Fixed number of versions


Recently, various implementations which employ only a fixed number of versions have been independently proposed [20, 10, 18, 19]. Bober and Carey proposed an extension to the ring-buffer approach developed in [11, 12], using record-level versioning and on-page caching for storing prior versions [10]. They also introduced the concept of view sharing, which groups queries together to run against the same transaction-consistent view of the database. However, their implementation does not support the STEAL policy, since a record has to be stamped with the commit time stamp of the transaction that updates it. (Supporting the STEAL policy is one of the major focuses of the DFV schemes.) Moreover, there is no clear notion of snapshot advancement in [10] as there is in the DFV scheme. Mohan et al. [18] also proposed methods for maintaining transient versioning of records to avoid locking by read-only transactions. Unlike the DFV scheme (with M > 2 snapshots), which allows any of the snapshots to be advanced when there is no query reading it, the implementation in [18] only allows the oldest snapshot to be advanced. As indicated in [17], when M is large, this distinct feature of DFV can result in a substantial reduction in both the storage overhead and query obsolescence. An implicit versioning scheme, developed by Dias et al. [19], also allows queries to read a consistent snapshot of the database. However, the proposal in [19] mainly presents a new architecture for concurrent transaction and query processing: two different database engines, one for transaction processing and the other for query processing, are proposed, and a time step is periodically advanced to synchronize the database state for query processing with that for transaction access.

5.4 One version

There are proposals that do not use multiple versions to support more efficient processing of read-only actions [2, 27]. These schemes intrinsically have very severe data contention between transactions and queries, if serializability is to be observed. Pu has proposed an algorithm to read the entire single-version database concurrently with normal transaction processing [2]. His objective is to use a special locking protocol to realize an on-the-fly, incremental, and consistent reading of the entire database. Garcia-Molina and Wiederhold have considered using the semantic "read-only" information to process queries efficiently [27]. Read-only queries are classified into different types, and their correctness criteria are defined accordingly. Their emphasis, however, is on systems where the state of an individual data item is replicated, and their goal is primarily to process a given read-only action locally, without making accesses to remote sites.

6 Summary

In this paper, we presented an efficient versioning approach, called dynamic finite versioning (DFV), to support concurrent transaction and query processing. In this DFV scheme, there is no blocking between update transactions and read-only queries, no quiescing of either transactions or queries to allow the formation of a more up-to-date query version of the database, and no forced abort of long-running queries due to early garbage collection to make room for new transaction updates. Without locking, queries access appropriate consistent, but perhaps slightly out-of-date, logical database snapshots; transactions, on the other hand, access the most recent data in the database through a typical concurrency control mechanism. Data contention is eliminated without incurring severe storage overhead. In DFV, only a fixed and small number of logical versions are dynamically maintained for each database page. A new update may overwrite an obsolete version, and a physical page copy may simultaneously represent multiple logical versions. The exact version(s) that a physical page copy represents may change implicitly from time to time, and are identified dynamically when the page is referenced. Garbage collection can be performed efficiently upon reference when a page has not been updated for some period of time and it is recognized that a single copy is sufficient to represent the required logical versions.

One major contribution of DFV is its unique design, in which the STEAL policy is allowed and, at the same time, logical snapshots can be automatically advanced without quiescing the ongoing transactions or queries. An efficient mechanism using separate data structures to represent time-invariant and time-varying information was developed to facilitate automatic query-snapshot advancement without quiescing the transaction or query processing. The time-invariant information represents the footprints a transaction leaves on a page copy, and is associated with each page copy. The time-varying information represents the system state,

and is kept in memory. Appropriate versions are dynamically identified for transaction and query accesses upon reference by comparing the invariant footprint information with the varying system state information.

Acknowledgement

The authors would like to thank Prof. Calton Pu of Oregon Graduate Institute for his helpful comments on an earlier draft of this paper.

References

[1] H. Pirahesh et al., "Parallelism in relational data base systems: Architectural issues and design approaches," in Proc. of 2nd Int. Symp. on Databases in Parallel and Distributed Systems, pp. 4-29, 1990.
[2] C. Pu, "On-the-fly, incremental, consistent reading of entire databases," in Proc. of Very Large Data Bases, pp. 369-375, 1985.
[3] M. J. Carey and W. A. Muhanna, "The performance of multiversion concurrency control algorithms," ACM Trans. on Computer Systems, vol. 4, no. 4, pp. 338-378, Nov. 1986.
[4] J. N. Gray et al., "Granularity of locks and degrees of consistency in a shared data base," in Proc. of IFIP TC-2 Working Conference on Modelling in Data Base Management Systems (G. M. Nijssen, ed.), pp. 1-29, North-Holland, 1976.
[5] K.-L. Wu, P. S. Yu, and C. Pu, "Divergence control for epsilon-serializability," in Proc. of Int. Conf. on Data Engineering, pp. 506-515, 1992.
[6] C. Pu and A. Leff, "Replica control in distributed systems: An asynchronous approach," in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 377-386, 1991.
[7] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[8] D. Agrawal and S. Sengupta, "Modular synchronization in multiversion databases: Version control and concurrency control," in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 408-417, 1989.
[9] P. A. Bernstein and N. Goodman, "Multiversion concurrency control: Theory and algorithms," ACM Trans. on Database Systems, vol. 8, no. 4, pp. 465-483, Dec. 1983.
[10] P. M. Bober and M. J. Carey, "On mixing queries and transactions via multiversion locking," in Proc. of Int. Conf. on Data Engineering, pp. 535-545, 1992.
[11] A. Chan et al., "The implementation of an integrated concurrency control and recovery scheme," in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 184-191, 1982.

[12] A. Chan and R. Gray, "Implementing distributed read-only transactions," IEEE Trans. on Software Engineering, vol. SE-11, no. 2, pp. 205-212, Feb. 1985.
[13] D. P. Reed, "Implementing atomic actions on decentralized data," ACM Trans. on Computer Systems, vol. 1, no. 1, pp. 3-23, Feb. 1983.
[14] W. E. Weihl, "Distributed version management for read-only actions," IEEE Trans. on Software Engineering, vol. SE-13, no. 1, pp. 55-64, Jan. 1987.
[15] R. Bayer, H. Heller, and A. Reiser, "Parallelism and recovery in database systems," ACM Trans. on Database Systems, vol. 5, no. 2, pp. 139-156, June 1980.
[16] R. E. Stearns and D. J. Rosenkrantz, "Distributed database concurrency control using before-values," in Proc. of the 1981 ACM SIGMOD Int. Conf. on Management of Data, pp. 74-83, 1981.
[17] A. Merchant, K.-L. Wu, P. S. Yu, and M.-S. Chen, "Performance analysis of dynamic finite versioning for concurrent transaction and query processing," in Proc. of 1992 ACM SIGMETRICS and PERFORMANCE '92, pp. 103-114, 1992.
[18] C. Mohan, H. Pirahesh, and R. Lorie, "Efficient and flexible methods for transient versioning of records to avoid locking by read-only transactions," in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 124-133, 1992.
[19] D. M. Dias, A. Goyal, and F. N. Parr, "An intelligent page store for concurrent transaction and query processing," in Proc. of 2nd Int. Workshop on Research Issues on Data Engineering: Transaction and Query Processing, pp. 12-19, 1992.
[20] K.-L. Wu, P. S. Yu, and M.-S. Chen, "Dynamic finite versioning for concurrent transaction and query processing," Tech. Rep. RC 16633, IBM T. J. Watson Research Center, Mar. 1991.
[21] P. M. Bober and D. M. Dias, "Storage cost tradeoffs for multiversion concurrency control," Tech. Rep. RC 18367, IBM T. J. Watson Research Center, Jan. 1992.
[22] T. Haerder and A. Reuter, "Principles of transaction-oriented database recovery," ACM Computing Surveys, vol. 15, no. 4, pp. 287-317, Dec. 1983.
[23] J. Z. Teng and R. A. Gumaer, "Managing IBM Database 2 buffers to maximize performance," IBM Systems Journal, vol. 23, no. 2, pp. 211-218, 1984.
[24] J. P. Strickland, P. P. Uhrowczik, and V. L. Watts, "IMS/VS: An evolving system," IBM Systems Journal, vol. 21, no. 4, pp. 490-511, 1982.
[25] D. J. Haderle and R. D. Jackson, "IBM Database 2 overview," IBM Systems Journal, vol. 23, no. 2, pp. 112-125, 1984.
[26] E. I. Cohen, G. M. King, and J. T. Brady, "Storage hierarchies," IBM Systems Journal, vol. 28, no. 1, pp. 62-76, 1989.
[27] H. Garcia-Molina and G. Wiederhold, "Read-only transactions in a distributed database," ACM Trans. on Database Systems, vol. 7, no. 2, pp. 209-234, June 1982.