WWW 2007 / Track: Performance and Scalability / Session: Scalable Systems for Dynamic Content

GlobeTP: Template-Based Database Replication for Scalable Web Applications

Tobias Groothuyse, Swaminathan Sivasubramanian, Guillaume Pierre
Vrije Universiteit, De Boelelaan 1081a, Amsterdam, The Netherlands
tobias.groothuyse@xs4all.nl, swami@cs.vu.nl, gpierre@cs.vu.nl

ABSTRACT

Generic database replication algorithms do not scale linearly in throughput, as all update, deletion and insertion (UDI) queries must be applied to every database replica. Throughput is therefore limited to the point where the number of UDI queries alone is sufficient to overload one server. In such scenarios, partial replication of a database can help, as UDI queries are executed by only a subset of all servers. In this paper we propose GlobeTP, a system that employs partial replication to improve database throughput. GlobeTP exploits the fact that a Web application's query workload is composed of a small set of read and write templates. Using knowledge of these templates and their respective execution costs, GlobeTP provides database table placements that produce significant improvements in database throughput. We demonstrate the efficiency of this technique using two different industry-standard benchmarks. In our experiments, GlobeTP increases throughput by 57% to 150% compared to full replication, while using an identical hardware configuration. Furthermore, adding a single query cache improves throughput by another 30% to 60%.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.4 [Performance of Systems]: Design Studies; H.3.4 [Information Storage and Retrieval]: Systems and Software.

General Terms: Performance.

Keywords: Database replication, partial replication, scalability, Web applications.

1.
INTRODUCTION

Over the past few years the World-Wide Web has taken on significant importance in our lives, and many businesses and public services now rely on it as their primary communication medium. This drives the need for scalable hosting architectures capable of supporting arbitrary levels of load with acceptable performance. However, while this problem is now well understood for static content [11, 14, 20, 22], providing scalable infrastructures for hosting dynamically generated Web content remains a challenge.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

Dynamic Web content allows Web sites to personalize the delivered content for individual clients, and to take action upon certain requests, such as processing an order in an e-commerce site. Content is dynamically generated upon each client request by application-specific business logic, which typically issues one or more queries to an underlying database. Numerous systems for scalable hosting of Web applications have been proposed. These systems typically cache (fragments of) the generated pages [7], distribute the computation across multiple application servers [1, 23], or cache the results of database queries [2, 4, 19, 27, 30]. However, although these techniques can be very effective depending on the application, in many cases their ultimate scalability bottleneck resides in the throughput of the origin database where the authoritative version of the application state is stored. Database replication techniques can of course help here, but the generic replication algorithms used by most databases do not scale linearly, as they require applying all update, deletion and insertion (UDI) queries to every database replica.
The system throughput is therefore limited to the point where the quantity of UDI queries alone is sufficient to overload one server, regardless of the number of machines employed [12]. The only solutions to this problem are to increase the throughput of each individual server or to use partial replication, so that UDI queries are executed at only a subset of all servers. However, partially replicating a database is in turn difficult, because queries can potentially span data items that are stored at different servers. Current partially replicated solutions rely either on active participation of the application programmer [15] or on one special server holding the full database to execute complex queries (and thereby becoming the new throughput bottleneck) [26].

This paper presents GlobeTP, a database replication system that exploits the fact that the database queries issued by typical Web applications belong to a relatively small number of query templates [2, 19, 27]. A query template is a parametrized SQL query whose parameter values are passed to the system at runtime. Prior knowledge of these templates allows one to select database table placements such that each query template can be treated locally by at least one server. We demonstrate that careful table placements based on the data span and the relative execution costs of different templates can provide major scalability gains in terms of sustained throughput. We further show that this technique can easily be used in combination with any existing template-based database query caching system, thereby obtaining reduced access latency and yet more throughput scalability.

This paper is organized as follows: Section 2 presents the related work. Section 3 discusses our system model, and Section 4 details our table placement algorithms. Section 5 presents performance results and Section 6 discusses a number of issues that arise from our approach. Finally, Section 7 concludes the paper.

2. BACKGROUND AND RELATED WORK

Traditional content delivery networks (CDNs) often host Web applications using techniques such as fragment caching, whereby (fragments of) the generated pages are cached at the edge servers [7, 10, 18]. This technique performs well if the temporal locality of requests is high and if the underlying database is updated rarely. However, applications that do not exhibit this behavior require more sophisticated edge-server-based techniques.

A typical edge-server architecture used by many advanced systems to host Web applications is depicted in Figure 1. Client requests are issued to edge servers located across the Internet. Each edge server has a full copy of the application code but no database. Database queries are sent to a local query cache that can answer previously issued queries without incurring wide-area latency. Cache misses and UDI queries are issued to the origin server. Queries are then potentially intercepted by another cache and, in the case of another miss, reach the origin database server to be processed.

[Figure 1: Typical Edge-Server Architecture. Clients issue requests to edge servers; each edge server's cache forwards misses and UDI queries toward the origin database.]

This architecture has been adapted in many versions depending on the specifics of each system. The first type of edge-server architecture is edge computing infrastructures, where the application code is replicated at all edge servers and no cache is present [1, 23]. The database is kept centralized at the origin server, so all database queries are forwarded to the origin server. Although this technique allows the computation that generates pages to be distributed, it remains limited by the wide-area latency incurred for each query, and by the throughput bottleneck of the origin database.

To remove this bottleneck, various techniques have been proposed to cache the results of database queries at the edge servers [2, 4, 19, 27, 30]. Consistency of cached results must also be maintained when the underlying database is updated. This is usually done by requiring the application programmer to specify a number of query templates that are issued by the application. The programmer must also specify conflicts between templates. When a given UDI query template is invoked, all conflicting read templates are invalidated. This technique reduces database query latency, as a number of queries can be answered locally. The total system throughput is also increased because fewer queries are addressed to the origin server. However, database caching systems achieve good hit ratios only if the database queries exhibit high temporal locality and contain relatively few updates.

A common technique used to improve the performance of a database is replication. Traditional RDBMS replication solutions replicate the complete database across several machines located within a cluster [6, 17, 21, 24]. The incoming read query workload is distributed evenly across all the replicas. Consistency is maintained by applying UDI queries to all replicas using 2-phase commit protocols or snapshot isolation. Database replication improves throughput, as the incoming read query workload is shared among multiple servers. However, if the workload contains a significant fraction of UDI queries, then these systems suffer limited throughput, as all UDI queries must be applied to all replicas. As we show in our experiments, when the load of UDI queries alone is sufficient to overload any one of the servers, the system cannot improve its throughput any further.

A number of commercial systems such as Oracle keep database servers consistent not by executing UDI queries at each replica, but by executing these queries at the master server only and propagating update logs to the slave servers. Applying such logs is significantly more efficient than re-executing the queries. However, this technique does not improve the maximum throughput, as it requires a single master server to execute all UDI queries. The master server then determines the total system's throughput.

A few database systems based on peer-to-peer technologies have been proposed to distribute the query processing load across an arbitrary number of servers [9, 13]. These systems typically aim to achieve scalability by partitioning the dataset across peers. However, the features offered by these systems are rather limited, and handling complex queries that span the entire data set remains a challenge.

In [15], the authors propose an edge computing infrastructure where the application programmers can choose the data replication and distribution strategies that are best suited for the application. This approach can yield considerable gains in performance and availability, provided that the selected strategies are well suited to the application. However, coming up with the right strategies requires significant insight from the application programmers in domains such as fault tolerance and weak data consistency. Contrary to this approach, we strive to require minimum support from the application programmers, and try to keep replication as transparent to the application as possible. Note that the main focus of this paper is not replication for availability. We however return briefly to this issue in Section 6.2 to show how availability constraints can be taken into account in GlobeTP.

Some of our previous research already proposed to employ partial database replication to improve the performance of Web applications [26]. However, this architecture relies on record-level replication granularity. This design choice offers excellent query latency, but does not improve throughput, as a central server must maintain a copy of the full database (and therefore constitutes the throughput bottleneck of the system). Note that replication for throughput and replication for latency do not contradict each other. We show in Section 5.5 how GlobeTP can easily be coupled with a wide-area database caching system so that both throughput and latency can be improved at the same time.

[Figure 2: Architecture of a partially replicated origin server. Edge servers send queries to a query router, which dispatches them to database servers hosting, for example, tables 1,2; tables 2,3; and table 3.]

3. SYSTEM MODEL

3.1 Application Model

Web applications are usually implemented by some business logic running in an application server, which is invoked each time an HTTP request is received. This business logic, in turn, may issue any number of SQL queries to the underlying database. Queries can be used to read information from the database, but also to update, delete or insert information. We refer to the latter as UDI queries. We assume that the queries issued by a given Web application can be classified as belonging to a relatively small number of query templates. A database query template is a parametrized SQL query whose parameter values are passed to the system at runtime. This scheme is deployed, for example, using Java's prepared statements. Examples of parametrized read query templates include QT1: "SELECT price, stock, details FROM book WHERE id=?" and QT2: "SELECT price, stock, details FROM book, author WHERE book.name LIKE (?) AND author.name = ?". An example of an update template is UT1: "UPDATE book SET price=price+1 WHERE id=?". In our current implementation, we require the application developer to explicitly define the templates.
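As a hedged illustration (not the paper's actual implementation, which uses Java prepared statements), such a template registry might pair each template identifier with its parametrized SQL, its read/UDI classification, and the set of tables it accesses:

```python
# Hypothetical template registry for the examples above. Each entry maps a
# template identifier to (parametrized SQL, kind, tables accessed).
TEMPLATES = {
    "QT1": ("SELECT price, stock, details FROM book WHERE id=?",
            "read", {"book"}),
    "QT2": ("SELECT price, stock, details FROM book, author "
            "WHERE book.name LIKE (?) AND author.name = ?",
            "read", {"book", "author"}),
    "UT1": ("UPDATE book SET price=price+1 WHERE id=?",
            "udi", {"book"}),
}

def classify(template_id):
    """Return the (kind, tables) attributes of a registered template."""
    _sql, kind, tables = TEMPLATES[template_id]
    return kind, tables
```

These two attributes, the read/UDI kind and the table set, are exactly what the placement analysis of Section 4 consumes.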
However, our implementation can easily be extended to apply static analysis techniques that identify the query templates automatically [19]. The explicit definition of query templates is at the basis of several database caching systems, as it allows an easy definition of cache invalidation rules [2, 19, 27]. In contrast, the work presented in this paper uses the same notion of query templates in a different manner: we exploit the list of templates to derive table placements that guarantee that at least one server is able to execute each query from the application.

3.2 System Model

The aim of our work is to increase the scalability of the origin database depicted in Figure 1 in terms of the maximum throughput it can sustain, while maintaining reasonable query execution latency. As shown in Figure 2, in our system the origin database is implemented by an array of database servers. We assume that all origin database servers are located within a single data center. The replication granularity in our system is the database table, so each database server hosts a replica of one or more tables from the application. Since not all servers contain all the data, each query must be executed at a server that has all the necessary data locally. This is ensured by the query router, which receives all incoming queries and routes each query to a server that contains all the tables needed to answer it. To this end, the query router knows the current placement of tables onto database servers. It is also in charge of maintaining the consistency of the replicated databases. It issues UDI queries to all the servers that hold the tables to be modified; if all queries are successful then the operation is committed, otherwise it is rolled back. This simple algorithm guarantees sequential consistency. In our implementation, all read and write queries are first received by the query router, which in turn executes each query at an appropriate replica.
Since all UDI queries are queued in a single location at the query router, the router serves as a serialization point and maintains consistency across replicas. Note that the current implementation of our system does not support transactions. Instead, GlobeTP treats each read and write query as an independent operation. We believe that this is not a major restriction, as most Web applications do not require transactional database behavior. However, should transactions be required, adding support for them to our system would be relatively straightforward. Since the query router receives all incoming queries before they are executed, it can also act as the transaction monitor and implement any classical protocol such as two-phase commit. We do not, however, investigate this issue further in this paper.

Our work focuses on the structure of a Web application's origin database. Consequently, we make no assumption about the origin of the queries addressed to our system. In the simplest setup, queries can be issued directly by one or more application servers. However, our system can also easily be coupled with a distributed database query cache such as DBProxy [2] or GlobeCBC [27]. In this case, the same definition of query templates can be used both by the caching system to maintain consistency and by the origin database to optimize throughput.

3.3 Issues

The work presented in this paper is motivated by the observation that the explicit definition of a Web application's query templates allows the placement of partially replicated data to be selected such that the total system throughput is optimized. Such knowledge allows us to avoid two pitfalls that generic replicated databases necessarily face. First, application-unaware database systems do not know in advance the full set of query templates that will be issued to them.
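A minimal sketch of this dispatch rule follows, with illustrative server and table names (the real router is a Java server; this Python fragment is only a reading aid):

```python
# Sketch of the query router's dispatch rule: a read query goes to one
# server holding every table it touches; a UDI query goes to every server
# holding a table it modifies. The placement below is illustrative.
PLACEMENT = {
    "db1": {"book", "author"},
    "db2": {"book"},
}

def route(kind, tables):
    """Return the list of servers that must receive this query."""
    if kind == "read":
        capable = [s for s, held in PLACEMENT.items() if tables <= held]
        return capable[:1]  # any one capable server suffices
    # UDI: all servers holding any of the modified tables must apply it;
    # the operation commits only if every one of them succeeds.
    return [s for s, held in PLACEMENT.items() if tables & held]
```

Under this placement, a read touching both book and author can only go to db1, whereas an update of book must be applied at both db1 and db2.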
In particular, this means that it is impossible for them to determine a priori which tables must be kept together and which ones can be separated. Generic database systems usually address this issue by supporting only full replication, so that the data necessary to answer any query are always available in the same place. However, this has an important impact on the system's throughput. Second, the middleware that determines which replica should treat each read query does not take query characteristics into account. However, the execution times of different queries issued by a given Web application may vary by several orders of magnitude. In such a context, simple round-robin algorithms may not lead to optimal load balancing.

To determine the placement of database tables on replica servers that sustains the highest throughput, we must solve three main issues. First, not all possible placements of tables onto servers leave at least one server capable of executing each of the application's query templates. We therefore need to analyze the set of query templates to determine the subset of placements that are functionally correct. Second, we must take into account the respective query execution times of the different templates, and their classification as read or UDI queries, to determine the best placement in terms of throughput. Besides requiring accurate estimations of query execution times, finding the optimal placement incurs a huge computational complexity, even for relatively small systems. We therefore need a good heuristic. Finally, once the resulting system is instantiated, we need to define a load balancing algorithm that allows the query router to distribute read queries efficiently across the servers that can treat them.
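To make the first issue concrete, the functional-correctness condition can be sketched as follows (a hedged Python sketch; the function and variable names are illustrative, not from the paper):

```python
# A placement is functionally correct iff, for every query template, at
# least one server hosts ALL tables that the template accesses.
def is_functionally_correct(placement, template_tables):
    """placement: server -> set of hosted tables;
    template_tables: template id -> set of tables accessed."""
    return all(
        any(tables <= held for held in placement.values())
        for tables in template_tables.values()
    )

# Illustrative check with the QT1/QT2 templates of Section 3.1:
TEMPLATES = {"QT1": {"book"}, "QT2": {"book", "author"}}
```

For instance, splitting book and author onto different servers with no server holding both would be rejected, since QT2 could then be answered nowhere.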
4. DATA PLACEMENT

The underlying idea behind our approach is to partially replicate the database so that the UDI workload can be split across different replicas. This process involves the following three steps: (i) cluster identification: determining the sets of database tables that need to be replicated together; (ii) load analysis: determining the load received by each cluster; and (iii) cluster placement: determining the placement of the identified clusters across the set of database servers so that the load incurred by each database replica is minimized.

4.1 Cluster Identification

Our system relies on the placement of individual tables on database servers to minimize the number of servers that must process UDI queries. However, not all placements are functionally correct, as all tables accessed by a query template must be present on the same server for the query to be executed. The goal of cluster identification is to determine sets of tables that must be placed together on at least one server, such that there is at least one server where each query template can be executed. We must characterize each query template with two attributes: (i) whether it is a read or a UDI query; and (ii) the set of tables (also called a table cluster) that it accesses. For instance, among the aforementioned query templates, template QT1 is associated with the single-table cluster {book}, while QT2 is associated with {book, author}. Clusters can overlap, as a table can belong to multiple clusters. The problem of finding functionally correct table placements can then be reduced to a cluster placement problem: any table cluster placement will be functionally correct.

4.2 Load Analysis

Even though any cluster placement leads to a functionally correct system, not all placements lead to the same system throughput. To maximize throughput, it is crucial that no database server is overloaded. In other words, we need to place the table clusters such that we minimize the load of the most loaded server. This process is done in two steps. First, we evaluate the load imposed on each of the identified clusters for a representative workload. Second, we identify the placement that creates the best repartition of load across the servers.

4.2.1 Estimation of Load on Table Clusters

The load that each table cluster imposes on the server(s) where it is hosted depends on three factors: (i) the classification of each template as read or UDI: read queries can be executed on one server, while UDI queries must be applied on all servers holding the corresponding cluster; (ii) the occurrence frequency of the template in the expected workload; and (iii) the computational complexity of executing the query on a given database server. Classifying queries as read or UDI can be done by simple query analysis. Similarly, the occurrence frequency of templates can be derived from observation of an existing workload. However, estimating the load that each query imposes on the database where it runs requires careful attention.

Mature database systems such as MySQL and PostgreSQL make their own estimations of the internal execution cost of different queries as part of their query optimization procedure. These estimations are made available, for example, through PostgreSQL's EXPLAIN and EXPLAIN ANALYZE commands. However, they take only the actual execution time into account, and ignore other factors such as the connection overhead. Another possible method consists of simply measuring the response time of each query template in a live system. The advantage of this method is that it measures the end-to-end response time of the database tier, including connection overhead. Note that query execution times should be measured under low load to avoid polluting the measurements with load-related overheads such as queuing latency [29].

Figure 3 shows the accuracy of the three cost estimation techniques applied to the query templates of the RUBBoS benchmark [25]. In each graph we estimated the cost of each query template and compared it with the actual execution time under high load. A perfect estimator would produce points located on the y = x diagonal line. Clearly, the estimations produced by the database query optimizers are not as accurate as actual measurements made under low load. In the rest of this paper we therefore restrict ourselves to this last method.

[Figure 3: Accuracy of different methods for query cost estimation. Each panel plots estimated against real execution cost (seconds): (a) EXPLAIN query; (b) EXPLAIN ANALYZE query; (c) measurement under low load.]

4.2.2 Estimation of Load on Database Servers

In a replicated database, read queries are executed at one database node, whereas UDI queries are executed at all nodes that hold the data modified by the UDI query. To determine the load that each database server will incur for a given table placement and a given query workload, we must distinguish the two types of queries. Each UDI query in the studied workload results in applying the associated execution cost to each of the database servers holding the corresponding table(s). On the other hand, each read query creates execution cost on only one server; we count that, on average, each database server holding the corresponding cluster incurs the execution cost of the query, divided by the number of replicas. This analysis allows us to roughly compute the execution cost that each database server will incur for a given table placement and a given query workload. To maximize the system throughput, it is essential that no database server is overloaded. We therefore aim at balancing the load such that the cost of the most loaded server is minimized.

4.3 Cluster Placement

Finding the optimal table placement could be realized by iterating through all valid table placements, evaluating the respective cost of each database server under each placement, and selecting the best one. However, the computational complexity of this exhaustive search is O(2^(NT)/N!), where T is the number of table clusters to be placed and N is the number of nodes to place them on. This very high complexity makes exhaustive search impractical even for relatively small system sizes. We must therefore find a heuristic instead.

As shown in Algorithm 1, our heuristic starts with a very roughly balanced placement, and iteratively tries to improve it by applying simple transformations to the table placement. The first step (step 1) is to place clusters uniformly onto servers to create an initial configuration. The heuristic then iteratively attempts to improve the quality of the placement (steps 3-8). Since the goal is to find the placement where the maximum server load is minimized, we identify the most loaded server (step 5) and try to offload some of its clusters to other servers (steps 6 and 7). Two techniques can be used here (step 7): either migrating one of the clusters to another server (thereby offloading the server of the whole associated load), or replicating one of the clusters to another server (thereby offloading the server of part of the read query load). The algorithm evaluates all possible operations of this type, and checks whether one of them improves the quality of the placement. This operation is repeated until no more improvement can be gained (step 4). The most loaded server, which cannot be offloaded any further without overloading another one, is considered to be in its 'optimal' state and is removed from the working set of servers (step 8). The algorithm then tries to optimize the load of the second most loaded server, and so on until all servers have reached their 'optimal' state. Even though there is no guarantee that this heuristic finds the optimal placement, in our experience it always identifies a reasonably good placement within seconds (whereas the full search algorithm would take days).

Algorithm 1: Pseudocode of the table cluster placement algorithm.
  1  Distribute table clusters uniformly across server nodes;
  2  S = set of server nodes;
  3  while S is not empty do
  4      /* We want to minimize the maximum server load */
         while max(estimated server load) is decreased do
  5          N = most loaded server in S;
  6          foreach table cluster C placed on N do
  7              try to reduce N's load by migrating or replicating C
                 to other servers;
             end
         end
  8      S = S - (the most loaded server in S);
     end

4.4 Query Routing

Query routing is an important issue that affects the performance of replicated databases. Simple round-robin schemes are efficient only when all incoming queries have similar costs. However, when an application issues queries with very different execution costs, round-robin scheduling can lead to load skew across database servers, resulting in poor access latencies. Query routing gains higher significance in GlobeTP. In a partially replicated database system such as ours, queries can no longer be sent to arbitrary database nodes. The query router used in GlobeTP thus differs from the traditional query routers used in fully replicated databases in the following respects. First, read queries can be scheduled only among a subset of the database servers instead of all servers. Second, UDI queries must be executed at all database servers that store the tables modified by the incoming UDI query. The process of selecting a database server to route an incoming read query has a considerable impact on the overall performance of the system. This is usually determined by the routing policy adopted by the replication system. In our work, we experimented with the following policies.

4.4.1 RR-QID: Round-Robin per Query ID

RR-QID is an extension of the round-robin policy that is suitable for partially replicated databases. In this policy, the query router maintains a separate queue for each query template, identified by its query identifier (QID). Each queue is associated with the set of database servers that can serve incoming queries of type QID. Each incoming read query is then scheduled among the candidate servers (associated with its queue) in a round-robin fashion.

10% have moderator privileges, and 200,000 comments. The size of the database is approximately 1.5 GB. The application defines 36 read and 8 UDI query templates. In our experiments, we used the default user workload, which generates 0.76% of UDI queries. The client workload for both applications is generated by Emulated Browsers (EBs). The run-time behavior of an EB models a single active client session. Starting from the home page of the site, each EB uses a Customer Behavior Graph Model (a Markov chain with Web pages acting as nodes and navigation action probabilities as edges) to navigate among Web pages and perform a sequence of Web interactions.
The behavior model also incorporates a think time parameter that controls how long an EB waits between receiving a response and issuing the next request, thereby modeling a human user more accurately. We set the average think time to 6 seconds.

To generate flexible yet reproducible workloads, we run each benchmark under relatively low load (i.e., with 30 to 100 EBs) multiple times and collect the corresponding database query logs. Query logs from different runs can then be merged to generate higher-load scenarios. For instance, to evaluate the performance of our system for 600 EBs, we merge the query logs from six different runs with 100 EBs each, and stream the result to the query router. This allows us to study the performance of the database tier independently of the other tiers.

The query router is implemented as a stand-alone server written in Java. It maintains a pool of connections to each database server and schedules each incoming query according to the adopted routing policy. Database servers run PostgreSQL version 8.1.3. Both full and partial database replication are performed at the query router level, as described in Section 3.2.

All our experiments are performed on a Linux-based server cluster. Each server is configured with a dual-processor Pentium III 900 MHz CPU, 2 GB of memory and a 120 GB IDE hard disk. The servers are connected to each other by a gigabit LAN, so the network latency between them is negligible.

4.4.2 Cost-Based Routing

The underlying idea behind the cost-based routing policy is to use the execution cost estimations to balance the load among database servers. Upon arrival of an incoming query, the query router first estimates the current load of each database server, and then schedules the incoming query to the least loaded database server (that also has the required set of tables).
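A hedged sketch of this policy's bookkeeping follows (illustrative names; the actual router is the Java server described in Section 5.1):

```python
# Sketch of the cost-based policy: track the estimated costs of queries
# dispatched to each server and still awaiting a response; approximate a
# server's load by their sum, and route each new query to the least
# loaded server among those holding the required tables.
outstanding = {"db1": [], "db2": []}  # server -> costs of in-flight queries

def dispatch(capable_servers, estimated_cost):
    """Pick the least loaded capable server and record the new query."""
    target = min(capable_servers, key=lambda s: sum(outstanding[s]))
    outstanding[target].append(estimated_cost)
    return target

def on_response(server, estimated_cost):
    """When the reply arrives, the query no longer contributes load."""
    outstanding[server].remove(estimated_cost)
```

With this scheme an expensive in-flight query keeps deflecting new work to other replicas until its response arrives, which is exactly the skew that plain round-robin cannot avoid.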
To this end, the query router maintains a list of queries that have b een dispatched to each database server and still awaiting resp onse. This list contains the list of queries currently under (or awaiting) execution at a database server. Subsequently, the load of a database server is approximated as the sum of the estimated cost of these queries. Finally, the server with least cost is scheduled to execute the next incoming query. We show the resp ective p erformance of these two routing p olicies in the next section. 5. PERFORMANCE EVALUATION In this section, we compare the p erformance of full database replication to Glob eTP for two well-known Web application b enchmarks. TPC-W is a standard e-commerce b enchmark that models an online b ookstore such as amazon.com [28], while RUBBoS is a bulletin-b oard b enchmark modeled after slashdot.org [25]. We selected these two applications for their different data access characteristics. For example, in a typical news forum, most users are interested in the latest news. On the other hand, in a b ookstore application, the shopping interests of customers can b e very diverse, thereby leading to much lower query locality. This allows us to study the b ehavior of our systems for different data access patterns. In addition to these exp eriments, we also evaluate the b enefit of adding a database query caching layer to Glob eTP. 5.2 Potential Reductions of UDI Queries One imp ortant goal of Glob eTP is to reduce the replication degree of individual database tables to reduce the numb er of UDI queries to b e processed. However, the extent to which this is feasible greatly dep ends on the query templates issued by the application and the workload distribution. Figure 4(a) shows to which extent table-level partial replication allows to reduce the numb er of UDI queries to b e processed, expressed as the ratio of UDI queries to execute b etween full and partial replication. The higher the ratio, the greater the gain. 
Obviously, with just one server to host the database, partial and full replication are identical so the ratio is equal to 1. As the numb er of servers increases, partial replication allows to reduce the numb er of UDI queries by a ratio close to 3 for b oth applications. To evaluate more accurately the p otential load reduction that we can exp ect from partial replication, we should take into account the resp ective estimated costs of different query templates, as well as the read queries from the workload. Figure 4(b) shows the reduction ratio of estimated costs imp osed on each server, for different numb ers of servers. As we can see, the reduction factor is much lower than when count- 5.1 Experimental Setup The TPC-W application uses a database with seven tables, which are queried by 23 read and 7 UDI templates. In our exp eriments the database is initially filled with 288, 000 customer records. Other tables are scaled according to the TPC-W requirements. TPC-W defines three standard workload mixes that exercise different parts of the system: `browsing' generates 5% up date interactions; `shopping' generates 20% up date interactions; and `ordering' generates 50% update interactions. We use the `shopping' mix, which results in a database workload containing 5.6% of UDI queries1 . The RUBBoS application consists of a set of PHP scripts and a database containing five tables regarding users, stories, comments, submissions and moderator activities. The database is initially filled with 500, 000 users, out of which 1 Note that one must distinguish the up date interactions from the UDI queries. Up date interactions are usergenerated HTTP requests that lead to at least one UDI query, plus any numb er of read queries. Since Glob eTP only op erates at the database query level, the prop ortion of up date interactions is irrelevant here. 
306 WWW 2007 / Track: Performance and Scalability Reduction ratio of UDI query number 3 2.5 2 1.5 1 0.5 0 0 5 TPC-W RuBBoS 10 15 20 Number of servers Reduction ratio of total execution cost 2 1.8 1.6 1.4 1.2 1 0.8 0 10 Session: Scalable Systems for Dynamic Content Reduction ratio of total execution cost 1.2 1.15 1.1 1.05 1 0.95 TPC-W RuBBoS 3 4 5 6 7 Number of servers TPC-W RuBBoS 20 30 40 Number of servers 50 25 1 2 8 (a) Reduction of the numb er of UDI queries. (b) Reduction of estimated execution cost p er server. (c) Reduction of estimated execution cost p er server (zoom). Figure 4: Potential Reduction of UDI queries. ing UDI queries alone. The reason is that b oth workloads are dominated by read queries, which are equally spread in full and partial replication. The exp eriments describ ed in the remaining of this pap er are based on configurations using up to 8 database servers. Figure 4(c) shows the resp ective p otential of partial replication for b oth b enchmarks under these conditions. TPC-W shows a relatively good p otential, up to 15% reduction in workload p er server. On the other hand, RUBBoS has a lower p otential. This is mainly due to the fact that RUBBoS generates very few UDI queries; reducing their numb er even further by ways of partial replication can therefore have only a limited impact. Note that other workloads may show somewhat different b ehavior. For example, RUBBoS defines a workload where search queries are disabled. Since searches are implemented as very exp ensive read queries, removing them from the workload mechanically improves the cost ratio of UDI queries and thereby the gains to b e exp ected from Glob eTP. Conversely, we cannot exclude that other Web applications may dictate to keep all database tables together, making our form of partial replication equivalent to full replication. For these, Glob eTP will not provide any improvement unless the application itself is up dated (see Section 6.1). 
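The reduction ratio plotted in Figure 4(a) is straightforward to compute from a candidate placement. The sketch below is illustrative only (all names are hypothetical, not taken from GlobeTP's implementation): it counts how many times each UDI template must execute under full versus partial replication.

```python
# Sketch: UDI reduction ratio of a table placement (cf. Figure 4(a)).
# All names are illustrative; this is not the paper's implementation.

def udi_reduction_ratio(num_servers, udi_templates, placement, workload):
    """udi_templates: template id -> set of tables it writes.
    placement: table -> set of servers hosting a replica.
    workload: template id -> number of queries of that template."""
    # Full replication: every UDI query executes on every server.
    full = sum(workload[t] * num_servers for t in udi_templates)
    # Partial replication: a UDI query runs only on the servers
    # that hold a replica of some table it writes.
    partial = 0
    for t, tables in udi_templates.items():
        replicas = set().union(*(placement[tbl] for tbl in tables))
        partial += workload[t] * len(replicas)
    return full / partial

# Example: 4 servers, two single-table write templates.
placement = {"orders": {0, 1}, "comments": {2}}
templates = {"new_order": {"orders"}, "new_comment": {"comments"}}
workload = {"new_order": 100, "new_comment": 300}
print(udi_reduction_ratio(4, templates, placement, workload))  # → 3.2
```

The ratio saturates once every table cluster is pinned to a minimal set of servers, which is why the curves in Figure 4(a) flatten out near 3.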
In this paper we focus on standard benchmarks, which offer real yet limited potential for our system. However, as we will see in the following sections, even the relatively modest reductions in estimated costs shown here allow for significant gains in execution latency and in total system throughput.

5.3 Effect of Partial Replication and Template-Aware Query Routing

To illustrate the benefits of GlobeTP, we measured the query execution latencies of read and UDI queries together using different configurations. For each of the two benchmarks, we compared the performance of full replication, GlobeTP using RR-QID query routing, and GlobeTP using cost-based query routing. In all cases we used 4 database servers and one query router. We selected a load of 900 EBs for TPC-W and 330 EBs for RUBBoS, so that the tested configurations would be significantly loaded.

Figure 5 shows the cumulative latency distributions from both sets of experiments. As one can see, in both cases GlobeTP processes queries with a much lower latency than full replication. For example, in RUBBoS GlobeTP processes 40% more queries than full replication within 10 ms. In TPC-W, GlobeTP processes 20% more queries within 10 ms than full replication.

In TPC-W, the RR-QID query routing policy delivers better performance than its cost-based counterpart. This can be explained by the fact that in TPC-W the costs of different query templates are relatively similar. The unavoidable inaccuracy of our cost estimations therefore generates unbalanced load across servers, which leads to sub-optimal performance. On the other hand, RR-QID is very effective at balancing the load when queries have similar costs. In RUBBoS, GlobeTP combined with cost-based routing outperforms both other configurations. In this case, the costs of different queries vary by three orders of magnitude (as shown in Figure 3(c)), and cost-based routing works well because even relatively coarse-grained estimations of the cost of each query help avoid issuing simple queries to already overloaded servers. In the following experiments we restrict ourselves to the most effective routing policy for each application: RR-QID for measurements of TPC-W, and cost-based routing for RUBBoS.

One should note that GlobeTP has a greater effect on latency in the case of RUBBoS than for TPC-W. This may seem contradictory with the results from the previous section. However, the latency and the throughput of a given system are not necessarily correlated. As we will see in the next section, the throughput improvements that GlobeTP provides are significantly greater for TPC-W than for RUBBoS.

5.4 Achievable Throughput

To evaluate the scalability of our approach, we measured the maximum sustainable throughput of different approaches when using identical hardware resources. We first set a performance target in terms of query execution latency: in our experiments we aim at processing at least 90% of database queries within 100 ms. Note that this performance target is quite challenging, as several query templates have execution times greater than 100 ms, even under low loads (see Figure 3(c)). We then exercise system configurations with full and partial replication, and increase the workload in steps of 50 EBs for TPC-W and 30 EBs for RUBBoS. For each configuration we record the maximum number of EBs that it can serve while respecting the latency target. The results are shown in Figure 6.
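The two routing policies compared above differ only in how the target server is chosen for each incoming query. The following sketch of the cost-based policy is illustrative Python, not the paper's Java router; it shows the core bookkeeping: the router tracks the estimated cost of queries still in flight at each server and dispatches to the least loaded candidate that hosts the required tables.

```python
# Sketch of the cost-based routing policy (Section 4.4.2).
# Names and structures are illustrative, not the paper's implementation.
from collections import defaultdict

class CostBasedRouter:
    def __init__(self, hosted_tables, template_cost):
        self.hosted_tables = hosted_tables   # server -> set of hosted tables
        self.template_cost = template_cost   # template id -> estimated cost
        self.in_flight = defaultdict(float)  # server -> summed cost of pending queries

    def route(self, qid, tables_needed):
        # Only servers holding every table the template touches are candidates.
        candidates = [s for s, tables in self.hosted_tables.items()
                      if tables_needed <= tables]
        # Server load = total estimated cost of dispatched, still-pending queries.
        server = min(candidates, key=lambda s: self.in_flight[s])
        self.in_flight[server] += self.template_cost[qid]
        return server

    def complete(self, server, qid):
        # Called when the server's response arrives.
        self.in_flight[server] -= self.template_cost[qid]
```

Under this model, RR-QID is the degenerate case where every template is assiged the same cost and dispatch simply rotates over the candidate set, which explains why it wins when template costs are nearly uniform, as in TPC-W.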
[Figure 5: Query latency distributions using 4 servers. Cumulative distribution of query execution latency (0.001 s to 10 s, log scale) for full replication, GlobeTP with RR-QID, and GlobeTP with cost-based routing. (a) TPC-W, 900 EBs; (b) RUBBoS, 330 EBs.]

[Figure 6: Maximum achievable throughputs, in numbers of emulated browsers, with 90% of queries processed within 100 ms, for 1 to 8 servers. (a) TPC-W; (b) RUBBoS.]

In TPC-W, one server alone can sustain up to 50 EBs. As we increase the number of database servers, partial replication performs significantly better than full replication. In particular, the maximum throughput of the fully replicated system does not improve with more than four servers. This corresponds to the point where the treatment of UDI queries alone saturates each server, and can be explained by the fact that the execution time of a UDI query is typically an order of magnitude higher than that of a simple read query. GlobeTP, on the other hand, can sustain up to 150% higher throughput while using identical hardware resources. Unlike full replication, it is capable of exploiting 8 servers to sustain a higher throughput than when using only 4. This is due to the fact that each server has fewer UDI queries to process, and thereby experiences lower load and better execution latency.

In RUBBoS, GlobeTP again performs better than full replication, yet with a smaller difference. With 4 and 8 servers, GlobeTP sustains 120 more EBs than full replication, which amounts to a 57% throughput improvement.
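The measurement procedure behind Figure 6 — increase the load in fixed steps and retain the highest level that still meets the 90%-within-100 ms target — can be sketched as follows. Here `run_benchmark` is a hypothetical stand-in for one experiment run at a given load level; it returns the observed query latencies in seconds.

```python
# Sketch of the maximum-throughput search of Section 5.4 (illustrative only).
# run_benchmark(load) must return the list of query latencies observed
# when driving the system with `load` emulated browsers.

def max_sustainable_load(run_benchmark, step, target_latency=0.1, quantile=0.9):
    load, best = step, 0
    while True:
        latencies = sorted(run_benchmark(load))
        # Latency of the query at the chosen percentile of this run.
        p = latencies[int(quantile * len(latencies)) - 1]
        if p > target_latency:
            return best      # the previous load level was the maximum
        best = load
        load += step
```

For example, with a (fake) system whose latencies grow linearly with load, `max_sustainable_load(lambda load: [load / 1000.0] * 100, 50)` stops at the last step whose 90th-percentile latency stays within 100 ms.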
Given that RUBBoS generates very few UDI queries, little improvement can be gained by further reducing their number with partial replication. In this case, the major reason for the throughput improvement is the cost-aware query routing policy, which takes the relative costs of different queries into account to better balance the load between servers.

5.5 Effect of Query Caching

As noted in Section 3.2, GlobeTP can easily be coupled with a database query caching system, as most query caching systems rely on the same assumption as GlobeTP regarding the explicit definition of query templates. However, GlobeTP focuses on improving the throughput of the application's origin database, while query caching systems aim at reducing the individual query execution latencies. We therefore consider that both types of system complement each other very well. As a side effect, a query caching system can also improve the system throughput, as it prevents a number of read queries from being issued to the origin database.

In our experiments, we use our own in-house query caching solution, GlobeCBC [27]. GlobeCBC acts as a very simple object cache: unlike other similar systems, it does not attempt to merge multiple query results into a single view of the database. Instead, GlobeCBC stores the result of each query independently from the others. This allows for very fast processing, and facilitates the execution of the replacement policy when the size of the cached items exceeds a given limit. In our experiments, we limited the cache size to approximately 5% of the size of the database itself.

Table 1 shows the effect of adding a single cache server in front of the query router when using four database servers.

Table 1: Maximum throughput of different configurations.

                                    TPC-W     RUBBoS
    Full replication (4 servers)    200 EBs   150 EBs
    GlobeTP (4 servers)             450 EBs   210 EBs
    GlobeTP (4 servers) + 1 cache   600 EBs   330 EBs
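To make the caching model concrete, a content-blind cache in the spirit of GlobeCBC can be sketched as a map from query string to stored result, with LRU replacement and template-driven invalidation. This is an illustrative sketch under an assumed conflict relation between templates, not GlobeCBC's actual implementation (see [27] for the real design).

```python
# Illustrative sketch of a content-blind query result cache à la GlobeCBC [27]:
# results are stored per query string and never merged; a UDI template
# invalidates every cached query whose read template conflicts with it.
from collections import OrderedDict

class ContentBlindCache:
    def __init__(self, capacity, conflicts):
        self.store = OrderedDict()   # query string -> result, in LRU order
        self.capacity = capacity
        self.conflicts = conflicts   # UDI template id -> set of read template ids
        self.template_of = {}        # query string -> read template id

    def lookup(self, query):
        if query in self.store:
            self.store.move_to_end(query)   # mark as most recently used
            return self.store[query]
        return None                          # miss: go to the origin database

    def insert(self, query, template_id, result):
        self.store[query] = result
        self.template_of[query] = template_id
        if len(self.store) > self.capacity:
            old, _ = self.store.popitem(last=False)   # evict the LRU entry
            del self.template_of[old]

    def invalidate(self, udi_template_id):
        stale = self.conflicts.get(udi_template_id, set())
        for q in [q for q, t in self.template_of.items() if t in stale]:
            del self.store[q]
            del self.template_of[q]
```

Because entries are independent, a hit costs one dictionary lookup and invalidation never requires inspecting query results, which is what makes this approach "content-blind" and fast.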
In TPC-W, the cache had a hit rate of 18%. This relatively modest hit rate is due to the fact that the standard TPC-W workload has very low query locality compared to real e-commerce sites [3]. However, even in this case the system throughput is increased by 33%, from 450 to 600 EBs. Unlike TPC-W, the RUBBoS workload has quite high database query locality. In this case the query cache delivers a 48% hit ratio, which effectively increases the throughput by 57%, from 210 to 330 EBs. This result is quite remarkable considering that search queries, which are by far the most expensive ones in the workload, are based on random strings and are therefore always passed to the origin database for execution.

6. DISCUSSION

6.1 Potential of Query Rewriting

This paper demonstrates that relatively simple techniques can significantly improve the throughput of standard benchmarks, without requiring any modification to the applications themselves. However, we believe that additional throughput can be gained from simple changes to the application implementation. The main limitation of table-granularity partial replication comes from database queries that span multiple tables. As the business logic of an application becomes more complex, we can expect more join queries to be introduced. Such queries oblige the table placement algorithm to place all relevant tables together on at least one server, which in turn increases the replication degree and reduces the maximum throughput.

Of course, queries spanning multiple tables are occasionally indispensable to the application. But we have observed in the TPC-W and RUBBoS benchmarks that many such queries can easily be rewritten as a sequence of simpler queries spanning one table each. One simple example is the following query from TPC-W, which aims at obtaining the most recent order issued by a particular customer: "SELECT o_id FROM customer, orders WHERE customer.c_id = orders.o_c_id AND c_uname = ? ORDER BY o_date, orders.o_id DESC LIMIT 1". This query spans two tables. However, the customer table is used here only to convert a full-text username into a user ID, after which the most recent order issued from this ID can be searched for in the orders table. It is then trivial to rewrite the application to first issue a query to the customer table alone, and then another one to search for the most recent order. We found such unnecessarily complex queries to be very frequent in the applications that we studied. Rewriting them into multiple simpler queries can only reduce the constraints put on the table placement, and therefore result in higher throughput.

6.2 Fault-Tolerance

Although the main focus of this paper is not replication for availability, one cannot ignore this issue. With an increased number of server machines involved in a given application, the probability that one of them fails grows quickly. However, the most general form of fault-tolerance for this kind of system cannot be realized, as providing both consistency and availability in the presence of network partitions is impossible [5, 16]. On the other hand, if we ignore the possibility of network partitions and restrict ourselves to server failures, then the problem has an elegant solution. To guarantee that the partially replicated database remains able to serve all the expected query templates, it is essential that each query template be available at one or more servers. Therefore, to tolerate the failure of at most N servers, one only has to make sure that each query template is placed on at least N+1 servers. This requires database replication algorithms suitable for fault-tolerance, which is a well-understood problem. Replicating the query router is also relatively simple, as long as the applications do not need transactional guarantees. As the query router does not maintain any dynamically updated state that is essential for application correctness, no state consistency between multiple instances of the query router needs to be implemented.

Starting from a configuration designed for throughput only, planning for fault-tolerance can be done in two different ways. First, one may keep the number of servers unchanged but artificially increase the replication degree of table clusters across the existing machines. However, this will likely degrade system throughput, as more UDI queries must be processed per server. Alternatively, one may provision additional servers and adjust the table placement to keep the worst-case throughput constant. As long as not too many servers fail, this configuration will exceed its throughput requirements, which may have the desirable side effect of protecting the Web site, to a certain extent, against unexpected variations in load.

7. CONCLUSION

In this paper we have presented GlobeTP, which exploits table-granularity partial database replication to optimize system throughput. This solution relies on the fact that the query workload of Web applications is composed of a restricted number of query templates, which allows us to determine efficient data placements. In addition, the identification of query templates allows for efficient query routing algorithms that take the respective query execution costs into account to better balance the query load. In our experiments, these two techniques increase the system throughput by 57% to 150% compared to full replication, while using an identical hardware configuration.

To increase the system's scalability even further, a natural extension is to combine GlobeTP with a database query caching system such as GlobeCBC, as both systems rely on the same definition of query templates. These systems complement each other very well: query caching improves the execution latency in a wide-area setting by filtering out many read queries, while GlobeTP shows its best potential for improving throughput under workloads that contain many UDI queries. In our experiments, the addition of a single query cache improves the achievable throughput by approximately 30% to 60%.

The work presented in this paper does not take into account the long-term load variations that must be expected when operating a popular dynamic Web site. Adapting the system capacity without any service interruption requires dynamic database provisioning, which is a very difficult problem that is only beginning to be addressed [8]. In the near future we plan to study whether knowledge of query templates can help here.

8. REFERENCES

[1] Akamai EdgeSuite. http://www.akamai.com/en/html/services/edgesuite.html.
[2] K. Amiri, S. Park, R. Tewari, and S. Padmanabhan. DBProxy: A dynamic data cache for Web applications. In Proc. Intl. Conf. on Data Engineering, pages 821-831, March 2003.
[3] M. Arlitt, D. Krishnamurthy, and J. Rolia. Characterizing the scalability of a large web-based shopping system. ACM Transactions on Internet Technology, 1(1):44-69, August 2001.
[4] C. Bornhövd, M. Altinel, C. Mohan, H. Pirahesh, and B. Reinwald. Adaptive database caching with DBCache. Data Engineering, 27(2):11-18, June 2004.
[5] E. A. Brewer. Towards robust distributed systems (abstract). In Proc. ACM Symp. on Principles of Distributed Computing, July 2000.
[6] E. Cecchet. C-JDBC: a middleware framework for database clustering. Data Engineering, 27(2):19-26, June 2004.
[7] J. Challenger, P. Dantzig, A. Iyengar, and K. Witting. A fragment-based approach for efficiently creating dynamic web content. ACM Transactions on Internet Technology, 5(2):359-389, May 2005.
[8] J. Chen, G. Soundararajan, and C. Amza. Autonomic provisioning of databases in dynamic content web servers. In Proc. Intl. Conf. on Autonomic Computing, Dublin, Ireland, June 2006.
[9] Z. Chen, Z. Huang, B. Ling, and J. Li. P2P-Join: A keyword based join operation in relational database enabled peer-to-peer systems. In Proc. Intl. Conf. on Database and Expert Systems Applications, September 2006.
[10] A. Datta, K. Dutta, H. Thomas, D. VanderMeer, Suresha, and K. Ramamritham. Proxy-based acceleration of dynamically generated content on the world wide web: an approach and implementation. In Proc. ACM SIGMOD/PODS Conf., pages 97-108, June 2002.
[11] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. Globally distributed content delivery. IEEE Internet Computing, 6(5), September-October 2002.
[12] B. Fitzpatrick. Inside LiveJournal's backend, or "holy hell that's a lot of hits!". Presentation at the O'Reilly Open Source Convention, July 2004. http://www.danga.com/words/2004_oscon/oscon2004.pdf.
[13] W. Fontijn and P. Boncz. AmbientDB: P2P data management middleware for ambient intelligence. In Proc. PERWARE Workshop, March 2004.
[14] M. Freedman, E. Freudenthal, and D. Mazières. Democratizing content publication with Coral. In Proc. Symp. on Networked Systems Design and Implementation, pages 239-252, San Francisco, CA, USA, March 2004.
[15] L. Gao, M. Dahlin, A. Nayate, J. Zheng, and A. Iyengar. Application specific data replication for edge services. In Proc. Intl. WWW Conf., May 2003.
[16] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51-59, June 2002.
[17] B. Kemme and G. Alonso. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In Proc. Intl. Conf. on Very Large Data Bases, pages 134-143, Cairo, Egypt, September 2000.
[18] W.-S. Li, O. Po, W.-P. Hsiung, K. S. Candan, and D. Agrawal. Engineering and hosting adaptive freshness-sensitive web applications on data centers. In Proc. Intl. WWW Conf., pages 587-598, May 2003.
[19] C. Olston, A. Manjhi, C. Garrod, A. Ailamaki, B. Maggs, and T. Mowry. A scalability service for dynamic web applications. In Proc. Conf. on Innovative Data Systems Research, pages 56-69, Asilomar, CA, USA, January 2005.
[20] G. Pierre and M. van Steen. Globule: a collaborative content delivery network. IEEE Communications Magazine, 44(8):127-133, August 2006.
[21] C. Plattner and G. Alonso. Ganymed: Scalable replication for transactional web applications. In Proc. ACM/IFIP/USENIX Intl. Middleware Conf., Toronto, Canada, October 2004.
[22] M. Rabinovich and A. Aggarwal. RaDaR: a scalable architecture for a global web hosting service. In Proc. Intl. WWW Conf., May 1999.
[23] M. Rabinovich, Z. Xiao, and A. Agarwal. Computing on the edge: A platform for replicating internet applications. In Proc. Intl. Workshop on Web Content Caching and Distribution, pages 57-77, Hawthorne, NY, USA, September 2003.
[24] M. Ronstrom and L. Thalmann. MySQL cluster architecture overview. MySQL Technical White Paper, April 2004.
[25] RUBBoS: Bulletin board benchmark. http://jmob.objectweb.org/rubbos.html.
[26] S. Sivasubramanian, G. Pierre, and M. van Steen. GlobeDB: Autonomic data replication for web applications. In Proc. Intl. WWW Conf., Chiba, Japan, May 2005.
[27] S. Sivasubramanian, G. Pierre, M. van Steen, and G. Alonso. GlobeCBC: Content-blind result caching for dynamic web applications. Technical Report IR-CS-022, Vrije Universiteit, Amsterdam, The Netherlands, June 2006. http://www.globule.org/publi/GCBRCDWA_ircs022.html.
[28] W. D. Smith. TPC-W: Benchmarking an ecommerce solution. White paper, Transaction Processing Performance Council.
[29] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier internet services and its applications. In Proc. ACM SIGMETRICS, pages 291-302, June 2005.
[30] W. Zhao and H. Schulzrinne. Enabling on-demand query result caching in DotSlash for handling web hotspots effectively. In Proc. Workshop on Hot Topics in Web Systems and Technologies, Boston, MA, USA, November 2006.