Replacement policies for a proxy cache

Paolo Lorenzetti, Luigi Rizzo, Lorenzo Vicisano
Dipartimento di Ingegneria dell'Informazione, Università di Pisa, via Diotisalvi 2, 56126 Pisa
email: p.lorenzetti, l.rizzo, [email protected]

VERY draft version; please take the updated version from http://www.iet.unipi.it/~luigi/caching.ps.gz

August 2, 1996

Abstract

In this paper we analyze Web access traces to a proxy, in order to derive useful information for the development of a good replacement policy for the documents held in the cache. The analysis reveals a number of interesting properties of the lifetime of documents and of the statistics of accesses to them, which are discussed in this paper. These properties show why LRU works reasonably well but can be improved. We propose a simple policy called LRV, which selects for replacement the document with the Lowest Relative Value among those in the cache. The value of a document is computed from information readily available to the proxy server, and the computations associated with the replacement policy require only a small, constant time. We show how LRV outperforms LRU and other policies, and can significantly improve the performance of the cache.

1 Introduction

The caching of Web documents is widely used to reduce both latency and network traffic in accessing data. A first-level cache is usually built into the browser, which allocates a small amount of memory and disk space to store frequently accessed documents. The browser's cache has a limited size and is only used by a single client, so it is mainly useful to store images (backgrounds, icons, etc.) which occur frequently in a set of related documents. A second-level cache is provided by caching proxy servers (proxies), which retrieve documents from the original site (or from another proxy) on behalf of the client. Proxies serve a large set of clients, so they can use the history of requests for a document as a hint on its popularity and decide whether it is worthwhile to store it in the local cache. The large storage area available on the proxy can yield large improvements in the access to remote documents. Several proxy caches have been developed in recent years. Among them, cern httpd, harvest [2], and its successor squid [4] are widespread programs which are available in source format. The latter two use a number of ingenious solutions to improve performance, and introduce the concept of cooperating proxies to make the system perform even better. There are, however, some limitations in the performance of a proxy. Some Web documents are intrinsically uncacheable, usually because they are the output of a program's execution, or because they are marked as uncacheable by the originating server. We will not consider uncacheable documents in this paper, as a proxy has no way of optimizing accesses to them. Once uncacheable documents have been discarded, it turns out that a relatively large fraction of documents is accessed only once by a set of clients; storing them on the proxy is a waste of space, but they cannot be easily distinguished from more popular documents which have seen only one access so far. In our logs, roughly 2/3 of the documents are accessed only once, although these accesses account for only 1/3 of the total. Since the proxy has finite storage, some strategy must

be devised to periodically purge documents which are not interesting, in favor of more popular ones. The problem is essentially one of guessing, among the documents currently in the cache, which one is best to remove in order to make room for new documents. The problem has received limited attention so far, for a number of reasons: the apparent similarity of this problem with other caching problems (e.g. in processor architecture); the relatively recent introduction of the Web; and, especially, the fact that the most common technique to manage caches, LRU, has proved to work reasonably well in the context of proxy caches. Related work is discussed further in Section 4. In this paper we develop a cost/benefit model to determine the relative value of each document in the cache, so as to allow the replacement algorithm to select the document with the Lowest Relative Value (LRV algorithm). The statistical parameters used in the algorithm are derived from a thorough analysis of the logs of our proxy server, totaling approximately 1,300,000 cacheable accesses over a period of five months. As a result, we determine a formula for the computation of the value of a document which allows the LRV algorithm to perform significantly better than LRU and other algorithms. The rest of the paper is organized as follows. In Section 2 we introduce the problem formally and define the basic cost/benefit metric. Section 3 analyzes the probabilistic parameters which influence the value of each document, discussing how they affect already-known replacement algorithms. A discussion of related work is presented in Section 4. In Section 5 we present the algorithm together with a discussion of its implementation. The performance of LRV, compared to other algorithms, is then presented based on real traces and for a number of different cache sizes.

2 Problem's definition

A proxy aims to minimize two parameters: the response time in serving documents to a client, and the network traffic to the outside. The hit rate (HR) and byte hit rate (BHR) are two parameters which describe reasonably well the effectiveness of the proxy. They indicate the fraction of documents and bytes, respectively, which are served from the cache instead of being requested from the network. BHR is a direct measure of the savings in network traffic. In computing these two parameters, uncacheable documents are usually discarded. The response time in serving documents also depends strongly on both HR and BHR.

We call history the sequence of accesses to a proxy. The history holds information on the URL, size, type, requestor and transfer time for each document, and is usually available through the access logs recorded by the proxy server. We call future history the history of the accesses issued after a given time t_0; the remaining part of the history is the past history. With respect to a given time t_0, the accessed documents are all documents whose URLs appear in the past history. We call dead documents the subset of accessed documents which do not appear in the future history. Documents die because they are removed at the source, change their content (thus becoming new documents), or simply because nobody accesses them anymore. Live documents are the subset of accessed documents which also appear in the future history.

Let D_i be the set of documents accessed at least i times. A parameter which we will often use is P_i = |D_{i+1}| / |D_i|, corresponding to the probability that a document is accessed again after the i-th access. P_1 is a direct indication of the percentage of documents for which caching is useful. The D_i's and P_i's can be computed on a subset of documents, in which case we add a suffix to the names to indicate the selection criterion.

In order to increase the hit rate, a proxy might try to prefetch documents, anticipating clients' requests, so that even the first request for a given document is served by the cache. In principle, this approach might have no influence on the total amount of network traffic, bring the hit rate to 100%, and reduce the latency experienced by clients in accessing documents. In practice, anticipating clients' requests is non-trivial, prone to errors, and it usually has high costs in terms of additional network traffic and storage use. The only case in which prefetching can be done easily, that of embedded documents such as icons, images and the like, only anticipates clients' requests by a small amount of time, so that the relatively
large increase in the hit rate does not reflect a corresponding increase in performance as perceived by clients. In this paper we will not consider prefetching proxies. Both our traces and other data available in the literature show that the maximum hit rate, both in documents (HR_max) and in bytes (BHR_max), that we are dealing with is much lower than unity. Given the history of accesses to a proxy, we can easily compute HR_max and BHR_max. Such hit rates are easily achieved by a cache with sufficient storage to hold all accessed documents (in principle, it suffices that the cache is large enough to hold all live documents). With a cache of finite size, we generally need to purge documents from the cache to make room for new ones. Given sufficient disk space, an optimal replacement policy would purge dead documents and keep all live documents. Clearly, such a decision can only be taken if the future history of accesses is known. Such a caching policy is able to achieve both BHR_max and HR_max if there is enough room to store the set of all live documents. If the cache is not large enough to store all live documents, optimal policies can still be defined as those maximizing HR or BHR, and can be used as a reference in evaluating the goodness of a cache replacement algorithm. Note that even if the future history were known (which it is not), an optimal policy would be computationally too hard to implement.
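As an illustration, HR_max, BHR_max and the P_i parameters defined above can be obtained with a single pass over the access log. The sketch below is not the code used for the measurements in this paper; it assumes a preprocessed trace in which each request is a pair <document id, size>, with ids being small integers (as in the sanitized traces of Section 5.2).

/*
 * Minimal sketch (assumed trace format, not the paper's code): read a
 * sanitized trace of "<doc_id> <size>" pairs and compute HR_max, BHR_max
 * and the parameters P_i = |D_{i+1}| / |D_i|.
 */
#include <stdio.h>

#define MAX_DOCS 500000         /* distinct URLs in the trace (assumption)    */
#define MAX_REFS 1000           /* largest access count tracked for P_i       */

static long naccess[MAX_DOCS];  /* accesses seen so far for each document     */
static long d[MAX_REFS];        /* d[i] = |D_i|, documents with >= i accesses */

int main(void)
{
    long id, sz, requests = 0, hits = 0;
    double bytes = 0, hit_bytes = 0;

    while (scanf("%ld %ld", &id, &sz) == 2) {
        if (id < 0 || id >= MAX_DOCS)
            continue;                   /* ignore ids outside the table        */
        requests++;
        bytes += sz;
        if (naccess[id] > 0) {          /* repeat access: a hit if the cache   */
            hits++;                     /* were large enough to keep the copy  */
            hit_bytes += sz;
        }
        if (++naccess[id] < MAX_REFS)
            d[naccess[id]]++;           /* the document has just entered D_i   */
    }
    if (requests > 0 && bytes > 0) {
        printf("HR_max  = %.3f\n", (double)hits / requests);
        printf("BHR_max = %.3f\n", hit_bytes / bytes);
    }
    for (long i = 1; i + 1 < MAX_REFS && d[i + 1] > 0; i++)
        printf("P_%ld = %.3f\n", i, (double)d[i + 1] / d[i]);
    return 0;
}

Run over the whole trace, this is the kind of one-pass measurement behind the HR_max, BHR_max and P_1 figures discussed in Section 3.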

2.1 A cost/benefit model

Purging a document from the cache has a benefit and a cost. The benefit, B, essentially comes from the amount of space freed, which is roughly proportional to the size of the document plus its metadata, possibly rounded up to a multiple of the file system's block size. The cost can be expressed as the cost, C, of fetching the document again from the original site, multiplied by P_r, the probability that the document is accessed again in the future. Several different metrics are commonly used to compute C:

connections: each retrieval of a document has the same cost, so we can assume C = 1. This is only appropriate when the cost of setting up a communication with the server dominates over the other costs.

traffic: in this case C = size(document) + overhead, i.e. the amount of traffic needed to establish a communication, request the document and get the response. This metric is appropriate when the communication speed is independent of the source of a document, or when communication costs are based on the number of bytes transferred.

time: in this case the cost of a retrieval is proportional to the time needed to fetch the document, so C = t_fetch. This number corresponds roughly to the product of the size of the document and the inverse of the bandwidth towards the server. This metric is useful when the bandwidth to different servers has large variations, or when we want to minimize the delay in serving documents to clients.

Given the cost and benefit of purging a document, we want to determine how valuable the document is for the proxy. To this purpose, we can define the value, V, of a document (relative to the other documents in the cache) as

    V = P_r * C / B

The computation of C and B is relatively easy, as it only depends on the size of the document and possibly on the bandwidth towards the server. These values are readily available to the proxy. Also, using metrics based on traffic or time, C/B is often approximately constant and independent of the size of the document. A metric based on connections tends to overvalue small documents. Since a cache is usually sized in bytes, not in number of documents, this metric improves the HR, but certainly not the BHR (which corresponds to saving communications, something of interest to those who run the proxy) or the response time to clients (which is of interest to the clients). The computation of P_r is more complex. P_r is in general different for each document, and is time-dependent. What we try to do in the next sections is to determine whether it can be estimated by looking at the document itself (size, type, server, etc.) and at the history of previous accesses to the document, as seen by the proxy.
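To make the definition concrete, the sketch below computes V = P_r * C / B for a single document under each of the three metrics. The struct fields and the two overhead constants are illustrative assumptions; only the formula itself comes from the text above.

/* Illustrative sketch of V = Pr * C / B for one document; field names and
 * the overhead constants are assumptions, not values from the paper. */
#include <stddef.h>

enum metric { CONNECTIONS, TRAFFIC, TIME };

struct doc {
    size_t size;        /* document size in bytes                      */
    double fetch_time;  /* seconds needed to fetch it from its server  */
    double pr;          /* estimated probability of a further access   */
};

double doc_value(const struct doc *d, enum metric m)
{
    const double REQ_OVERHEAD  = 512.0;  /* bytes to set up and issue a request (assumed) */
    const double META_OVERHEAD = 256.0;  /* per-document metadata in the cache (assumed)  */
    double benefit = (double)d->size + META_OVERHEAD;   /* space freed by purging */
    double cost;

    switch (m) {
    case CONNECTIONS: cost = 1.0;                            break; /* every fetch costs the same  */
    case TRAFFIC:     cost = (double)d->size + REQ_OVERHEAD; break; /* bytes moved on the network  */
    case TIME:        cost = d->fetch_time;                  break; /* latency of a new retrieval  */
    default:          cost = 0.0;                            break;
    }
    return d->pr * cost / benefit;   /* the document with the lowest V is purged first */
}

With the traffic metric, C/B is close to constant for all but the smallest documents, so V is essentially driven by P_r; this is why the following sections concentrate on estimating P_r.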
Figure 1: The total size of requested documents, depending on time and on the number of clients

3 Trace evaluation

We have based our analysis on the history of cacheable accesses recorded by our departmental proxy over a period of five months. The trace, about 1,300,000 accesses, includes about 1,000 clients, 20,000 servers and 450,000 different URLs. We are well aware of the existence, on the Internet, of some large sites running proxies with 10-100 times our traffic. We do not expect all of our results to hold for these very large proxies (although some of the properties might well be the same). Nevertheless, our traces are significant because they are representative of a probably much larger number of installations where a proxy serves a relatively small community of local users. The set of all accessed documents in our data set amounts to about 7 GB of data. Only a small fraction of the documents is accessed more than once (P_1 = 0.325), but HR_max = 0.66, meaning that documents accessed more than once feature a large number of accesses. The value of BHR_max = 0.50 means that short documents are more popular. Considering the relatively small size of our data set, it is unreasonable to think that a cache can store all of the accessed documents, no matter how big its disks are. The number (and the total size) of the accessed documents grows with time and with the number of clients. This is shown in Figure 1, where the various curves are computed by considering the first n clients of the proxy, starting with n = 10 and doubling the number each time. Actually, it is not necessary that the cache hold all accessed documents to achieve the maximum hit rate.
Live documents are much fewer, and they reach, in our dataset, a maximum occupation of 450 MB (see Figure 2), when computed on the whole length of our traces. It is probably more appropriate to consider dead those documents which have not been requested for more than some reasonably large time (e.g. 1-2 weeks). For reference, Figure 2 also shows the size of live documents when limiting the lifetime of a document to 1-10 weeks. It is interesting to note that the curves with a shorter lifetime are relatively flat in certain areas. The number of active clients over time is shown in Figure 3 (considering inactive a client which does not issue requests for longer than some time t = 1-10 weeks). As can be seen from the curves with the shorter lifetimes, the set of clients is approximately constant for the first 400,000 requests, and then roughly doubles.
Figure 2: Total size of live documents vs. requests.

keep-alive   1 week   2 weeks   5 weeks   10 weeks   no limit
BHR_max      0.418    0.449     0.483     0.499      0.505
HR_max       0.537    0.581     0.628     0.649      0.657
Table 1: The maximum achievable BHR and HR versus the document keep-alive threshold



Figure 3: Active host clients vs. requests.

By comparing Figures 2 and 3, we tend to believe that the size of live documents is only a function of the number of clients; as such, it should remain relatively constant over time for a fixed number of clients. Table 1 also shows the values of HR_max and BHR_max with reduced lifetimes. It is noticeable that a lifetime of 2 weeks brings the size of live documents down to 1/3, but only reduces HR_max and BHR_max by roughly 10%. A study of the relation between live documents, clients and hit rates is still a subject of investigation.

It is interesting to note that, over time, new documents are generated and old ones die at an approximately constant rate. This can be seen in the central part of the curves of Figure 4 (the initial and final parts of these curves are not significant: at the beginning, because the cache is `cold', new objects are generated at a higher rate and old objects die at a lower rate because there are not enough old objects; at the end, because all objects die, leading to a higher death rate). The slope of the curves in Figure 4 depends on the type and number of clients. In theory, a proxy having a single client should only get requests corresponding to misses in the first-level cache. The limited size of the latter, and the frequent use of the "reload" button, tend to make the slope of the curves much lower than unity. As the number of clients increases, their sets of interesting documents intersect, and we can expect a further reduction in the slope of the curves. Our data, however, suggest that there is relatively little overlap among the interests of our 1,000 clients: even by choosing relatively large (10-50 units) groups of clients and merging their requests, it appears that the subset of common documents is quite small, usually between 5 and 10%. We believe this depends on three reasons. First, the different types of people served by the proxy: undergrad students, PhD students, faculty, people from a couple of other universities in different regions, and a number of users scattered across the network who have come to know of the presence of our proxy and use it. Second, at our site (and, we believe, in most places) the Web appears to be used a lot for personal rather than professional purposes, which reduces the set of common requests. Third, most local services are implemented as CGI-BIN programs, and thus appear as uncacheable. Further analysis (not shown in the graph) reveals that there are huge differences in the number of accesses made by single clients, and even by large groups of clients. This again derives from the presence, among our clients, of different types of systems, including some computers in a student lab, used almost exclusively as Web browsers and shared by many different people, and personal workstations mostly used for more "serious" work.
Figure 4: Cumulative number of births (new documents) and deaths (plotted in negative) vs. the requests issued (event counter).

Having characterized the information contained in our log files, we can now start the analysis of the parameters which influence the probability P_r of a new access to the same document. In order to make guesses at which documents to keep, we need to find estimators able to select among documents. Hence, we compute the conditional probabilities of a document being accessed again, depending on various known pieces of information such as the time from the previous access, the number of previous accesses, the server of the document, and the client originating the first request.

3.1 Time

Figure 5 shows the distribution of times between consecutive requests to the same document, D(t). The time axis is logarithmic to ease the reading of the graph. We have identified at least two distinct areas: the first one covers up to one day from the previous access, while the second one covers the remaining time scale. Globally, about 60% of accesses occur within one day, with a marked peak around 24 hours and a relatively flat area between 8 and 18 hours. This is not surprising because our users reside within the same timezone, and most accesses occur during office hours. About 20% of accesses occur in the first 15 minutes, and 10% in the first minute. This is an indication of frequent reloads, and possibly a consequence of a feature of harvest and squid, which can retrieve documents in the background: some users often do a first pass touching interesting documents, then request them again later when the transfer is presumably complete. In the second area (from one day to the end) there is an approximately exponential decay of interaccesses. The daily peaks are still visible but have a decreasing amplitude. Drawing the graph of Figure 5 with events instead of time on the x-axis smoothes out the daily perturbations, but makes the position of the knee dependent on the traffic on the server. Figure 5 also shows the distribution of interaccesses for the i-th/(i+1)-th pair.
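A distribution like the one in Figure 5 can be obtained by remembering, for each document, the time of its previous access and histogramming the differences on a logarithmic time scale. The sketch below is illustrative only; the input format ("<unix_time> <doc_id>" per request) and the bin layout are assumptions, not the programs used for the paper.

/* Sketch: cumulative distribution of interaccess times, D(t), on a
 * logarithmic time scale.  Input format and bin layout are assumptions. */
#include <stdio.h>
#include <math.h>

#define MAX_DOCS 500000
#define NBINS    64     /* bin k collects interaccess times in [2^k, 2^(k+1)) seconds */

static double last_seen[MAX_DOCS];      /* time of the previous access, 0 = never seen */
static long   bin[NBINS];

int main(void)
{
    double t;
    long id, total = 0, cum = 0;

    while (scanf("%lf %ld", &t, &id) == 2) {
        if (id < 0 || id >= MAX_DOCS)
            continue;
        if (last_seen[id] > 0) {
            double dt = t - last_seen[id];
            int k = (dt >= 1.0) ? (int)(log(dt) / log(2.0)) : 0;
            if (k >= NBINS)
                k = NBINS - 1;
            bin[k]++;
            total++;
        }
        last_seen[id] = t;
    }
    for (int k = 0; k < NBINS && total > 0; k++) {
        cum += bin[k];
        printf("D(%8.0f s) = %.3f\n", pow(2.0, k + 1), (double)cum / total);
    }
    return 0;
}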
Figure 5: Distribution of interaccess times, D(t): i/(i+1)th access and all accesses to the same document.

Figure 6: Probability density function of interaccess times, d(t).

avoidable differences exist (documents with more accesses also have them closer to each other), the curves retain approximately the same shape. The dependency of P_r on t, the time from the last access, can be expressed as P_r(t) = 1 - D(t). Since D(t) is a distribution function, P_r(t) always decreases with time, independently of the shape of D(t); thus, a policy such as LRU is always the best one if we only consider the time from the last access. However, we are interested in the actual shape of D(t) because we want to use other parameters in the computation of P_r. To get a suggestion on how to approximate D(t) we can look at Figure 6, showing its derivative d(t) (the p.d.f. of interaccess times). It turns out that d(t) can be reasonably approximated by k/t in the first part (t < 1 day), while in the last part it is better approximated by a decaying exponential. Integrating the two functions yields a piecewise approximation for D(t).
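A sketch of the shape such an integration gives is shown below; a, b and \lambda are placeholder constants that would have to be fitted to the traces (they are not taken from the paper), and the breakpoint T is at one day.

% Illustrative piecewise form only; a, b, \lambda are placeholders to be
% fitted to the traces, T = 1 day is the breakpoint visible in Figure 5.
D(t) \approx
\begin{cases}
  a \ln t + b, & t \le T,\\
  1 - \bigl(1 - D(T)\bigr)\, e^{-\lambda (t - T)}, & t > T,
\end{cases}
\qquad 0 \le D(t) \le 1 .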
5.1 When to do garbage collection

Many proxies do garbage collection (GC) periodically, starting when the cache occupation passes an upper threshold and continuing until it goes below a lower threshold. The main motivation for such a behavior is that selecting the document(s) to replace often requires a sort operation, and keeping documents always sorted might be expensive.

BHR      100MB   250MB   500MB   750MB   1GB
LRU      0.320   0.374   0.413   0.434   0.446
FIFO     0.308   0.359   0.397   0.418   0.431
SIZE     0.212   0.277   0.331   0.365   0.392
NREF     0.313   0.368   0.402   0.419   0.434
LRV      0.347   0.393   0.429   0.446   0.458

HR       100MB   250MB   500MB   750MB   1GB
LRU      0.406   0.483   0.538   0.566   0.584
FIFO     0.381   0.454   0.508   0.537   0.558
SIZE     0.419   0.525   0.592   0.620   0.635
NREF     0.456   0.518   0.558   0.576   0.592
LRV      0.482   0.539   0.577   0.597   0.610
Table 3: The values of BHR and HR for different policies and cache sizes.

However, such a technique is wasteful: if the thresholds are too close to each other, garbage collection is frequent and a lot of time is spent sorting documents; otherwise, less time is wasted but a fraction of the cache space is left unused. In both cases, the impact is higher in small caches. If possible, the replacement policy should be implemented so as to make garbage collection a constant-cost operation, in order to fully exploit the available storage. As a matter of fact, for many policies, such as LRU, SIZE and FIFO, keeping documents sorted all the time is a relatively inexpensive operation, requiring constant or logarithmic (in the number of documents) time at each request. As we will show, LRV requires constant-time operations to keep documents sorted, so that incremental GC is possible and the storage can be fully used.
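The difference between the two approaches can be sketched as follows on a toy cache that only tracks byte counts; every name here is an illustrative assumption, not squid's API or the implementation evaluated in this paper.

/* Sketch: threshold-based garbage collection vs. on-demand replacement,
 * on a toy cache that only tracks byte counts.  All names are assumptions. */
#include <stdio.h>

static long used = 0;                     /* bytes currently in the toy cache */
static const long CACHE_SIZE = 1000000;   /* 1 MB toy cache                   */

/* stand-in for "purge the document with the lowest relative value":
 * here it just frees an arbitrary 10 kB chunk */
void purge_lowest_value(void) { used -= 10000; if (used < 0) used = 0; }

/* Periodic GC: do nothing until the high-water mark is passed, then purge in
 * a batch down to the low-water mark.  Occupation oscillates between the two
 * thresholds, so part of the storage is never used. */
void gc_periodic(double lo, double hi)
{
    if (used < hi * CACHE_SIZE)
        return;
    while (used > lo * CACHE_SIZE)
        purge_lowest_value();             /* typically preceded by a sort */
}

/* On-demand replacement: purge just enough to fit the incoming document.
 * If documents are kept ordered by value at constant cost per request (as
 * LRV allows), this uses the whole cache and never needs a global sort. */
void make_room(long incoming)
{
    while (used + incoming > CACHE_SIZE)
        purge_lowest_value();
}

int main(void)
{
    for (int i = 0; i < 200; i++) {       /* store 200 documents of 8 kB each   */
        make_room(8000);                  /* alternative: gc_periodic(0.9, 1.0) */
        used += 8000;
    }
    printf("occupation with on-demand replacement: %ld of %ld bytes\n",
           used, CACHE_SIZE);
    return 0;
}

In a real proxy, purge_lowest_value() would pop the head of the value-ordered structure that LRV keeps up to date at constant cost per request.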

5.2 Performance

The performance of LRV, compared to other algorithms such as LRU, SIZE, FIFO and others, has been simulated based on traces of accesses to our proxy. Although a general-purpose simulator for networks of cooperative proxies has been developed [6], the latter has only been used for initial evaluation. Most experiments have been done with a few short C programs. One of them in particular contains a full implementation of LRV (as well as of the other policies which have been evaluated). The various programs process sanitized traces where strings (e.g. URLs, client names, etc.) are replaced by unique numbers, thus allowing very fast processing. The performance, for different cache sizes, is shown in Table 3. It must be kept in mind that the set of documents accessed at least once amounts to 7 GB of data, while the set of live documents always stays below 450 MB (see Figure 2), hence we have used comparable cache sizes (100 MB to 1 GB). For simplicity, we have run garbage collection between two thresholds set at 90% and 100% of the cache size. Table 3 and Figure 11 show the values of HR and BHR for different policies and cache sizes. As can be seen, LRV shows a consistently higher BHR than the other policies in all conditions. The same happens for the HR, except in the case of the SIZE policy with caches of 500 MB or more; however, this is not accompanied by a comparable BHR. Reducing the cache size causes the SIZE policy to worsen because of the "pollution" of the cache with small documents which are never replaced (see Figure 13). This phenomenon does not appear with larger cache sizes only because our traces are not long enough. It is interesting to evaluate the number of errors made by the various replacement policies in discarding documents which will be accessed again. Figure 12 shows these values for documents with different numbers of accesses, obtained using a cache size of 500 MB. On the right, we show the cumulative number of errors.



i

P

i

6 Conclusions and future work

We have shown how the probability of a document being reaccessed depends upon a number of parameters, and we have derived an easy-to-compute formula to determine the relative value of each document. Based on these dependencies, and on a cost/benefit model for Web documents, we have designed a replacement policy called Lowest Relative Value (LRV) to achieve a better selection of the documents to purge. We have shown how LRV overcomes some limitations of LRU and other policies, and that it can outperform them in all cases. LRV proves to be particularly useful in the presence of small caches. The implementation of LRV is trivial, and it can easily be inserted in publicly available proxy servers such as squid. With a proper organization of data, keeping documents sorted by value requires constant-time operations on data held in main memory, thus allowing replacement on demand rather than periodically. Future work will be devoted to the application of the methods shown in this paper to traces coming from other proxies. Also, the statistical properties of live documents, dead documents, and of the errors made by the replacement policy need to be studied in more detail.

to be completed...

References

[1] M. F. Arlitt, C. L. Williamson, "Web Server Workload Characterization: The Search for Invariants", Proc. of SIGMETRICS '96, May 1996, Philadelphia, PA, USA.

[2] A. Chankhunthod, P. Danzig, C. Neerdaels, M. Schwartz and K. Worrell, "A Hierarchical Internet Object Cache", Proc. of the 1996 USENIX Technical Conference, January 1996, San Diego, CA, USA.

[3] R. Karedla, J. S. Love, B. G. Wherry, "Caching Strategies to Improve Disk System Performance", IEEE Computer, vol. 27, no. 3, pp. 38-46, March 1994.

[4] "Squid Internet Object Cache", http://www.nlanr.net/Squid/.

[5] S. Williams, M. Abrams, C. R. Standridge, G. Abdulla, and E. A. Fox, "Removal Policies in Network Caches for World-Wide Web Documents", Proc. of ACM SIGCOMM '96, August 1996, Stanford University, CA, USA.

[6] P. Lorenzetti, "Simulatore per cluster di proxy" (a simulator for proxy clusters), Degree Thesis, University of Pisa, 1996 (in Italian).
