UU CS

Collecting garbage in shared tmp file space

introduction

In shared tmp file space many users can create as many files as they like. Users expect the system to clean up.

In an organisation with various groups of users, confined to selected machines and networks, it may be useful to provide means for transporting data between users, machines and networks. When users are restricted to a certain amount of filespace (their disk quota), it may be useful to provide users with 'tmp' file space where no quota limits are enforced.

Both problems can be solved by implementing 'global tmp space' (TMP): a shared file tree, available on all cliens on all networks where users can create their own file trees at will. This is easy to implement. The problem is cleaning up.

Without a cleanup mechanism, TMP will completely fill up with garbage. This garbage has to be managed; in a simple, effective manner. Who wants to spend time managing garbage?

requirements

It would appear that an investigation of the cleanup procedure is central, because this procedure determines the kind of services that can be realised and the acceptable usage policies that can be enforced.

requirements: a file server with a suitable amount of disk space

The right value for 'suitable' depends on the kind of service you want to provide to your users, and the cleanup habits you can expect from your users. Suppose users consume 1 GB of free space each day and never cleanup, and you want to provide a minimal lifetime of 5 days, then you will need 5 GB at least, or 10 GB to be marginally safe in case of peaks. The predictablility of consumption and cleanup behaviour grows with group size.

requirements: a set of rules stating acceptable usage

A shared system is only useful, if it is predicatable. To make a system predictable, there have to be rules and the rules must be enforced.
If users abide by the rules, they should get the service they were promised; if they don't play by the rules, they should be excluded from the service.

An acceptable usage policy serves to exclude usage that disturbs the predictability of the system.

requirements: a description of the service that will be provided

It seems reasonable to indicate the minimum and maximum lifetime of files. There may be other rules and proviso's like

requirements: some form of cleanup procedure

Not all users can be expected to cleanup after themselves. That shouldn't be a problem. Note that Unix/Linux lacks some useful primitives (file creation date, access counters), so some extra bookkeeping will be required.

requirements: some means of 'attitude adjustment' of misbehaving users

One must be able to detect and act upon unwanted behaviour of users. Temporary exclusion from the service should solve emerging and future problems.

cleanup

Cleanup of TMP is reminiscent of garbage collection and caching techniques. However, mistakes are not just costly because they can not be undone. Because of the nature of the problem, the cleanup procedure can at best be a working, sensible strategy and not the implementation of some nice semantic framework. There are no clear goals in managing garbage.

Two per-file parameters we can work with are file size S, and (with a little effort) file age T. Note that Unix/Linux has no 'file creation date' and other filesystem timestamps can't be trusted. By periodically scanning the filesystem we can spot new arrivals and note their time of entry.
The group id (gid) of a file can be used to parametrize the cleanup procedure for different groups of users.

When free disk space reaches a lower threshold, we have to remove some files, so we have to make a choice.

For a file F, let X(F) denote the expendability of file F (expendability is just the opposite of priority, high expendability means low priority). Let X(S,T) denote the expendability of a file with size S and age T.
The cleanup procedure determines X(F) for all files F, and throws away the most expendable files, until there is 'enough' free space.

Common sense suggests that X(F) should be a rising function in size(F) and age(F).

X(S,T¹) > X(S,T²)   if   T¹ > T²
X(S¹,T) > X(S²,T)   if   S¹ > S²
Note: superscripts should be read as subscripts :-).

cleanup: variations on X

In comparing possiblilities, lets consider small files (± 1KB), medium (± 1MB) and big files (± 1GB). Note that file size spans many orders of magnitude; storing one big file requires as much disk space as 1 million small files.

old files first: X(S,T) = T

This strategy is also known as first in, first out (FIFO). Filespace is maintained as a queue. In the stable situation, TMP will be contain relatively few files because relatively many big files will consume a lot of space.

big files first : X(S,T) = S

This strategy maximises the number of files in the filesystem, but it is rather unfair for the users who happen to have the biggest files. Also, many medium sized files can occupy the archive for the maximum term, while big files are thrown out right after they appear.

counting costs : X(S,T) = S × T

S × T can be interpreted as the cost incurred for storing a file of size S for a time T; the cost incurred each day is proportional to the size of the file.
This strategy favors small files, so we can keep a lot of those. Medium size files, as they grow older, have to compete more and more with newly arrived big files. This creates more room for new big files, at the cost of old medium sized files.

Imagine the list of files, ordered from high to low expendability. All files start at the bottom (because T=0), but the big files rise faster than the small files. After a while (as in septic tanks and organisations) the big chunks float on top.

further variations

In the previous example, X is linear in time and space. For S and T one can substitute any rising function
X(S,T) = S × ( T × T )
X(S,T) = S × log(S) × T
In the first example, time is made more important (files age more quickly); in the second example size is more important.

If X is linear in space or time, it doesn't matter which units you use for space (byte, KB, MB) or time (seconds, days), since X will only change a constant factor.

directories

Directories are special. Empty directories are useless and can be regenerated at will. Non-empty directories can't be thrown out even if they are very old. A strategy that favors small files, will keep empty, useless directories long after it has thrown away all files contained in such directories.

It seems useful to treat the directory as a special case: simply remove empty directories after a short, fixed period.

toplevel files

If users pollute the top level of the file tree, the toplevel directory listing becomes unwieldy. One could introduce extra 'expendability points' for toplevel files to encourage users to organise their stuff in subtrees.

parametrisation for different groups

You may want different strategies for different groups (say 'staff' and 'students') sharing the TMP server. If you define
X(S,T) = S × T × 1000 if gid (file) == 'student'
X(S,T) = S × T if gid (file) == 'staff'
then big staff files will compete with medium size student files, medium size staff files will compete with small student files, and small staff files wil have no competion from student files. Change the 'student expendability factor' to 1.000.000 and staff files will only have to compete with small student files and other staff files.

support

You must choose a expendability function that suits your needs. Choose one that works in general, with regards to available disk space and general user behaviour.
Then you must define what behaviour would disturb the predictability of the system and publish an acceptable usage policy. Now you must weed out the non-complying users.

support: crime and punishment

Big files are a nuisance, but they are few and easy to find. It's easy to create a procedure that lists recent big files. If these files violate accepted usage, the files must be removed and the user must be (temporarily) excluded from the service.

One possible implementation is maintaining a blacklist. Users on the blacklist are prevented from creating files in TMP. If prevention is not feasible, in periodic scans of TMP one can simple remove all files owned by users on the blacklist. Maintenance can be simplified by setting an expiration date for new blacklist entries. Furthermore, a per/user count could be used to compute the expiration date, so if a user gets blacklisted for the N-th time, he/she gets blacklisted for N weeks (or N×N weeks).

experience

At the computer science department, Utrecht University, we have for some months experimented with cleaning up. Some traits: At first we only experimented with the expendibility function; for years we (rather crudely) threw away all student files every night. Then we tried X(S,T) = T and later X(S,T) = S and were not satisfied with either one of them.
Then we implemented the blacklist. It seemed pointless to improve management of a system where 'rogue' users could go on unchecked. The blacklist proved very succesful.
We then changed the expendibility function to X(S,T) = S × T. However, since users were so much more wellbehaved than before, free diskspace was hardly ever a problem.

Now, even if we use only 4 GB of the available 18 GB, almost all files sit out their two week lifetime, where before very often 18 GB was not enough. All this makes the choice of X(S,T) rather irrelevant.
However, if we run 'what if' tests simulating a 2 GB disk, the results are satisfactory.

references


GCS Projecten
UU CS INFO FIND INDEX
Henk Penning, Sun Feb 3 22:35:35 MET 2002