File Caching - Preventing Running Out of Space

One project I take care of creates a large number of sizable files (which can be reproduced on demand). Each file takes about a minute to create and is typically only accessed once; however, some files are more popular than others and, once created, are accessed repeatedly. In the early days of this project I had 1TB of space available, didn’t think about what could happen and left the project to run. Gradually all space was eaten up by the storing of these files, eventually leading to the question:

How to efficiently store as many files as possible without running out of space?

If you’re familiar with Linux commands the answer may be obvious, but if you’re not (or maybe even if you are and it’s pre-coffee time) this is what I’d like to document today.

In the beginning

Let’s call all these files what they are: a file cache. They are files which can be reproduced (at a cost: time) but are left lying around to reduce or prevent duplicate effort. When this file cache occupies 1% of the available space on a server it’s hardly important to think about how to get rid of them; when it occupies 100% of the available space it’s a different matter.

In my own case the rise from 1% to, say, 50% was slow - it took months. However the rise from 50% to 100% happened overnight, caught me a little by surprise and caused “some problems”, which were quickly solved with a call to:

# Delete all files not accessed in a couple of weeks
find /some/path -type f -atime +14 -print0 | xargs -0 rm

Pop that in a daily crontab and done. Fixed. Time goes by and whuh-oh the server has had another blip of traffic and it’s filling up with files again faster than they can be deleted. So 2 weeks is too long, let’s go for 1 week:

# Delete all files not accessed in a week
find /some/path -type f -atime +7 -print0 | xargs -0 rm

Done. Fixed. Whuh-oh holy moly another blip of traffic…. going to have to get more space and change that crontab to be much less time…

# Delete all files not accessed in 2 days
find /some/path -type f -atime +2 -print0 | xargs -0 rm

Done. Fixed.

Boo, traffic has gone down again and now there’s 95% free space available; it’s not necessary to be so aggressive deleting those cached files - but increasing the time they are allowed to hang around risks filling the disk and running out of space again.

There has to be a better way.

A better way

After some thought on the matter this was the process I needed:

  • Determine which physical drive a folder is on
  • Derive the amount of free space needed as a % of drive size
  • Construct a list of files ordered by last access time (oldest first)
  • Construct a list of files to delete
  • Delete files until free space reaches the threshold

Each of those tasks is very simple, and some googling will turn up a few useful results and plenty that probably aren’t.

Determine which physical drive a folder is on

I’ve used df countless times yet didn’t know until addressing this task that you can use it to tell you which drive a folder is on:

$ df -h /some/path/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.0T  1.4T  494G  75% /some/path

Well that was easy!

We won’t be wanting human-readable output, so drop the -h flag; to strip off the first line of the response we can simply use awk:

$ df /some/path/ | awk 'NR>1{print}'
/dev/sdb1      2113786796 1488804468 517608200  75% /some/path

Derive the amount of free space needed as a % of drive size

The previous command gives a result which is easily parsable.

To get the drive size (with the -h flag dropped, df reports 1K blocks):

$ df /some/path/ | awk 'NR>1{print}' | awk '{print $2}'
2113786796

To get the current free space (again in 1K blocks):

$ df /some/path/ | awk 'NR>1{print}' | awk '{print $4}'
517608200

Let’s say we want 30% of the drive free; using a couple of variables that becomes:

$ STATS=`df /some/path/ | awk 'NR>1{print}'`
$ SIZE=`echo $STATS | awk '{print $2}'`
$ FREE=`echo $STATS | awk '{print $4}'`
$ TARGET_FREE=`expr $SIZE \* 3 / 10`

$ echo $TARGET_FREE
634136038

How many bytes to delete? Since df reports 1K blocks, multiply the difference by 1024 to get bytes:

$ BYTES_TO_DELETE=`expr \( $TARGET_FREE - $FREE \) \* 1024`

$ echo $BYTES_TO_DELETE
119324506112

Rock and roll, only need to delete 119,324,506,112 bytes, or roughly 119GB, today.
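
One thing worth noting (and something the eventual script takes care of): if the drive already has more than the target free, that difference will be zero or negative and there’s nothing to do. In script form that’s a small guard - shown here as a sketch, as it would appear in a script rather than at the prompt:

# Do nothing at all unless necessary
if [ "$BYTES_TO_DELETE" -le 0 ]; then
    exit 0
fi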

Construct a list of files ordered by last access time

This isn’t quite as simple as it sounds if you want to do it in an efficient way. However, a read of find’s man page yields the following:

$ # find
$ #     %A+ - Last access time, YYYY-MM-DD+HH:MM:SS.N
$ #     %p  - Filename (full path)
$ #     %s  - Size in bytes
$ find /some/path -type f -printf "%A+::%p::%s\n"
2015-01-04+00:28:05.7512236840::/some/path/52/52/some-file.zip::3803767
...

Pipe that through sort and we have a list of files ordered by last access time (oldest first), with the path and the size in bytes:

$ find /some/path -type f -printf "%A+::%p::%s\n" | sort > /tmp/all-files

Construct a list of files to delete

With a parsable list of files there are many ways in which to trim it down to just the files to delete - one intriguing post I found did this with awk, and adapting that to this use case was easy:

$ cat /tmp/all-files | awk -v bytesToDelete="$BYTES_TO_DELETE" -F "::" '
  BEGIN { bytesDeleted=0; }
  {
  # print paths (oldest first) until enough bytes are accounted for;
  # checking before adding the current size means the file that tips
  # over the threshold is still included
  if (bytesDeleted < bytesToDelete) { print $2; }
  bytesDeleted += $3;
  }
  ' > /tmp/files-to-delete

Looking at the contents of /tmp/files-to-delete, it should be the same list of files as the input (paths only), truncated at the point where free space would reach the desired amount.
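
Before deleting anything it’s easy to double-check that the list accounts for the right number of bytes - du can total it up (a quick sanity check only, not part of the final script):

$ cat /tmp/files-to-delete | xargs -r -d '\n' du -bc | tail -n 1

The last line is a grand total in bytes, which should come out at (or just above) $BYTES_TO_DELETE.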

Delete files until free space reaches the threshold

The final step is very simple:

$ cat /tmp/files-to-delete | xargs -d '\n' rm

Tada, free space is now (at least) 30% of the drive.

Complete script

If you’d like a drop-in script to do what is described here, here you go:
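
Something along these lines does the job - a minimal sketch stitching the steps above together (the name purgePath matches the crontab entry below; the 30% threshold, the --dry-run flag and the exact option handling are illustrative choices rather than anything set in stone):

#!/bin/bash
#
# purgePath - delete least-recently-accessed files from a folder until
# the drive it lives on has (at least) the desired percentage free.
#
# Usage:
#   purgePath [--dry-run] /some/path

TARGET_FREE_PERCENT=30

DRY_RUN=0
if [ "$1" = "--dry-run" ]; then
    DRY_RUN=1
    shift
fi

CACHE_PATH="$1"
if [ -z "$CACHE_PATH" ] || [ ! -d "$CACHE_PATH" ]; then
    echo "Usage: $0 [--dry-run] /some/path" >&2
    exit 1
fi

# Which drive is the folder on, how big is it and how much is free?
# df reports 1K blocks, so convert to bytes for comparison with file sizes
STATS=`df "$CACHE_PATH" | awk 'NR>1{print}'`
SIZE=`echo $STATS | awk '{print $2}'`
FREE=`echo $STATS | awk '{print $4}'`
TARGET_FREE=`expr $SIZE \* $TARGET_FREE_PERCENT / 100`
BYTES_TO_DELETE=`expr \( $TARGET_FREE - $FREE \) \* 1024`

# Do nothing at all unless necessary
if [ "$BYTES_TO_DELETE" -le 0 ]; then
    exit 0
fi

ALL_FILES=`mktemp`
FILES_TO_DELETE=`mktemp`
trap 'rm -f "$ALL_FILES" "$FILES_TO_DELETE"' EXIT

# All files, oldest access time first, with path and size in bytes
find "$CACHE_PATH" -type f -printf "%A+::%p::%s\n" | sort > "$ALL_FILES"

# Take files from the top of the list until enough bytes are accounted for
awk -v bytesToDelete="$BYTES_TO_DELETE" -F "::" '
    BEGIN { bytesDeleted=0; }
    {
        if (bytesDeleted < bytesToDelete) { print $2; }
        bytesDeleted += $3;
    }
' "$ALL_FILES" > "$FILES_TO_DELETE"

if [ "$DRY_RUN" -eq 1 ]; then
    NUM_FILES=`wc -l < "$FILES_TO_DELETE"`
    echo "Would delete $NUM_FILES files to free $BYTES_TO_DELETE bytes:"
    cat "$FILES_TO_DELETE"
    exit 0
fi

xargs -r -d '\n' rm -f < "$FILES_TO_DELETE"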

It’s functionally the same as the steps described above (i.e. each step is done separately, mostly to make it easier to understand and/or see what the script will do, or did do), with a few niceties thrown in (such as a dry-run mode, and doing nothing at all unless necessary), and it makes a handy crontab addition:

# Ensure there's 30% free space on the drive every hour
0 * * * * /usr/local/sbin/purgePath /some/path

Conclusion

It would be very easy to lazily continue to use a “delete anything not accessed in x days” approach, as I’m sure I’ve done many times in the past for similar scenarios - but using tools available on any Linux system, and a few Useless uses of cat, any folder used as a file cache can now be used optimally without the risk of swallowing all disk space and killing the server.
