One project I take care of creates a large number of sizable files (which can be reproduced on demand). Each file takes about a minute to create and is typically only accessed once. However, some files are more popular than others and, once created, are accessed repeatedly. In the early days of this project I had 1TB of space available, didn’t think about what could happen and left the project to run. Gradually all of that space was eaten up by these files, eventually leading to the question:
How to efficiently store as many files as possible without running out of space?
If you’re familiar with Linux commands the answer may be obvious, but if you’re not (or maybe even if you are and it’s pre-coffee time), this is what I’d like to document today.
In the beginning
Let’s call all these files what they are: a file cache. They are files which can be reproduced (at a cost: time) but are left lying around to reduce or prevent duplicate effort. When this file cache occupies 1% of the available space on a server it’s hardly important to think about how to get rid of them; when it occupies 100% of the available space it’s a different matter.
In my own case the rise from 1% to, say, 50% was slow - it took months. However, the rise from 50% to 100% happened overnight, caught me a little by surprise and caused “some problems”, which were quickly solved with a call to:
# Delete all files not accessed in a couple of weeks
find /some/path -type f -atime +14 | xargs rm
Pop that in a daily crontab and done. Fixed. Time goes by and whuh-oh the server has had another blip of traffic and it’s filling up with files again faster than they can be deleted. So 2 weeks is too long, let’s go for 1 week:
# Delete all files not accessed in a week
find /some/path -type f -atime +7 | xargs rm
Done. Fixed. Whuh-oh, holy moly, another blip of traffic… going to have to get more space and change that crontab to a much shorter window…
# Delete all files not accessed in 2 days
find /some/path -type f -atime +2 | xargs rm
Done. Fixed.
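As an aside, if any of the cached filenames could ever contain spaces or other awkward characters, a null-delimited variant of the same one-liner is a little safer. A sketch, assuming GNU find and xargs:
# Delete all files not accessed in 2 days, coping with awkward filenames
find /some/path -type f -atime +2 -print0 | xargs -0 -r rm
(GNU find also has a -delete action which does away with the pipe entirely.)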
Boo, traffic has gone down again and now there’s 95% free space available. It’s not necessary to be so aggressive deleting those cached files - but increasing the time they are allowed to hang around risks filling the disk and running out of space again.
There has to be a better way.
A better way
After some thought on the matter this was the process I needed:
- Determine which physical drive a folder is on
- Derive the amount of free space needed as a % of drive size
- Construct a list of files ordered by last access time (oldest first)
- Construct a list of files to delete
- Delete files from that list until free space reaches the threshold
Each of those tasks is very simple, and some googling will turn up a few useful results and plenty that probably aren’t.
Determine which physical drive a folder is on
I’ve used df countless times yet didn’t know until addressing this task that you can use it to tell you which drive a folder is on:
$ df -h /some/path/
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 2.0T 1.4T 494G 75% /some/path
Well that was easy!
We won’t be wanting human-readable output, so drop the -h flag; to strip off the first (header) line of the response, we can simply use awk:
$ df /some/path/ | awk 'NR>1{print}'
/dev/sdb1 2113786796 1488804468 517608200 75% /some/path
Derive the amount of free space needed as a % of drive size
The previous command gives a result which is easily parsed. One thing to note: without -h, df reports sizes in 1K (1024-byte) blocks rather than bytes.
To get the drive size (in 1K blocks):
$ df /some/path/ | awk 'NR>1{print}' | awk '{print $2}'
2113786796
To get the current free space (also in 1K blocks):
$ df /some/path/ | awk 'NR>1{print}' | awk '{print $4}'
517608200
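Incidentally, recent versions of GNU df can be told exactly which columns to print, which makes the parsing less dependent on column positions. A variant, assuming a df new enough to support --output:
$ df --output=size,avail /some/path/ | awk 'NR>1{print}'
That gives just the size and available columns, still in 1K blocks.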
Let’s say we want 30% of the drive free; using a couple of variables, that becomes:
$ STATS=`df /some/path/ | awk 'NR>1{print}'`
$ SIZE=`echo $STATS | awk '{print $2}'`
$ FREE=`echo $STATS | awk '{print $4}'`
$ TARGET_FREE=`expr $SIZE \* 3 / 10`
$ echo $TARGET_FREE
634136038
How much space needs to be freed? df gave us 1K blocks, while find (used in the next step) reports file sizes in bytes, so convert at the same time:
$ BYTES_TO_DELETE=`expr \( $TARGET_FREE - $FREE \) \* 1024`
$ echo $BYTES_TO_DELETE
119324506112
Rock and roll, that’s 119,324,506,112 bytes - roughly 119GB - to clear out today.
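If the shell running all of this is bash rather than plain sh, the same sums can be done with arithmetic expansion instead of expr - purely a stylistic alternative, using the variables already set above:
$ TARGET_FREE=$(( SIZE * 3 / 10 ))
$ BYTES_TO_DELETE=$(( (TARGET_FREE - FREE) * 1024 ))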
Construct a list of files ordered by last access time
This isn’t quite as simple as it sounds if you want to do it efficiently. However, a read of find’s man page yields the following:
$ # find -printf placeholders:
$ # %A+ - Last access time, YYYY-MM-DD+HH:MM:SS.N
$ # %p  - File path
$ # %s  - Size in bytes
$ find /some/path -type f -printf "%A+::%p::%s\n"
2015-01-04+00:28:05.7512236840::/some/path/52/52/some-file.zip::3803767
...
Pipe that through sort and we have a list of files ordered by last access time (the timestamp format sorts lexicographically), along with each file’s path and size in bytes:
$ find /some/path -type f -printf "%A+::%p::%s\n" | sort > /tmp/all-files
Construct a list of files to delete
With a parsable list of files there are many ways in which to trim it down to just the files to delete - one intriguing post I found did this with awk, and adapting that to this use case was easy:
$ cat /tmp/all-files | awk -v bytesToDelete="$BYTES_TO_DELETE" -F "::" '
    BEGIN { bytesDeleted=0; }
    {
        # Keep printing paths (field 2) until enough bytes (field 3) are marked for deletion
        if (bytesDeleted < bytesToDelete) { print $2; bytesDeleted += $3; }
    }
' > /tmp/files-to-delete
Looking at the contents of /tmp/files-to-delete, it should be the same list of files as the input, truncated at the point where enough space would be freed.
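A quick sanity check before deleting anything never hurts - for instance, counting the candidates and eyeballing the oldest few:
$ wc -l /tmp/files-to-delete
$ head -n 5 /tmp/files-to-delete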
Delete files until free space reaches the threshold
The final step is very simple:
$ cat /tmp/files-to-delete | xargs rm
Tada, free space is now 30% of the drive.
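As with the earlier one-liners, if any cached paths might contain spaces a slightly more careful variant is worth having. A sketch assuming GNU xargs, treating each line of the list as one argument:
$ xargs -d '\n' -r rm -f < /tmp/files-to-delete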
Complete script
If you’d like a drop-in script that does what is described here: it’s functionally the same as the steps above (i.e. each step is done as its own discrete command, mostly to make it easier to understand and/or see what the script will do or did do), with a few niceties thrown in (such as a dry-run mode, and doing nothing at all unless necessary), and it makes a handy crontab addition:
# Ensure there's 30% free space on the drive every hour
0 * * * * /usr/local/sbin/purgePath /some/path
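In outline, such a script looks something like this - a sketch rather than the full thing, with the 30% target hard-coded, fixed temp files and only a bare-bones --dry-run flag:
#!/bin/sh
# purgePath (sketch) - keep a cache folder from swallowing its drive
# Usage: purgePath /some/path [--dry-run]
# Assumes GNU df/find/xargs/awk and a target of 30% free space.

CACHE_PATH="$1"
[ -d "$CACHE_PATH" ] || exit 1

# Drive size and free space, in 1K blocks
STATS=`df "$CACHE_PATH" | awk 'NR>1{print}'`
SIZE=`echo $STATS | awk '{print $2}'`
FREE=`echo $STATS | awk '{print $4}'`
TARGET_FREE=`expr $SIZE \* 3 / 10`

# Do nothing at all unless necessary
if [ "$FREE" -ge "$TARGET_FREE" ]; then
    exit 0
fi

# Shortfall in bytes, since find reports file sizes in bytes
BYTES_TO_DELETE=`expr \( $TARGET_FREE - $FREE \) \* 1024`

# All cache files, oldest access first, trimmed to just enough to hit the target
find "$CACHE_PATH" -type f -printf "%A+::%p::%s\n" | sort > /tmp/all-files
awk -v bytesToDelete="$BYTES_TO_DELETE" -F "::" '
    BEGIN { bytesDeleted=0; }
    { if (bytesDeleted < bytesToDelete) { print $2; bytesDeleted += $3; } }
' /tmp/all-files > /tmp/files-to-delete

if [ "$2" = "--dry-run" ]; then
    cat /tmp/files-to-delete
else
    xargs -d '\n' -r rm -f < /tmp/files-to-delete
fi
Drop something like that in /usr/local/sbin as purgePath, make it executable, and the crontab line above does the rest.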
Conclusion
It would be very easy to lazily keep using a “delete anything not accessed in x days” approach, as I’m sure I’ve done many times in the past for similar scenarios - but using tools available on any Linux system (and a few Useless Uses of cat), any folder used as a file cache can now be kept at an optimal size without the risk of it swallowing all disk space and killing a server.