Interactive R for File System Stats
I recently wrote guidelines for others to use when developing policy and procedure documents. Part of these guidelines addressed image use. I’ve learned from experience that it is far too easy to end up with a massive document file if the images used were too large for the document. I’ve also found the easiest way to check for this—or other things taking up too much space in a file—is to look at the file size.
The problem I ran into is that I can eyeball a file size and know if it’s “not right”, but I need to give a size number for others who don’t have as much experience. Thankfully, I have a reasonable sample set: the source files for our large (over 400 pages) clinical protocol set. These documents are reasonably similar to most policy and procedure documents; some are single page, others are more involved, and while mostly text, there are images throughout.
The primary challenge was getting the dataset into a crunchable format. The rest is relatively easy number crunching.
Approach One
My first approach, which did work, was cumbersome and complex. I wouldn’t recommend it, but wanted to share it anyway in case pieces of it might have value.
I started with a shell command (in Zsh, on macOS):
ls -lR | pbcopy
Here, ls lists directory contents; the -l flag shows the long listing with all of the file info, and the -R flag recurses through all the subdirectories. I piped this into the pbcopy command, which placed the output on the clipboard, and then pasted it into Sublime Text (my text editor of choice).
The result looked something like this:
total 29280
drwxr-xr-x 7 samuelkordik staff 224 Apr 28 2021 1 - Introduction
drwxr-xr-x 11 samuelkordik staff 352 Apr 29 2021 10 - Appendices
drwxr-xr-x 19 samuelkordik staff 608 Apr 26 2021 2 - Clinical Policies
drwxr-xr-x 18 samuelkordik staff 576 Dec 14 2020 3 - Operational Policies
drwxr-xr-x 9 samuelkordik staff 288 Dec 13 2020 4 - Transport Destination Determination
...
./1 - Introduction:
total 872
-rwxr-xr-x@ 1 samuelkordik staff 131297 Jun 18 2020 1 - Introduction.docx
-rwxr-xr-x 1 samuelkordik staff 31051 Jun 18 2020 2 - Signature Page.docx
-rwxr-xr-x@ 1 samuelkordik staff 37465 Jun 18 2020 3 - Delegation.docx
-rwxr-xr-x 1 samuelkordik staff 189106 Jun 18 2020 3 - Delegation.pdf
-rwxr-xr-x 1 samuelkordik staff 44044 Jun 18 2020 4 - Definitions.docx
...
One of my favorite tools in Sublime Text is the powerful regex find-and-replace functionality. Regex (short for “regular expression”) is a concise way to define a pattern to look for in text. Think of it as wildcards on steroids. I was really only interested in the Word files; the fastest way to get just those was to “find all” (using a regex pattern), copy, then paste into a new document. This simple pattern did the trick:
^.*\.docx$
In this pattern, the “^” and “$” mark the beginning and end of a line. The “.” matches any character except a newline, and the “*” means any number of them (up until the .docx extension).
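For anyone who would rather stay in R than use an editor, the same filter can be sketched with base R's grep(). This is only an illustration: the listing vector is a hand-copied sample of the ls output shown earlier, and the variable names are made up for the example.

```r
# Sample lines in the shape of the `ls -lR` output shown earlier.
listing <- c(
  "total 872",
  "-rwxr-xr-x@ 1 samuelkordik staff 131297 Jun 18 2020 1 - Introduction.docx",
  "-rwxr-xr-x 1 samuelkordik staff 189106 Jun 18 2020 3 - Delegation.pdf",
  "-rwxr-xr-x 1 samuelkordik staff 44044 Jun 18 2020 4 - Definitions.docx"
)

# Keep only the lines that end in ".docx". The backslash is doubled
# because R string literals need it escaped.
docx_lines <- grep("^.*\\.docx$", listing, value = TRUE)
print(docx_lines)
```

With value = TRUE, grep() returns the matching lines themselves rather than their positions, which mirrors the "find all, copy, paste" step.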
Once in the new file, I used a different pattern to find and replace:
^.*staff\s*(\d*).*\d{2}\s+\d{4} (.*\.docx)
In regex, putting items in parentheses defines capture groups, which can then be referenced in the replacement. The first group, (\d*), grabs the size; the second group, (.*\.docx), gets the filename. The first section of the pattern, ^.*staff\s*, matches everything up through “staff” and the spaces that follow it. Next comes the size group. After that, .*\d{2}\s+\d{4} skips past the month and then matches a two-digit number (the day) and a four-digit number (the year) separated by whitespace. A single space and the filename group finish the pattern.
My replacement was $2\t$1, which puts the second group (the filename) first, then a tab, then the first group (the size). The tab splits the data into two columns when I paste it into Excel.
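The find-and-replace above can also be tried out in base R with sub(). Treat this as a rough sketch: the two input lines are copied from the sample listing, perl = TRUE is my choice so that R reads the pattern as a Perl-style regex, and R writes backreferences as \\2 and \\1 where Sublime Text uses $2 and $1.

```r
# Two .docx lines copied from the sample listing.
docx_lines <- c(
  "-rwxr-xr-x@ 1 samuelkordik staff 131297 Jun 18 2020 1 - Introduction.docx",
  "-rwxr-xr-x 1 samuelkordik staff 44044 Jun 18 2020 4 - Definitions.docx"
)

# Group 1 captures the size, group 2 captures the filename.
pattern <- "^.*staff\\s*(\\d*).*\\d{2}\\s+\\d{4} (.*\\.docx)"

# Emit "filename<TAB>size" for each line.
tsv <- sub(pattern, "\\2\t\\1", docx_lines, perl = TRUE)
cat(tsv, sep = "\n")
```

Each resulting string has the filename, a tab, and the size in bytes, ready to paste into a spreadsheet.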
Once in Excel, I could use standard stats formulas to look at the distribution. That’s a complex process with a lot of moving parts; plus, Excel didn’t really give me what I was looking for in a timely manner. Enter Approach Two.
Approach Two
After some time, I realized a better way to do this would be using the computing power of R paired with the useful fs package. For this, I created a shell script with a set of piped R functions that resulted in a far more useful data table. In addition to a wide range of descriptive statistics, this approach also used the fs::as_fs_bytes function to get human-readable sizes.
Here’s the shell script, with comments explaining each line:
#!/usr/bin/env Rscript --vanilla
fs::dir_info(here::here(), type = "file", recurse = TRUE) |> # get file info. Same info as `ls -lR`, but in a data frame.
  dplyr::mutate(ext = fs::path_ext(path)) |> # add a column with just the file extension
  dplyr::group_by(ext) |> # group by this column to get a summary row for each extension type.
  dplyr::summarize(n = dplyr::n(), # raw count of files
                   min = min(size), # minimum size
                   q25 = quantile(size, probs = .25), # 25th percentile
                   med = median(size), # median (50th percentile)
                   q75 = quantile(size, probs = .75), # 75th percentile
                   max = max(size), # maximum
                   mean = fs::as_fs_bytes(mean(size)), # mean (average)
                   sd = fs::as_fs_bytes(sd(size))) |> # standard deviation
  dplyr::mutate(sd_1 = mean + sd, sd_2 = mean + sd*2, sd_3 = mean + sd*3) |> # add one, two, and three standard deviations above the mean
  dplyr::arrange(dplyr::desc(n)) |> # sort in descending order of the count.
  print.data.frame(sigfig = 2)
The result of this was a very readable table full of useful statistics:
ext n min q25 med q75 max mean sd sd_1
1 docx 577 18.08K 34.24K 35.98K 38.01K 35.52M 361.99K 2.31M 2.66M
2 pdf 297 50.85K 112.72K 137.51K 452.85K 128.71M 1.63M 10.68M 12.31M
3 jpg 24 52.34K 291.33K 384.71K 471.34K 531.1K 365.52K 124.26K 489.78K
4 dotx 10 28.51K 36.26K 72.41K 86.28K 86.37K 63.49K 26.41K 89.89K
5 pptx 6 517.79K 1.32M 5.66M 14.23M 28.44M 9.54M 11.05M 20.59M
6 zip 5 35.13M 38.21M 76.48M 83.03M 178.49M 82.27M 58M 140.27M
7 xlsx 4 22.33K 22.44K 22.59K 42.56K 102.14K 42.41K 39.82K 82.23K
8 wmv 3 81.73M 85.06M 88.4M 138.02M 187.63M 119.25M 59.31M 178.56M
9 png 2 205.95K 205.95K 205.95K 205.95K 205.95K 205.95K 0 205.95K
10 pub 2 150.5K 150.62K 150.75K 150.88K 151K 150.75K 362.04 151.1K
11 xls 2 36K 73K 110K 147K 184K 110K 104.65K 214.65K
12 mp4 1 9.17M 9.17M 9.17M 9.17M 9.17M 9.17M NA NA
13 svg 1 107.99K 107.99K 107.99K 107.99K 107.99K 107.99K NA NA
sd_2 sd_3
1 4.98M 7.29M
2 23M 33.68M
3 614.04K 738.3K
4 116.3K 142.71K
5 31.64M 42.69M
6 198.26M 256.26M
7 122.05K 161.87K
8 237.87M 297.18M
9 205.95K 205.95K
10 151.46K 151.81K
11 319.3K 423.96K
12 NA NA
13 NA NA
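For readers without fs and dplyr installed, roughly the same per-extension summary can be sketched in base R. This is not the script above, just an approximation under a few assumptions: it runs against a throwaway demo directory with made-up file sizes, the helper summarize_sizes is a name invented for the example, and sizes stay in raw bytes rather than the human-readable format fs provides.

```r
# Build a throwaway directory with files of known sizes (illustrative only).
demo_dir <- file.path(tempdir(), "fs-stats-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(raw(1000), file.path(demo_dir, "a.docx"))
writeBin(raw(3000), file.path(demo_dir, "b.docx"))
writeBin(raw(5000), file.path(demo_dir, "c.pdf"))

# Collect every file's extension and size in bytes.
paths <- list.files(demo_dir, recursive = TRUE, full.names = TRUE)
files <- data.frame(ext = tools::file_ext(paths), size = file.size(paths))

# Hypothetical helper: descriptive statistics for one vector of sizes.
summarize_sizes <- function(size) {
  c(n    = length(size),
    min  = min(size),
    q25  = unname(quantile(size, 0.25)),
    med  = median(size),
    q75  = unname(quantile(size, 0.75)),
    max  = max(size),
    mean = mean(size),
    sd   = sd(size))  # NA when an extension has only one file
}

# One row per extension, then the mean plus one, two, and three SDs.
stats <- as.data.frame(
  do.call(rbind, lapply(split(files$size, files$ext), summarize_sizes))
)
stats$sd_1 <- stats$mean + stats$sd
stats$sd_2 <- stats$mean + 2 * stats$sd
stats$sd_3 <- stats$mean + 3 * stats$sd

# Sort by file count, most common extension first.
stats <- stats[order(-stats$n), ]
print(stats)
```

As in the table above, an extension with a single file gets NA for the standard deviation and for every column derived from it.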