July 16, 2022
Today I finally need to extract information from files delivered in PDF format. I have heard good things about the `tabulizer` package, so let's try it out.
Whelp, it turned out that I needed a 64-bit installation of Java before I could install `tabulizer`. Also, I used the `remotes::install_github(...)` command from [the package’s GitHub page](https://github.com/ropensci/tabulizer) to force the installation (there appear to be issues with installing a package through CRAN when it has a Java dependency).
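Before installing, a quick look at the R session can save some frustration. This is only a sketch, assuming that the Java installation needs to match the architecture of the R session; the variable names are my own, and an empty result for Java just means R cannot see it from here.

```r
# Pointers are 8 bytes wide in a 64-bit build of R
r_is_64bit <- .Machine$sizeof.pointer == 8L

# Where (if anywhere) R can see Java; either value may be an empty string
java_on_path <- unname(Sys.which("java"))
java_home <- Sys.getenv("JAVA_HOME")

message("64-bit R: ", r_is_64bit)
message("java on PATH: ", if (nzchar(java_on_path)) java_on_path else "(not found)")
message("JAVA_HOME: ", if (nzchar(java_home)) java_home else "(not set)")
```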
```r
# load packages
library("tabulizer")
library("tidyverse")
```
Now let’s try to load the data. Today, I am using anonymized aggregate data from my classroom, but I am still not going to provide a public version of the data file for the blog post.
```r
# https://cran.r-hub.io/web/packages/tabulizer/vignettes/tabulizer.html
raw_data <- tabulizer::extract_tables("report-UCMerc-Fall2021Pre-2022-02-07.pdf")
```
So far, the underlying algorithm (the Tabula library) is finding some of the information, but perhaps we need to be more specific. It looks like I will need to tell it where on the pages the tables are. Fortunately, the `locate_areas()` function in the `tabulizer` package runs an interactive app inside RStudio to quickly extract the bounding-box coordinates that we need.
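Here is a rough sketch of that step. The call only runs in an interactive session with the PDF present; otherwise a stand-in value (the first bounding box used later in this post, with names I added for readability) keeps the example runnable. The `picked` and `areas` names are my own, and the rounding is just a convenience for copy-pasting.

```r
pdf_file <- "report-UCMerc-Fall2021Pre-2022-02-07.pdf"

if (interactive() && file.exists(pdf_file) &&
    requireNamespace("tabulizer", quietly = TRUE)) {
  # Opens the selection app; draw one box around the table on page 1
  picked <- tabulizer::locate_areas(pdf_file, pages = 1)
} else {
  # Stand-in so the sketch runs without the PDF or the package
  picked <- list(c(top = 597, left = 139, bottom = 658, right = 477))
}

# Round to whole numbers and drop the names for easy copy-pasting
areas <- lapply(picked, function(a) unname(round(a)))
dput(areas)
```

Each box is a length-4 vector in the order top, left, bottom, right, which is the order the `area` argument of `extract_tables()` expects.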
```r
# https://firstname.lastname@example.org/how-to-extract-pdf-tables-in-r-e994c0fe4e28
areas_list <- list(
  c(597, 139, 658, 477),
  c(616, 76, 735, 539),
  c(67, 76, 127, 539),
  c(170, 73, 301, 542),
  c(414, 73, 588, 539)
)

raw_data <- tabulizer::extract_tables(
  "report-UCMerc-Fall2021Pre-2022-02-07.pdf",
  pages = c(1, 2, 3, 4, 4), # there were two tables on page 4
  area = areas_list,
  guess = FALSE,
  output = "data.frame"
)
```
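Since the page numbers and bounding boxes are typed in by hand, a small sanity check before calling `extract_tables()` can catch a slipped digit. This is a minimal sketch: `is_valid_area()` is a hypothetical helper of my own, and it assumes each box is given as top, left, bottom, right with one page number per box.

```r
areas_list <- list(
  c(597, 139, 658, 477),
  c(616, 76, 735, 539),
  c(67, 76, 127, 539),
  c(170, 73, 301, 542),
  c(414, 73, 588, 539)
)
pages <- c(1, 2, 3, 4, 4)

# Hypothetical helper: TRUE when a box is c(top, left, bottom, right)
# with top above bottom and left before right
is_valid_area <- function(a) {
  is.numeric(a) && length(a) == 4 && a[1] < a[3] && a[2] < a[4]
}

stopifnot(
  length(pages) == length(areas_list), # one page number per box
  all(vapply(areas_list, is_valid_area, logical(1)))
)
```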
Success! The tables I want are now stored in `raw_data`, in the `data.frame` format that I like.
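With `output = "data.frame"`, the result should be a list with one `data.frame` per extracted table. The sketch below shows one way to label and peek at such a list. It uses a made-up stand-in (`tables_demo`) rather than my classroom data, and the element names are hypothetical.

```r
# Made-up stand-in for the extracted list (two tiny tables, not real data)
tables_demo <- list(
  data.frame(item = c("A", "B"), score = c(1, 2)),
  data.frame(item = c("C", "D", "E"), score = c(3, 4, 5))
)

# Hypothetical labels, one per extracted table
names(tables_demo) <- c("table_1", "table_2")

# Rows and columns of each table
sapply(tables_demo, dim)

# Pull one table out by name; it is an ordinary data.frame
first_table <- tables_demo[["table_1"]]
str(first_table)
```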
In my next blog post, I shall recombine the data and start performing some calculations.