Extracting Tables from a PDF

July 16, 2022

Today I finally find myself needing to extract information from files delivered in PDF format. I have heard good things about the tabulizer package, so let’s try it out.

Whelp, it turned out that I needed a 64-bit installation of Java before I could install tabulizer. I also used the remotes::install_github(...) command from [the package’s GitHub page](https://github.com/ropensci/tabulizer) to force the installation (there appear to be issues with installing packages that depend on Java through CRAN).
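
For anyone following along, here is a rough sketch of that setup step. The arguments I actually passed to remotes::install_github(...) are not reproduced here, so the plain "ropensci/tabulizer" slug (taken from the GitHub URL above) is an assumption, and install_tabulizer_if_missing() is a hypothetical helper name rather than anything from the package. The pointer-size check only reports whether R itself is 64-bit; Java generally has to match that architecture.

```r
# TRUE when this R session is 64-bit (pointers are 8 bytes wide).
r_is_64bit <- .Machine$sizeof.pointer == 8L

# Hypothetical helper: install tabulizer from GitHub only when it is missing.
# Assumption: the plain repository slug is enough; your system may need
# extra arguments, so check the package's GitHub page.
install_tabulizer_if_missing <- function() {
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    if (!requireNamespace("remotes", quietly = TRUE)) {
      install.packages("remotes")
    }
    remotes::install_github("ropensci/tabulizer")
  }
  invisible(requireNamespace("tabulizer", quietly = TRUE))
}
```

Defining the function does not install anything; call install_tabulizer_if_missing() once a matching Java installation is in place.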

# load packages
library(tabulizer)

Now let’s try to load the data. Today, I am using anonymized aggregate data from my classroom, but I am still not going to provide a public version of the data file for the blog post.

# https://cran.r-hub.io/web/packages/tabulizer/vignettes/tabulizer.html
raw_data <- tabulizer::extract_tables("report-UCMerc-Fall2021Pre-2022-02-07.pdf")

So far, the default extraction (which relies on the Tabula library under the hood) is finding some of the information, but we need to be more specific: I will have to tell the algorithm where on the pages the tables are. Fortunately, the locate_areas() function in the tabulizer package runs an interactive app inside RStudio to quickly capture the bounding-box coordinates (top, left, bottom, right) that we need.
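
As a minimal sketch of that interactive step (assuming tabulizer is installed and the PDF sits in the working directory), the call could look like the following. The guard is my own addition so that the snippet does nothing when run non-interactively or without the file; picked_areas is just an illustrative variable name.

```r
pdf_file <- "report-UCMerc-Fall2021Pre-2022-02-07.pdf"
picked_areas <- NULL

# locate_areas() opens an interactive selector, so only call it in an
# interactive session where the package and the file are available.
if (interactive() &&
    requireNamespace("tabulizer", quietly = TRUE) &&
    file.exists(pdf_file)) {
  # Draw a box around the table on page 1; repeat for the other pages.
  picked_areas <- tabulizer::locate_areas(pdf_file, pages = 1)
  print(picked_areas)
}
```

The printed coordinates can then be copied into a list of c(top, left, bottom, right) vectors.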

# https://medium.com/@ketanrd.009/how-to-extract-pdf-tables-in-r-e994c0fe4e28

# each area is c(top, left, bottom, right)
areas_list <- list(
  c(597, 139, 658, 477),
  c(616, 76, 735, 539),
  c(67, 76, 127, 539),
  c(170, 73, 301, 542),
  c(414, 73, 588, 539)
)

raw_data <- tabulizer::extract_tables(
  "report-UCMerc-Fall2021Pre-2022-02-07.pdf",
  pages = c(1, 2, 3, 4, 4), # there were two tables on page 4
  area = areas_list, guess = FALSE,
  output = "data.frame")
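
Since the pages and area arguments are paired up entry by entry, a quick sanity check before extracting can catch a mistyped coordinate. The sketch below assumes the c(top, left, bottom, right) ordering; areas_look_valid() is a hypothetical helper of my own, not a tabulizer function.

```r
areas_list <- list(
  c(597, 139, 658, 477),
  c(616, 76, 735, 539),
  c(67, 76, 127, 539),
  c(170, 73, 301, 542),
  c(414, 73, 588, 539)
)
pages <- c(1, 2, 3, 4, 4)

# Hypothetical helper: TRUE when there is one area per page entry and
# every area has top < bottom and left < right.
areas_look_valid <- function(areas, pages) {
  same_length <- length(areas) == length(pages)
  well_formed <- vapply(areas, function(a) {
    is.numeric(a) && length(a) == 4 && a[1] < a[3] && a[2] < a[4]
  }, logical(1))
  same_length && all(well_formed)
}

areas_look_valid(areas_list, pages)
```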

Success! Now the tables I want are in the following list elements:

  • raw_data[[1]]
  • raw_data[[2]]
  • raw_data[[3]]
  • raw_data[[4]]
  • raw_data[[5]]

and are in the data.frame format that I like.
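
A list of data frames like this is easier to work with once the elements have names and you can see their shapes at a glance. Here is a small sketch of that idea. Because I am not sharing the real file, the raw_data below is a made-up placeholder with two tiny tables, and the table_1, table_2 names are purely illustrative.

```r
# Placeholder standing in for the output of extract_tables(); NOT real data.
raw_data <- list(
  data.frame(item = c("a", "b"), count = c(1, 2)),
  data.frame(item = "c", count = 3)
)

# Give each table a name so it can be referenced as raw_data$table_1, etc.
names(raw_data) <- paste0("table_", seq_along(raw_data))

# One row per table, with its number of rows and columns.
table_shapes <- t(vapply(raw_data, dim, integer(2)))
colnames(table_shapes) <- c("rows", "cols")
print(table_shapes)
```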

In my next blog post, I shall recombine the data and start performing some calculations.