July 16, 2022

Today I finally find myself needing to extract information from files delivered in the PDF format. I have heard good things about the `tabulizer` package, so we will try that out today.

Whelp, it turned out that I needed to ensure a 64-bit installation of `Java` before I could install `tabulizer`. Also, I used the `remotes::install_github(...)` command from [the package’s GitHub page](https://github.com/ropensci/tabulizer) to force the installation (as there appear to be issues with installing a package through CRAN when it depends on Java).
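Since the 64-bit Java requirement was the sticking point, here is a minimal sketch of a check one could run before attempting the install. It uses only base R. The variable names are my own, and the assumption that a 64-bit JVM mentions "64-Bit" in its `java -version` output holds for many vendors but is not guaranteed, so treat the last line as a hint rather than proof.

```r
# Is this R session 64-bit? Pointers are 8 bytes on a 64-bit build.
r_is_64bit <- .Machine$sizeof.pointer == 8

# Where (if anywhere) does the system find a Java executable?
java_path <- unname(Sys.which("java"))

# Ask Java to describe itself; skipped when no Java is on the PATH.
java_info <- character(0)
if (nzchar(java_path)) {
  java_info <- tryCatch(
    suppressWarnings(
      system2(java_path, "-version", stdout = TRUE, stderr = TRUE)
    ),
    error = function(e) character(0)
  )
}

# Assumption: a 64-bit JVM mentions "64-Bit" in its version banner.
java_looks_64bit <- any(grepl("64-Bit", java_info, fixed = TRUE))

message("R is 64-bit: ", r_is_64bit)
message("Java found at: ", if (nzchar(java_path)) java_path else "(not found)")
message("Java reports 64-bit: ", java_looks_64bit)
```

If R is 64-bit but Java is missing or 32-bit, that mismatch is the first thing I would fix before retrying the installation.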

```
# load packages
library("tabulizer")
library("tidyverse")
```

Now let’s try to load the data. Today, I am using anonymized aggregate data from my classroom, but I am still not going to provide a public version of the data file for the blog post.

```
# https://cran.r-hub.io/web/packages/tabulizer/vignettes/tabulizer.html
raw_data <- tabulizer::extract_tables("report-UCMerc-Fall2021Pre-2022-02-07.pdf")
```
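To judge how well a first pass like this worked, it helps to look at how many tables came back and how large each one is. The sketch below is one way to do that; `summarize_tables()` is a hypothetical helper of my own, and `demo_tables` is a made-up stand-in for the extraction result (as far as I know, `extract_tables()` returns a list of character matrices by default), not the classroom data.

```r
# Summarise a list of extracted tables: one row per table, with its dimensions.
summarize_tables <- function(tables) {
  data.frame(
    table = seq_along(tables),
    rows  = vapply(tables, nrow, integer(1)),
    cols  = vapply(tables, ncol, integer(1))
  )
}

# Hypothetical stand-in for an extraction result (not the real data).
demo_tables <- list(
  matrix("", nrow = 3, ncol = 4),
  matrix("", nrow = 10, ncol = 2)
)

summarize_tables(demo_tables)
```

Calling `summarize_tables(raw_data)` on the real result should make it obvious when a table is missing or has been split into odd-sized pieces.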

So far, the algorithm (subroutine: `Tabula`) is finding some of the information, but perhaps we need to be more specific. It looks like I will need to tell the algorithm where on the pages the tables are. Fortunately, the `locate_areas()` function in the `tabulizer` package runs an interactive app inside RStudio to quickly extract the bounding-box pixel values that we need.
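For reference, a hedged sketch of what that interactive step might look like. The guard conditions and the `picked_areas` name are my own additions so that the snippet does nothing harmful in a non-interactive session or when the PDF is absent; I am assuming that `locate_areas()` returns a list with one numeric vector of coordinates per page.

```r
pdf_file <- "report-UCMerc-Fall2021Pre-2022-02-07.pdf"

# Default to an empty list so the script still runs non-interactively.
picked_areas <- list()

# locate_areas() opens an interactive widget, so only call it when that can work.
if (interactive() &&
    requireNamespace("tabulizer", quietly = TRUE) &&
    file.exists(pdf_file)) {
  # Draw one box on page 1; repeat with other page numbers as needed.
  picked_areas <- tabulizer::locate_areas(pdf_file, pages = 1)
}

# Round any numeric coordinates so they are easy to paste into a script.
lapply(picked_areas, function(a) if (is.numeric(a)) round(a) else a)
```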

```
# https://medium.com/@ketanrd.009/how-to-extract-pdf-tables-in-r-e994c0fe4e28
# each area is c(top, left, bottom, right)
areas_list <- list(
  c(597, 139, 658, 477),
  c(616, 76, 735, 539),
  c(67, 76, 127, 539),
  c(170, 73, 301, 542),
  c(414, 73, 588, 539)
)
raw_data <- tabulizer::extract_tables(
  "report-UCMerc-Fall2021Pre-2022-02-07.pdf",
  pages = c(1, 2, 3, 4, 4), # there were two tables on page 4
  area = areas_list,
  guess = FALSE,
  output = "data.frame"
)
```
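Because `pages` and `area` are matched up position by position, a typo in either one is easy to miss. Here is a small sanity check, as a sketch: it assumes each area is ordered `c(top, left, bottom, right)`, and the `_check` names are mine so that nothing above gets overwritten.

```r
areas_check <- list(
  c(597, 139, 658, 477),
  c(616, 76, 735, 539),
  c(67, 76, 127, 539),
  c(170, 73, 301, 542),
  c(414, 73, 588, 539)
)
pages_check <- c(1, 2, 3, 4, 4)

# There must be exactly one area per requested page.
stopifnot(length(areas_check) == length(pages_check))

# Stack the areas into a table with one row per bounding box.
area_table <- as.data.frame(do.call(rbind, areas_check))
names(area_table) <- c("top", "left", "bottom", "right")
area_table$page <- pages_check

# A box is plausible when top is above bottom and left is before right.
area_table$valid <- area_table$top < area_table$bottom &
  area_table$left < area_table$right

area_table
```

Any row with `valid` equal to `FALSE` most likely has two coordinates swapped.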

Success! Now the tables I want are in the variables `raw_data[[1]]`, `raw_data[[2]]`, `raw_data[[3]]`, `raw_data[[4]]`, and `raw_data[[5]]`, and are in the `data.frame` format that I like.
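Numeric positions are easy to mix up later, so one option is to name each list element after the page it came from. This is only a sketch: the naming scheme is my own invention, and `named_demo` is a hypothetical stand-in list of five data frames rather than the real tables.

```r
pages_used <- c(1, 2, 3, 4, 4)

# Number the tables within each page, so page 4 gets tables 1 and 2.
table_index <- ave(pages_used, pages_used, FUN = seq_along)
table_names <- paste0("page", pages_used, "_table", table_index)

# Hypothetical stand-in for the five extracted data frames (not the real data).
named_demo <- replicate(5, data.frame(x = 1), simplify = FALSE)
names(named_demo) <- table_names

# Show the names and confirm every element is a data frame.
names(named_demo)
all(vapply(named_demo, is.data.frame, logical(1)))
```

The same `names(raw_data) <- table_names` assignment would then allow access such as `raw_data[["page4_table2"]]`.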

In my next blog post, I shall recombine the data and start performing some calculations.