Pdftools 1.3

Granville Turley

Aug 4, 2024, 2:16:04 PM

to sibornpupe

Has anyone here used this version of pdftools? I'm trying to install it but always end up with an older version. I want to test whether it fixes an error in a workflow that uses a PDF Input tool running in the Alteryx Gallery.

I used to run this workflow on Alteryx version 2019 with R version 3.5.3. There it searched for that R version, which I could see in the messages. If the pdftools package was installed it returned the output; otherwise it returned an error message.

The end goal is to use the pdftools package to move efficiently through a thousand pages of PDF documents and consistently, and safely, produce a usable dataframe/tibble. I have tried the tabulizer package and the pdf_text() function, but the results were inconsistent. I therefore started working through the pdf_data() function, which I prefer.

For those unfamiliar with the pdf_data() function, it converts a PDF page into a coordinate grid, with the 0,0 coordinate in the upper-left corner of the page. By arranging the rows by their x,y coordinates and then pivoting the document into a wide format, all of the information is displayed as it would be on the page, only with NAs for whitespace.
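A minimal sketch of that arrange-then-pivot idea, in base R on made-up data. The `words` data frame is a hypothetical stand-in for one page of pdf_data() output (only the `x`, `y` and `text` columns are used), and it assumes that words on the same line share exactly the same `y` value, which real pages may only satisfy after rounding.

```r
# Synthetic stand-in for one page of pdf_data() output (hypothetical values).
words <- data.frame(
  x    = c(10, 60, 10, 60, 10),
  y    = c(5, 5, 20, 20, 35),
  text = c("name", "value", "alpha", "1", "beta"),
  stringsAsFactors = FALSE
)

# One row per distinct y, one column per distinct x.
# Cells with no text at that coordinate are filled with NA.
grid <- tapply(words$text, list(y = words$y, x = words$x), paste, collapse = " ")
print(grid)
```

The last row has no entry at x = 60, so that cell comes out as NA, which is the whitespace behaviour described above.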

It would be nice to not use a dozen or so unite functions in order to rename the various columns. I used the janitor package row_to_names() function at one point to convert row 1 to column names, which worked well but maybe someone has a better thought?
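For reference, promoting the first row to column names can also be sketched in base R without janitor. The `wide` data frame below is hypothetical and only stands in for a pivoted page whose first row holds the headers.

```r
# Hypothetical wide table whose first row holds the real column names.
wide <- data.frame(
  V1 = c("name", "alpha", "beta"),
  V2 = c("value", "1", "2"),
  stringsAsFactors = FALSE
)

# Promote row 1 to column names, then drop that row.
names(wide) <- unlist(wide[1, ], use.names = FALSE)
wide <- wide[-1, , drop = FALSE]
rownames(wide) <- NULL
print(wide)
```

This only covers the single-header-row case; headers that span several rows would still need to be merged first.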

The only information I had on the pdf_data() function going into this is from here... -20/ Any additional resources would also be greatly appreciated (apart from the pdftools package help documentation/literature).

I am currently writing a pdf text / table extraction package from scratch in C++ with R bindings, which has required many months and many thousands of lines of code. I started writing it pretty much to do what you are looking to do: reliably extract tabular data from pdfs. I have got it to the point where it can quickly and reliably extract the text from a pdf document, with the associated positions and font of each text element (similar to pdftools).

I assumed that the technical part of reading the xrefs, handling encryption, writing a deflate decompressor, parsing the dictionaries, tokenizing and reading the page description programs would be the real challenges, and that figuring out a general algorithm to extract tabular data was just a detail I would figure out at the end.

For the pdf example you provided, something like the following works fairly well. It falls into the "twiddling parameters" category, and works by cutting the text into columns and rows based on the density function of the x and y co-ordinates of the text elements.
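The poster's own code is not reproduced here. As a rough illustration of the density idea only, the sketch below cuts synthetic x-coordinates into columns at the local minima of their kernel density. The data, the bandwidth `bw = 10` and all variable names are assumptions, and the bandwidth is exactly the kind of parameter that would need twiddling per document.

```r
# Synthetic x-coordinates of text elements forming three columns (made-up data).
x <- c(10, 12, 11, 50, 52, 51, 90, 91, 89)

# Kernel density of the x positions; bw = 10 is a hand-tuned assumption.
d <- density(x, bw = 10)

# Local minima of the density curve mark the gaps between columns.
valleys <- which(diff(sign(diff(d$y))) == 2) + 1
cuts <- d$x[valleys]

# Assign each text element to the column lying between successive cuts.
column <- findInterval(x, cuts) + 1
print(column)
```

The same procedure applied to the y-coordinates would give row assignments.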

I am using R server on Windows and I need to extract the text of thousands of PDF documents (in order to extract specific data). I therefore need to install the "pdftools" package, but when I run the command to install it:

Has this happened to anyone before? I understand that the error comes from "poppler-cpp", so I tried to install the package "poppler-cpp", but I get the following error because it is not available for R version 3.4.1:

I am running R under Anaconda on an Ubuntu 18.04 Precision laptop. I wanted to install readtext (R package) to support some corpus linguistics study and attempted that from the R terminal window. I had already installed quanteda without a problem, so I was surprised when readtext tripped on a pdftools requirement, which in turn tripped on the following:

Package libpoppler-cpp-dev was not found in the pkg-config search path. Perhaps you should add the directory containing `libpoppler-cpp-dev.pc' to the PKG_CONFIG_PATH environment variable. No package 'libpoppler-cpp-dev' found.

I used the PPA with backports of Poppler 0.74.0 for Ubuntu 18.04 (Bionic) recommended by @jeroen in the cited SE post above, but it appears there is some confusion between the Ubuntu proper and Anaconda R, since the R install is looking for poppler-cpp, but libpoppler-cpp-dev seems to be the appropriate target. Because I am running R under Anaconda, my system seems unaware of R:

I realize this is all somewhat of a mess, but I am hoping someone will recognize an obvious problem and tell me what to hack up, e.g., modify the poppler-cpp.pc to point at the new library or the like.

The solution is to remain within the Anaconda/conda environment, since R is running there and this is a bit of an island within the surrounding Ubuntu sea (if the metaphor is not making you seasick). Just do

It is often the case that data is trapped inside PDFs, but thankfully there are ways to extract it from the PDFs. A very nice package for this task is pdftools (Github link) and this blog post will describe some basic functionality from that package.

As you can see, this is a very long character string with some line breaks (the "\n" character). So first, we need to split this string into a character vector by the "\n" character. Also, it might be difficult to see, but the table starts at the line with the following string: "Prevalence of diabetes" and ends with "National response to diabetes". Also, we need to get the name of the country from the text and add it as a column. As you can see, a whole lot of operations are needed, so what I do is put all these operations into a function that I will apply to each element of raw_text:
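The blog's own function is not reproduced here. The base R sketch below only illustrates the steps just listed, on a made-up string. The helper name `extract_table()`, the sample text, the "Country profile:" first line and the choice to exclude the end-marker line are all assumptions.

```r
# Made-up stand-in for one element of raw_text (hypothetical content and layout).
raw <- paste(
  "Country profile: Exampleland",
  "Prevalence of diabetes 7.1%",
  "Overweight 40.2%",
  "National response to diabetes",
  "Policies yes",
  sep = "\n"
)

# Hypothetical helper: split on "\n", keep the lines between the two markers,
# and attach the country name as a column.
extract_table <- function(txt) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  start <- grep("Prevalence of diabetes", lines, fixed = TRUE)[1]
  end   <- grep("National response to diabetes", lines, fixed = TRUE)[1]
  # Assumption: the country name sits on the first line after a fixed prefix.
  country <- sub("^Country profile: ", "", lines[1])
  data.frame(country = country, line = lines[start:(end - 1)],
             stringsAsFactors = FALSE)
}

result <- extract_table(raw)
print(result)
```

With a real raw_text vector, something like `lapply(raw_text, extract_table)` would apply the helper to each document, once the markers and the country pattern are adapted to the actual files.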

Take a closer look at the error messages. If you want to use R packages on Linux, you sometimes need to install a system dependency first. Here it is libcurl4-openssl-dev. Use the following command from a terminal to install this package. Then you should be able to install pdftools from R.

NOTE: the code above only works if you have your working directory set to the folder where you downloaded and extracted the PDF files. A quick way to do this in RStudio is to go to Session...Set Working Directory.

Each element is a vector that contains the text of the PDF file. The length of each vector corresponds to the number of pages in the PDF file. For example, the first vector has length 81 because the first PDF file has 81 pages. We can apply the length() function to each element to see this:
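As a small illustration on made-up data, the list `texts` below stands in for the result of reading two PDF files, with one character vector per file and one element per page. The name and contents are hypothetical.

```r
# Stand-in for the list of per-file page texts (hypothetical content):
# the first "file" has 3 pages, the second has 1.
texts <- list(
  c("page 1 text", "page 2 text", "page 3 text"),
  c("page 1 text")
)

# Number of pages in each file.
pages <- sapply(texts, length)
print(pages)
```

Base R's `lengths(texts)` returns the same counts in a single call.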

And we're pretty much done! The PDF files are now in R, ready to be cleaned up and analyzed. If you want to see what has been read in, you could enter the following in the console, but it's going to produce unpleasant blocks of text littered with escape characters such as \r and \n.

The Corpus() function creates a corpus. The first argument to Corpus() is what we want to use to create the corpus. In this case, it's the vector of PDF files. To do this, we use the URISource() function to indicate that the files vector is a URI source. URI stands for Uniform Resource Identifier. In other words, we're telling the Corpus() function that the vector of file names identifies our resources. The second argument, readerControl, tells Corpus() which reader to use to read in the text from the PDF files. That would be readPDF(), a tm function. The readerControl argument requires a list of control parameters, one of which is reader, so we enter list(reader = readPDF). Finally we save the result to an object called corp.

It turns out that the readPDF() function in the tm package actually creates a function that reads in PDF files. The documentation tells us it uses the pdftools::pdf_text() function as the default, which is the same function we used above (?readPDF).

Now that we have a corpus, we can create a term-document matrix, or TDM for short. A TDM stores counts of terms for each document. The tm package provides a function to create a TDM called TermDocumentMatrix().

The first argument is our corpus. The second argument is a list of control parameters. In our example we tell the function to clean up the corpus before creating the TDM. We tell it to remove punctuation, remove stopwords (e.g., the, of, in, etc.), convert text to lowercase, stem the words, remove numbers, and only count words that appear in at least 3 documents. We save the result to an object called opinions.tdm.
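To make the idea of a term-document matrix concrete, here is a base R sketch on three made-up documents. It is not how the tm package builds its TDM. It only lowercases, strips non-letters, counts terms per document, and keeps terms that occur in at least 3 documents. Stemming and stopword removal are deliberately left out, and all names are hypothetical.

```r
# Three made-up documents (hypothetical content).
docs <- c(
  d1 = "The court will abandon the claim",
  d2 = "Abandon the appeal",
  d3 = "The claim was abandoned; abandon it"
)

# Lowercase, replace anything that is not a letter or space, split into words.
tokens <- lapply(docs, function(d) {
  w <- strsplit(gsub("[^a-z ]", " ", tolower(d)), " +")[[1]]
  w[nzchar(w)]
})

# Rows are terms, columns are documents, cells are counts.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(w) as.integer(table(factor(w, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

# Keep only terms that appear in at least 3 documents.
kept <- tdm[rowSums(tdm > 0) >= 3, , drop = FALSE]
print(kept)
```

Because no stopwords are removed in this sketch, "the" survives the filter alongside "abandon".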

We see words preceded with double quotes and dashes even though we specified removePunctuation = TRUE. We even see a series of dashes being treated as a word. What happened? It appears the pdf_text() function preserved the unicode curly-quotes and em-dashes used in the PDF files.

One way to take care of this is to manually use the removePunctuation() function with tm_map(), both functions in the tm package. The removePunctuation() function has an argument called ucp that, when set to TRUE, will look for unicode punctuation. Here's how we can use it to remove punctuation from the corpus:
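The effect of Unicode-aware punctuation removal can be sketched in base R on a made-up string, without the tm package. This is an approximation for illustration, not tm's own code: it uses the Perl-style character class `\p{P}`, which covers curly quotes and em-dashes as well as ASCII punctuation.

```r
# Made-up text containing curly quotes (U+201C, U+201D) and an em-dash (U+2014).
x <- "\u201cabandon\u201d \u2014 achieve"

# Remove every run of Unicode punctuation characters.
cleaned <- gsub("\\p{P}+", "", x, perl = TRUE)
print(cleaned)
```

The removed em-dash leaves two adjacent spaces behind, so a whitespace clean-up step is usually applied afterwards.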

We see, for example, that the term "abandon" appears in the third PDF file 8 times. Also notice that words have been stemmed. The word "achiev" is the stemmed version of "achieve," "achieved," "achieves," and so on.