I am seeking insights and best practices for data preprocessing and cleaning in PDF documents. I am interested in extracting only the body text content from a PDF and discarding everything else, such as page numbers, footnotes, headers, and footers (see attached image for an example of semantically meaningful sections).

I have noticed that in Microsoft Word, a user can simply drag in a PDF and Word seems to automatically understand which parts are headers, footnotes, etc. I am speculating that Word may be utilizing machine learning techniques to analyze the layout and formatting of the PDF and classify different sections accordingly. Alternatively, Word may be utilizing pre-defined rules or patterns to identify common elements such as headers and footnotes. I know of related techniques for example to extract layout information from receipts and the like (LayoutLM, Xu et al., https://arxiv.org/abs/1912.13318) and tabular data (TableNet, Paliwal et al., https://ieeexplore.ieee.org/document/8978013), but nothing to solve layout extraction in this particular domain.

I am curious to know if there are any techniques or algorithms that can replicate this behavior in Word. Any suggestions or recommendations for data cleaning in PDF documents, would be greatly appreciated.

Image of PDF with semantically meaningful sections

Comments

CatalyzeX_code_bot t1_j2jbyzw wrote on January 1, 2023 at 7:41 PM

#1,259,423

Found relevant code at https://github.com/microsoft/unilm/tree/master/layoutlm + all code implementations here

To opt out from receiving code links, DM me

lopnax t1_j2jt94g wrote on January 1, 2023 at 9:33 PM

#1,259,856

Did you try using PyMuPDF? Maybe you could discard the parts using some RegEx. https://pymupdf.readthedocs.io/en/latest/index.html

Also you you can crop the page to only get certain square with PyPDF2 and extract the text. https://pypdf2.readthedocs.io/en/stable/user/cropping-and-transforming.html

low_effort_shit-post t1_j2ks3qv wrote on January 2, 2023 at 1:39 AM

#1,260,805

I'm a data engineer by trade, usually we say no. Pdf isn't a data type or type of storage it is a print format. Go to the source and ask for the source. Once the $$$ is discussed and an understanding of how much harder pdfs are to work with and maintain a process it only makes sense to grab the data from wherever pdf does.

[deleted] t1_j2lbha5 wrote on January 2, 2023 at 4:11 AM

#1,261,498

[removed]

avatarOfIndifference t1_j2lhflc wrote on January 2, 2023 at 5:03 AM

#1,261,713

Replying to [deleted] (#1,261,498)

This is definitely a chatGPT response

Terrible-List-1653 t1_j2lhtzw wrote on January 2, 2023 at 5:07 AM

#1,261,723

Hoorah! I’ve been testing this across a few sectors that I have 0 experience in. My “experiment” is to see how well AI can answer questions in a group chat with professionals in their field. As an artist/director/entrepreneur, myself and the people I work with are been effected in real-time…wondering what the effect is in other sectors.

SupplyChainNext t1_j2lik4g wrote on January 2, 2023 at 5:14 AM

#1,261,746

As someone who’s done this extensively - I less it was made in word the pdf can be complete gibberish or have massive paragraph / sentence errors. Heck - the OCR can outright misread words or take a 3 letter word and make it 2 with 15 random ascii characters in between.

It’s a crap shoot.

God speed.

30katz t1_j2lilwg wrote on January 2, 2023 at 5:15 AM

#1,261,748

Replying to low_effort_shit-post (#1,260,805)

Our company is stuck with PDF’s but it’s actually not too hard to work with using Amazon’s textract or Adobe Extract API. But maybe that’s a sign that it is hard because the technology is owned by the two biggest tech giants in the space.

Borrowedshorts t1_j2lnqf6 wrote on January 2, 2023 at 6:07 AM

#1,261,912

2023 and we still can't automate working with PDF documents. Sad.

[deleted] t1_j2lv15a wrote on January 2, 2023 at 7:32 AM

#1,262,130

[deleted]

VacuousWaffle t1_j2m3shr wrote on January 2, 2023 at 9:29 AM

#1,262,385

Replying to low_effort_shit-post (#1,260,805)

I remember at a hospital job I was asked to mine text from PDFs that were generated internally after another team spent a few months trying to build a solution themselves. The source data they were trying to mine was already in the data warehouse in a surprisingly well-formatted table.

Disastrous_Elk_6375 t1_j2m49ih wrote on January 2, 2023 at 9:36 AM

#1,262,405

Replying to avatarOfIndifference (#1,261,713)

Oh yeah. "There are several", "some of the most ... include", "it may be necessary", "overall the best approach" are 100% markers of chatgpt that I've seen in most answers that I got.

ai-lover t1_j2m5u8s wrote on January 2, 2023 at 9:57 AM

#1,262,440

You can use a PDF parsing library: There are several libraries available that can help you extract text and data from PDF documents. Some popular ones include pdfminer, PyPDF2, and PDFMiner.

low_effort_shit-post t1_j2mpzz5 wrote on January 2, 2023 at 1:58 PM

#1,263,131

Replying to VacuousWaffle (#1,262,385)

We get pdf feeds all the time with promises that have financial implications to get a proper data feed. Usually we kick the can down the road and when we get the feed just pull it in and it moves along out usual etl process within a day. PDFs are to be ignored

cm_34978 OP t1_j2n0cym wrote on January 2, 2023 at 3:22 PM

#1,263,638

Update for the interested - after trying a few different packages suggested in the comments, I settled on the inelegant, yet functional solution of automating the import of PDFs to Microsoft Word, saving the PDF as a Word file, then using a library to extract only the body text from the Word file.

Definitely not ideal since this will not work on Linux and will only run as fast as Microsoft Word can open, convert, and save them. But it works.

30katz t1_j2n9hpj wrote on January 2, 2023 at 4:27 PM

#1,264,135

Replying to Terrible-List-1653 (#1,261,723)

Dude, stop. No one needs more garbage information. We can use ChatGPT and Google without your help. You’re not being an AI entrepreneur by spamming ChatGPT responses.

niszozz t1_j2nf8j2 wrote on January 2, 2023 at 5:06 PM

#1,264,433

I believe we can open pdfs through Google docs and save as a doc file

marineman4808ny t1_j2nfe7m wrote on January 2, 2023 at 5:07 PM

#1,264,447

pdfminer, PyPDF2, and PDFMiner.

johnnydozenredroses t1_j2njvbx wrote on January 2, 2023 at 5:36 PM

#1,264,666

Replying to Borrowedshorts (#1,261,912)

The PDF standard optimizes portability over everything else, so it throws away a lot of semantic information that later needs to be reconstructed (like "this piece of text is a page number", "this other piece is a caption that is attached to this image", etc).

To complicate matters, there are too many softwares that can be used to generate PDFs and humans have ingenious ways of fitting things onto a Word document or powerpoint page and converting to PDF.

So it's a hard AI problem.

ypanagis t1_j2nkyk0 wrote on January 2, 2023 at 5:43 PM

#1,264,715

Replying to cm_34978 (#1,263,638)

I was about to propose the same. For those who are interested, this seems to work for MacOS, too, but Windows is definitely a goto. A VBA script can also come in handy, for someone to get several PDFs, open them from Word and save as TXT.

cm_34978 OP t1_j2nsi8g wrote on January 2, 2023 at 6:31 PM

#1,265,083

Replying to ypanagis (#1,264,715)

Definitely. With windows, you get the advantage of the win32com library whereas with MacOS, you need need to play with AppleScript, which (in my hands) can be brittle and finicky.

[deleted] t1_j31awa1 wrote on January 5, 2023 at 10:57 AM

#1,286,263

[removed]