Submitted by cm_34978 t3_100rbhp in MachineLearning
I am seeking insights and best practices for data preprocessing and cleaning of PDF documents. I am interested in extracting only the body text content from a PDF and discarding everything else, such as page numbers, footnotes, headers, and footers.
I have noticed that in Microsoft Word, a user can simply drag in a PDF and Word seems to automatically understand which parts are headers, footnotes, and so on. My guess is that Word either uses machine learning to analyze the layout and formatting of the PDF and classify the different sections accordingly, or relies on pre-defined rules and patterns to identify common elements such as headers and footnotes. I know of related techniques for extracting layout information from receipts and similar documents (LayoutLM, Xu et al., https://arxiv.org/abs/1912.13318) and from tables (TableNet, Paliwal et al., https://ieeexplore.ieee.org/document/8978013), but nothing that solves layout extraction in this particular domain.
I am curious whether there are techniques or algorithms that can replicate Word's behavior. Any suggestions or recommendations for data cleaning in PDF documents would be greatly appreciated. For context, the most naive approach I can think of is a purely positional heuristic, sketched below.
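Here is a rough sketch of that baseline (my own, not what Word does; the 10% margin bands and the page-number regex are arbitrary assumptions, and it relies on PyMuPDF's text-block API):

```python
# A naive positional baseline (not what Word does): keep only text blocks
# that fall outside fixed header/footer bands and drop blocks that are
# bare page numbers. The 10% margins are an arbitrary assumption.
import re
import fitz  # PyMuPDF

def extract_body_text(pdf_path, margin_frac=0.10):
    body = []
    doc = fitz.open(pdf_path)
    for page in doc:
        top = page.rect.height * margin_frac
        bottom = page.rect.height * (1.0 - margin_frac)
        # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type)
        for _x0, y0, _x1, y1, text, _no, block_type in page.get_text("blocks"):
            if block_type != 0:            # 0 = text block, 1 = image block
                continue
            if y1 <= top or y0 >= bottom:  # block sits entirely in a margin band
                continue
            if re.fullmatch(r"\s*\d+\s*", text):  # a bare page number
                continue
            body.append(text.strip())
    doc.close()
    return "\n\n".join(body)

print(extract_body_text("example.pdf"))
```

This obviously breaks on footnotes inside the text area, multi-column layouts, and running headers whose position varies, which is why I am hoping for something learned rather than hand-coded.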
CatalyzeX_code_bot t1_j2jbyzw wrote
Found relevant code at https://github.com/microsoft/unilm/tree/master/layoutlm
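For reference, a minimal sketch of running the HuggingFace Transformers port of LayoutLM for token classification; the base checkpoint is not fine-tuned for header/footer tagging, so the label count, the example words, and the bounding boxes below are illustrative assumptions:

```python
# Minimal sketch: LayoutLM token classification via HuggingFace Transformers.
# The base checkpoint is NOT fine-tuned for header/footer tagging, so
# num_labels, words, and boxes here are illustrative assumptions. Bounding
# boxes must be normalized to a 0-1000 coordinate grid.
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=3  # e.g. body / header / footer
)

words = ["Page", "1", "Introduction"]
word_boxes = [[700, 30, 760, 50], [765, 30, 780, 50], [80, 120, 240, 145]]

# Repeat each word's box for every subword token it produces.
token_boxes = []
for word, box in zip(words, word_boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
# Boxes for the [CLS] and [SEP] special tokens.
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
outputs = model(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
    bbox=torch.tensor([token_boxes]),
)
predicted_labels = outputs.logits.argmax(-1)  # one label id per token
print(predicted_labels)
```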