lopnax t1_j2jt94g wrote on January 1, 2023 at 9:33 PM
Reply to [D] Data cleaning techniques for PDF documents with semantically meaningful parts by cm_34978
Have you tried PyMuPDF? You could extract the text with it and then discard the unwanted parts with a regex. https://pymupdf.readthedocs.io/en/latest/index.html
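Here is a rough sketch of what that could look like. The patterns and the helper names (`strip_boilerplate`, `extract_clean_text`) are just made up for illustration, so swap in whatever boilerplate your documents actually contain. It assumes a reasonably recent PyMuPDF where `page.get_text()` is available.

```python
import re

# Hypothetical patterns for boilerplate lines; tune them to your documents.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*Page \d+( of \d+)?\s*$", re.IGNORECASE),  # "Page 3 of 10"
    re.compile(r"^\s*\d+\s*$"),                                 # bare page numbers
]


def strip_boilerplate(text, patterns=BOILERPLATE_PATTERNS):
    """Drop every line that matches one of the boilerplate patterns."""
    kept = [
        line
        for line in text.splitlines()
        if not any(p.match(line) for p in patterns)
    ]
    return "\n".join(kept)


def extract_clean_text(pdf_path):
    """Extract text page by page with PyMuPDF and filter each page."""
    import fitz  # PyMuPDF; imported lazily so the regex helper works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(strip_boilerplate(page.get_text()) for page in doc)
```

Filtering line by line keeps the regexes simple. If the parts you want to drop span several lines, you would probably need to match against the whole page text instead.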
You can also crop the page down to a specific rectangle with PyPDF2 and then extract the text. https://pypdf2.readthedocs.io/en/stable/user/cropping-and-transforming.html
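A minimal sketch of the rectangle idea follows, with one swap: it uses PyMuPDF's `clip` argument to `page.get_text()` rather than PyPDF2's crop box. As far as I know, changing the crop box mainly changes the visible area and may not filter what text extraction returns, so test that on your own files. The helper names and the 60-point header/footer margins are assumptions, not anything from your documents.

```python
def body_rect(page_width, page_height, header=60.0, footer=60.0):
    """Return (x0, y0, x1, y1) for the area between header and footer.

    PyMuPDF coordinates start at the top-left corner, with y growing downward.
    """
    if header + footer >= page_height:
        raise ValueError("header and footer margins leave no page body")
    return (0.0, header, page_width, page_height - footer)


def extract_body_text(pdf_path, header=60.0, footer=60.0):
    """Extract only the text inside the body rectangle of every page."""
    import fitz  # PyMuPDF; imported lazily so body_rect works without it

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            rect = body_rect(page.rect.width, page.rect.height, header, footer)
            # clip restricts extraction to text located inside the rectangle
            parts.append(page.get_text("text", clip=fitz.Rect(*rect)))
    return "\n".join(parts)
```

This only helps if the unwanted parts sit in a predictable position on every page, like running headers and footers. Combining it with the regex approach covers the cases where position alone is not enough.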