PDF text extractor Python

8/4/2023

If you open the link to the PDF you will find a long report with many pages and figures. For the purpose of this post, I am only going to focus on extracting the text from the Executive Summary on pages xii and xiii. With the PDF and text identified, let’s move on to using Python to extract the Executive Summary.

Note: The following code explanation is designed for the Google Colab environment.

The library we will use to extract the PDF text is called PyPDF2. PyPDF2 can do much more than just extract text, and if you are curious about its other capabilities, you can read about them here. This library isn’t pre-installed in the Google Colab environment, so we will have to install it before importing it into our code.

!pip install PyPDF2
import PyPDF2

Before we move to the next step, make sure you have loaded the PDF document into the file repository on the left of the Colab environment. We must open the PDF as a file object before we can start using PyPDF2 on it.

pdf_file_obj = open("/content/content-indonesia-energy-outlook-2019-english-version.pdf", "rb")

Now we need to convert pdf_file_obj into a PyPDF2 object so that we can use the library to search through the Indonesia Energy Outlook and extract our text of interest.

# Converting the file object into a PDF reader object
# (current PyPDF2 names; older releases used PdfFileReader, numPages,
# getPage and extractText, which PyPDF2 3.0 removed)
pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
# If you want to find out the number of pages in the PDF use this command
print(len(pdf_reader.pages))

We know from looking at the original PDF that we are interested in pages 12 and 13, where the Executive Summary resides. We can pull out an individual page as follows.

# How to create a page object
page_obj = pdf_reader.pages[12]

If you print page_obj you will get something quite unreadable to the human eye.

Now let’s pull all the text from pages 12 and 13 and combine them to get the executive summary.

# Getting Executive Summary
page_obj1 = pdf_reader.pages[12]
page_obj2 = pdf_reader.pages[13]
executive_summary = page_obj1.extract_text() + page_obj2.extract_text()

You’ll notice that the text has many instances of “\n” within it when you print it out. We can remove these with a simple one-liner.

Our Python code:

clean_text = executive_summary.replace("\n", "")

Making our word document

We need one more library now so that we can create our Word document. The library we will be using is called python-docx. First, we install and import it into our environment.

!pip install python-docx
from docx import Document

With the necessary library installed, we must first create an empty document object and then build up that empty object by doing the following steps.
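The steps for building the Word document are not listed here, so what follows is only a rough sketch of one way to finish the job, not the author's own code. The helper names normalise_whitespace and save_summary_to_docx, the heading text and the output filename are all hypothetical. The sketch also swaps the replace("\n", "") one-liner for whitespace normalisation, on the assumption that deleting newlines outright can glue together words that were separated only by a line break.

```python
def normalise_whitespace(text):
    """Collapse newlines and runs of spaces into single spaces."""
    # str.split() with no arguments splits on any whitespace, including "\n",
    # so joining the pieces with a space keeps neighbouring words apart.
    return " ".join(text.split())


def save_summary_to_docx(text, output_path, heading="Executive Summary"):
    """Write text into a new Word document (hypothetical helper)."""
    # Imported inside the function so the text helper above can be used
    # even when python-docx is not installed.
    from docx import Document

    document = Document()                    # start from an empty document
    document.add_heading(heading, level=1)   # add a top-level heading
    document.add_paragraph(text)             # add the body as one paragraph
    document.save(output_path)               # write the .docx file to disk


if __name__ == "__main__":
    sample = "Indonesia Energy\nOutlook  2019\n"
    print(normalise_whitespace(sample))
```

In the Colab notebook this could be called as save_summary_to_docx(normalise_whitespace(executive_summary), "executive_summary.docx"), where the filename is again just a placeholder.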
