How to Extract Table Data from PDFs Using 3 Python Libraries (tabula-py, pdfplumber, PyPDF2)
Extracting table data from PDFs can be a daunting task, but Python provides several powerful libraries to help you get the job done efficiently. In this article, we’ll explore three Python libraries (tabula-py, pdfplumber, and PyPDF2) and demonstrate how to extract table data from a sample PDF document with each of them.
Sample PDF
The PDF used for this tutorial contains a table with the following structure:
Item | Description | Quantity | Price |
---|---|---|---|
1 | Laptop | 1 | $1200 |
2 | Mouse | 2 | $25 |
3 | Keyboard | 1 | $45 |
4 | Monitor | 2 | $200 |
5 | Printer | 1 | $150 |
You can download the sample PDF here.
1. Tabula-py
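tabula-py, a Python wrapper around tabula-java, is one of the most popular libraries for extracting tables from PDFs. It works particularly well for PDFs that contain clearly defined tables. Because it wraps a Java tool, it needs a Java runtime installed on your machine.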
Installation:
pip install tabula-py jpype1
The following code extracts the table data using tabula-py.
import tabula
# Read PDF and extract tables
tables = tabula.read_pdf("data/sample.pdf", pages='all')
# Output the extracted table
print(tables[0])
This is the printed result.
- tabula.read_pdf reads tables directly from the PDF and returns them as a list of pandas DataFrames, one per detected table, which is why the code prints tables[0].
- We use pages='all' so that every page is scanned for tables. Since the sample PDF is short, the table is found on the first page.
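Once the table is in a DataFrame, a common next step is turning the Price strings into numbers. The sketch below is one way to do that with the standard library only. It assumes the rows have been exported with tables[0].to_dict("records") and that the column names match the sample table; the records list here is a hand-written stand-in for that export, and parse_price and order_total are hypothetical helper names, not part of tabula-py.

```python
from decimal import Decimal


def parse_price(value):
    """Turn a value such as "$1,200" or 1200 into a Decimal."""
    text = str(value).replace("$", "").replace(",", "").strip()
    return Decimal(text)


def order_total(records):
    """Sum quantity * price over a list of row dictionaries."""
    return sum(
        (int(row["Quantity"]) * parse_price(row["Price"]) for row in records),
        Decimal("0"),
    )


# Hand-written stand-in for tables[0].to_dict("records") on the sample PDF
# (assumption: tabula-py keeps the header names shown in the sample table).
records = [
    {"Item": 1, "Description": "Laptop", "Quantity": 1, "Price": "$1200"},
    {"Item": 2, "Description": "Mouse", "Quantity": 2, "Price": "$25"},
    {"Item": 3, "Description": "Keyboard", "Quantity": 1, "Price": "$45"},
    {"Item": 4, "Description": "Monitor", "Quantity": 2, "Price": "$200"},
    {"Item": 5, "Description": "Printer", "Quantity": 1, "Price": "$150"},
]

print(order_total(records))  # 1845
```

Decimal is used instead of float so that currency values are not affected by binary rounding.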
2. pdfplumber
pdfplumber is highly flexible and provides detailed control over what elements to extract from a PDF, including text and tables.
Installation:
pip install pdfplumber
The following code extracts the table.
import pdfplumber
# Open the PDF and extract the table
with pdfplumber.open("data/sample.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()
# Print the extracted table
for row in table:
    print(row)
This is the printed result.
- extract_table() returns the largest table found on the PDF page, or None if no table is detected.
- The table is returned as a list of lists, with each inner list representing a row in the table.
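Because the result is a plain list of lists, it is often convenient to pair each row with the header row. The sketch below shows one way to do that. table_to_records is a hypothetical helper name, and the sample_table list is a hand-written stand-in for what extract_table() would return for the sample PDF, assuming the first row is the header.

```python
def table_to_records(table):
    """Pair each data row with the header row to build a list of dicts."""
    if not table:  # covers both None (no table found) and an empty list
        return []
    header, *rows = table
    # Empty cells can come back as None, so normalise them to "".
    header = [(cell or "").strip() for cell in header]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in rows
    ]


# Hand-written stand-in for page.extract_table() on the sample PDF.
sample_table = [
    ["Item", "Description", "Quantity", "Price"],
    ["1", "Laptop", "1", "$1200"],
    ["2", "Mouse", "2", "$25"],
]

for record in table_to_records(sample_table):
    print(record)
```

Each record can then be looked up by column name, for example record["Price"], instead of by position.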
3. PyPDF2
While PyPDF2 is a more general-purpose PDF manipulation library, we can extract text and attempt to structure it into a table format.
Installation:
pip install PyPDF2
The following code extracts the text and splits it into rows and columns.
from PyPDF2 import PdfReader
# Load the PDF and extract text from the first page
reader = PdfReader("data/sample.pdf")
page = reader.pages[0]
text = page.extract_text()
# Split the text into lines (each line represents a row in the table)
lines = text.split('\n')
# Keep only lines that contain a digit. This drops the header row, but any
# other line with a digit (such as a page number) is kept as well.
table_lines = [line for line in lines if any(char.isdigit() for char in line)]
# Now we need to further process each line to split it into columns
table_data = []
for line in table_lines:
    # Split each row on whitespace. This only works while every cell is a
    # single word; use a more advanced method if cells can contain spaces.
    columns = line.split()
    table_data.append(columns)
# Print the table data (as a list of lists)
for row in table_data:
    print(row)
This is the printed result.
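Splitting on whitespace breaks as soon as a description contains more than one word. As a sketch of the "more advanced method" mentioned in the code comment, the example below uses a regular expression that anchors on the numeric columns and lets the description take whatever is in between. It assumes each table row comes out of extract_text() as a single line in the order Item, Description, Quantity, Price; ROW_PATTERN and parse_row are hypothetical names, and the "USB C Hub" line is an invented row used only to show a multi-word description.

```python
import re

# Item number, free-text description, quantity, then a price starting with "$".
ROW_PATTERN = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)\s+(\$[\d,.]+)$")


def parse_row(line):
    """Return [item, description, quantity, price] or None if the line is not a row."""
    match = ROW_PATTERN.match(line.strip())
    return list(match.groups()) if match else None


lines = [
    "Item Description Quantity Price",  # header: no match
    "1 Laptop 1 $1200",
    "6 USB C Hub 3 $30",                # invented row with a multi-word description
    "Page 1",                           # page footer: no match
]

table_data = [row for row in map(parse_row, lines) if row is not None]
for row in table_data:
    print(row)
# ['1', 'Laptop', '1', '$1200']
# ['6', 'USB C Hub', '3', '$30']
```

Unlike the digit filter above, this pattern also rejects lines such as page numbers, because they do not have all four columns.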