Structure PDF Table Data for AI Applications with GMFT
When extracting data from PDFs, tabular data extraction can become a challenge. This issue gets even worse when dealing with complex tables. While several tools exist for table extraction, most of them miss intricate details, such as merged cells and complex table structures.
Large Language Models (LLMs) are not designed to understand visual layout and struggle with reading unformatted or tables extracted from PDFs. Unformatted tables typically do not have explicit separators to inform the LLM about rows, columns or headers. So, we might end up with an inaccurate output.
As a general best practice, businesses should look for professional tools designed for unstructured data when they work with AI and tables. Although LLMs have their advantages and can be very useful for some processes, special table extractions tools drastically improve precision. In complex use cases, it may be easier to parse unformatted pages like these using solutions such as Optical Character Recognition (OCR) tools or libraries specifically created for table extraction. It preprocesses the data to be structured in an easy-to-analyze manner by AI models using these tools.
GMFT (Give Me Formatted Tables!) is one such tool, providing a fast and reliable way to extract tables from PDFs having both simple as well as more complex structures without the need of complicated setups or losing accuracy.
What is GMFT?
GMFT (Give Me Formatted Tables!) is an innovative toolkit that improves the way we extract tables from PDFs. Leveraging Microsoft's cutting-edge Table Transformer (TATR), GMFT excels in transforming even the most intricate table formats into structured data with remarkable precision. Whether you're dealing with complex scientific reports or financial statements, GMFT provides a seamless solution for all your table extraction needs.
Key Features of GMFT
- Lightweight and Fast - Designed for efficiency, GMFT operates smoothly on standard CPUs—no heavy GPU dependencies required. With an average page processing time of just 1.38 seconds and table conversion to DataFrames taking around 1.16 seconds, GMFT is 10 times faster than traditional alternatives, making it the ideal choice for time-sensitive tasks.
- Versatile Export Options - GMFT understands that different projects have different needs. That's why it allows you to export extracted tables in a variety of formats, including Pandas DataFrame, CSV, JSON, Markdown, HTML, and LaTeX. You can even obtain cropped images of tables for visual verification or further processing through OCR—perfect for those situations where a quick glance is needed.
- Exceptional Extraction Quality - It skillfully manages tables with multi-level headers, merged cells, and rotated layouts, delivering top-notch extraction quality. Whether you're working with scientific data or financial figures, GMFT ensures that every detail is captured accurately.
- Efficient Use of OCR - GMFT minimizes the need for OCR by utilizing the embedded text in PDFs whenever possible, significantly speeding up the extraction process. However, OCR capabilities are available for scanned documents when necessary, ensuring that GMFT can handle image-based PDF you throw its way.
Overcoming the Challenges of PDF Table Extraction
A recent project required the extraction of tables from a variety of PDF documents, including some with intricate features like merged cells and multi-level headers. Although there are many tools available for extraction but proved ineffective in our use case. A tool was needed that could provide following:
- Quickly extract tables from both text-based and image-based PDFs.
- Handle multi-level headers, merged cells, and large tables efficiently.
- Function without internet access in air-gapped environments.
GMFT shines in this situation — In order to satisfy these demands, we investigated and contrasted several table extraction tools, finding that GMFT is the best at offering a dependable and quick solution.
Feature Comparison
Feature | GMFT | Camelot | PyPDF2 | Tesseract | Cloud-Based Models |
Handles Multi-Level Headers | ✅ | ❌ | ❌ | ❌ | ✅ |
Extracts Merged Cells | ✅ | ❌ | ❌ | ❌ | ✅ |
Image-Based Table Extraction | ✅ | ❌ | ❌ | ✅ | ✅ |
Text-Based Table Extraction | ✅ | ✅ | ✅ | ❌ | ✅ |
Direct Table Detection | ✅ | ❌ | ❌ | ❌ | ✅ |
Offline Operation | ✅ | ❌ | ✅ | ✅ | ❌ |
Ideal for Air-Gapped Environments
A key requirement was a solution that worked in air-gapped environments—where no internet access is available. GMFT is a self-contained, CPU-based solution that doesn’t require internet access or external cloud resources, making it the perfect choice for offline use.
Seamless Extraction of Text and Image-Based Tables
GMFT is a versatile solution designed to extract tables from both text-based and image-based PDFs with ease:
- Integrated Functionality: GMFT effectively recognizes and extracts tables using advanced machine learning models, such as the Table Transformer. This allows for precise table extraction without the need for traditional OCR, ensuring that the table structure is preserved.
- Simultaneous Processing: With GMFT, users can extract text and image tables from the same document simultaneously. This means you get comprehensive results in one streamlined process, eliminating the hassle of separate workflows.
See the flow diagram below for a detailed view of how GMFT extracts tables from PDFs:
Transforming Unformatted Tables into Structured Data
In addition to the previous task, which includes extracting tables from PDFs, another important task is the process of taking unformatted, raw tables and converting them into a format that is structured for the various workflows and applications it is intended for:
- Unstructured vs. Structured Tables: In most cases, unstructured tables generally have no apparent organization. They contain non-existent spacing, cells that are merged and other formatting that is a haphazard. This makes these tables challenging even to utilize within machine learning applications since they do not have a defined format structure. Structured tables on the contrary, are clearly defined, have columns and rows, and are quite simple to read and understand by downstream systems.
- Gmft Processing: GMFT receives from text and image PDF sources, non-formatted tables and processes them into organized and defined tables. This approach helps to keep data free from errors and assists in shaping the final data which is meant for the analysis stage.
- Facilitating Embedding Workflows: Structured data is essential for embedding processes, where AI models often rely on clean, organized inputs for generating embeddings. By converting messy tables into structured formats, GMFT ensures that the data can be efficiently embedded and analyzed, making the workflow smoother and more accurate.
- Simplified Data Integration: The conversion from unstructured to structured eliminates the need for manual cleaning, saving significant time and effort. Structured tables can be consistently integrated into broader data workflows, reducing the chance of errors and ensuring smoother operation.
How GMFT Works?
Below is a quick explanation of how GMFT works and its simple setup process:
Simple Installation:
Getting started with GMFT (Give Me Formatted Tables!) is easy. Unlike other PDF extraction tools that require complex setups with dependencies like Detectron2 or Tesseract, you only need to install a few packages:
pip install transformers torch
pip install gmft
Lightweight and Efficient:
GMFT runs on pypdfium2 and transformers. On the first run, it downloads Microsoft's Table Transformer (TATR) from Hugging Face, taking up about 270 MB of disk space. This model is cached locally for quick access.
Flexible Table Output Formats:
GMFT offers flexibility in exporting extracted tables. You can work with Pandas DataFrames for further analysis or export data directly as CSV or JSON. Here's a snippet demonstrating how to save tables in multiple formats:
import pandas as pd
def save_tables_in_multiple_formats(tables):
for index, table in enumerate(tables):
table.to_csv(f"output_table_{index}.csv", index=False) # Save as CSV
table.to_json(f"output_table_{index}.json", orient='records') #Save as JSON
df = table.to_pandas() # Convert to Pandas DataFrame for further manipulation
# You can add additional processing here if needed
Efficient Handling of Complex Tables:
GMFT excels at extracting data from complex tables featuring merged cells and multi-level headers. For example, when working with financial reports, GMFT achieved impressive accuracy where other tools struggled:
from gmft.auto import AutoTableFormatter
def extract_complex_tables(tables):
# Use formatter for complex table extraction
formatter = AutoTableFormatter(support_multi_level=True, support_merged_cells=True)
formatted_tables = [formatter.extract(table) for table in tables]
return formatted_tables
QuickStart:
Below is a complete code example that demonstrates how to extract both a text-based table and an image-based table from a sample PDF:
from gmft.auto import AutoTableDetector, AutoTableFormatter
from gmft.pdf_bindings import PyPDFium2Document
def extract_tables(pdf_path):
# Initialize the table detector and formatter
detector = AutoTableDetector()
formatter = AutoTableFormatter()
# Load the PDF document
doc = PyPDFium2Document(pdf_path)
tables = []
# Iterate through each page to extract tables
for page in doc:
# Extract tables from the page
tables += detector.extract(page)
# Format the extracted tables
formatted_tables = [formatter.extract(table) for table in tables]
return formatted_tables
# Example usage
pdf_path = "path/to/sample_pdf_with_text_and_image_table.pdf" # Update with your PDF path
tables = extract_tables(pdf_path)
# Save tables in multiple formats
save_tables_in_multiple_formats(tables)
How to Use the Code:
- Update the pdf_path variable with the path to your sample PDF file that contains both a text table and an image table.
- Run the script to extract and save the tables.
Final Thoughts
For anyone working with PDF table extraction—especially in air-gapped environments or when dealing with complex table layouts—GMFT is a great tool. It’s fast, accurate, and integrates seamlessly into existing workflows without the overhead of heavy GPU dependencies or internet access. This allows it to handle a wide variety of documents without requiring separate tools for text and image-based extraction.