Many organizations rely on technical documents stored in PDF format, including manuals, reports, specifications, and drawings. These documents often contain not just text, but also tables, images, captions, headers, and complex layouts. Extracting useful information from them is far from simple.
Traditional OCR (Optical Character Recognition) tools were designed to recognize printed text in scanned images. While they do a good job of converting readable text into digital characters, they struggle to understand how a document is structured. They can’t always tell which text belongs to a table, which paragraph is a caption, or how images and their descriptions connect.
Our client faced these same challenges in their technical document extraction process: manual extraction was too time-consuming, and traditional off-the-shelf solutions couldn’t handle their complex cases.
Our AI team stepped in to build a customized object detection model that lets the client accurately extract structured data from complex technical PDFs and convert it into their required format, the Flex format.
The Limitations of Traditional OCR
Traditional OCR is good at reading but not at understanding. In other words, it focuses on what the text says without knowing where it belongs. That difference matters when dealing with technical or structured documents. Here is what OCR usually misses.
Loss of Document Structure
OCR sees lines of text, not the logical layout of a document. It doesn’t know that some text is a heading, a table cell, or a list item. As a result, the extracted output may look like a wall of text, losing the organization that makes documents understandable.
Poor Handling of Non-Text Elements
Images, captions, and tables often carry meanings that depend on each other. OCR may ignore these elements, extract them separately, or treat them as random blobs of pixels, but it never understands their relationships. For example, it cannot tell that a graph belongs with the caption printed below it.
Weak Table Analysis
OCR can see that a table exists, but it can’t always tell where one cell ends and another begins, especially if the table doesn’t have visible borders. As a result, data that should be organized in rows and columns ends up as inconsistent blocks of text.
Inconsistent Reading Order
When documents have multi-column layouts (as many technical manuals do), OCR struggles to decide which text to read first. The result: sentences come out jumbled, with text from different columns mixed together.
Lack of Customization
Most OCR systems produce a fixed output format (like plain text or JSON). They can’t be easily adapted to a client’s data structure or downstream system requirements.
Our Customized Object Detection Approach
Understanding the client’s pain points, we approached the problem differently. Instead of treating a document as a cloud of words, our model focuses on detecting all meaningful elements on each page as individual objects.
Think of a document as a visual scene. Just like an object detection model can find cars, trees, and people in a photo, our model finds text blocks, tables, images, headers, footers, and captions in a PDF page.
The process looks like this:
1. The document (PDF) is received and broken down into individual page images.
2. Each page is analyzed using object detection to identify visual and textual components.
3. Those detected objects (text blocks, tables, images, headers, footers, etc.) are processed and organized into a structured data model.
4. Finally, the system converts that structured data into the client’s required format, in this case, the Flex format.
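In code, this flow might look like the minimal sketch below. It is an illustration rather than the production pipeline: pdf2image is one common choice for rasterizing PDF pages, and the three helper functions are hypothetical stubs standing in for the detection, structuring, and conversion stages.

```python
from pdf2image import convert_from_path  # one common PDF-to-image library

def detect_layout(page_image):
    """Step 2 (stub): run object detection over the rendered page."""
    return []  # would return (class_name, bounding_box, content) tuples

def build_structure(objects):
    """Step 3 (stub): order and group detected objects into a data model."""
    return {"elements": objects}

def to_flex(structured):
    """Step 4 (stub): serialize the data model into the client's Flex format."""
    return str(structured)

def process_pdf(pdf_path: str) -> list[str]:
    pages = convert_from_path(pdf_path, dpi=200)  # step 1: PDF -> page images
    return [to_flex(build_structure(detect_layout(page))) for page in pages]
```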
How Our Custom OCR Model Works Under the Hood
At the heart of our model is a mix of advanced object detection and multimodal AI technologies. A few standouts include:
YOLO for Object Detection – We use a YOLO-based detector to find 11 different types of objects, from text and images to tables, section headers, list items, and captions. This goes well beyond traditional OCR, which typically captures only text and images.
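As a rough sketch, here is how such a detector might be invoked with the ultralytics YOLO runtime; the weights file is hypothetical, and the class names would come from the fine-tuning dataset (the 11 classes mirror common layout label sets such as DocLayNet):

```python
from ultralytics import YOLO  # pip install ultralytics

# Hypothetical weights fine-tuned on 11 document-layout classes.
model = YOLO("layout_detector.pt")

results = model("page_001.png", conf=0.4)  # detect layout objects on one page image
for box in results[0].boxes:
    label = results[0].names[int(box.cls)]  # e.g. "Table", "Caption", "Section-header"
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding box in pixel coordinates
    print(f"{label}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")
```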
CLIP for Vision-Language Understanding – We integrate CLIP, a vision-language model that helps the system understand how text and images relate to each other. For instance, it can correctly link a figure with its corresponding caption, something OCR could never do automatically.
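A minimal sketch of that idea using the public openai/clip-vit-base-patch32 checkpoint: embed a cropped figure image and the candidate caption texts detected near it, then keep the caption with the highest image-text similarity. The file name and captions here are made up for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor  # pip install transformers

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

figure = Image.open("figure_crop.png")  # figure region cropped from the page
captions = [                            # text of nearby detected Caption objects
    "Figure 3: Hydraulic pump assembly",
    "Table 2: Torque specifications",
]

inputs = processor(text=captions, images=figure, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image  # similarity of the image to each text
best = scores.argmax().item()
print("Most likely caption:", captions[best])
```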
Smart Reading Order – The model can read documents with multiple columns, understanding that in a two-column layout it should finish one column before moving on to the next, rather than sweeping straight across the page and mixing the columns. This ensures the extracted text preserves the same logical flow as the original.
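One simple way to approximate this, assuming each detected block carries a bounding box (the fixed midpoint split below is a simplification; the real pipeline infers the column layout per page):

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x0: float  # left edge of the bounding box
    y0: float  # top edge of the bounding box

def reading_order(blocks: list[Block], page_width: float) -> list[Block]:
    """Read the left column top to bottom, then the right column,
    instead of sweeping straight across the page and mixing columns."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x0 < mid), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= mid), key=lambda b: b.y0)
    return left + right
```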
Advanced Table Reconstruction – Using YOLO’s detections, our pipeline can rebuild the structure of tables, recognizing cell boundaries even in borderless designs. The result is a clean, machine-readable table ready for conversion.
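A simplified sketch of the idea: group detected cell boxes into rows by vertical proximity, then sort each row left to right. No drawn borders are needed, only the cells’ coordinates (the tolerance value is illustrative):

```python
def rebuild_table(cells: list[tuple[str, float, float]], row_tol: float = 10.0):
    """cells: (text, x_center, y_center) of each detected cell box.
    Returns the table as a list of rows, each a list of cell texts."""
    rows: list[list[tuple[str, float, float]]] = []
    for cell in sorted(cells, key=lambda c: c[2]):          # top to bottom
        if rows and abs(cell[2] - rows[-1][-1][2]) <= row_tol:
            rows[-1].append(cell)                           # same row band
        else:
            rows.append([cell])                             # new row starts
    return [[c[0] for c in sorted(row, key=lambda c: c[1])] for row in rows]

# Four borderless cells become a clean 2x2 grid:
print(rebuild_table([("Part", 50, 100), ("Qty", 200, 102),
                     ("Bolt M6", 50, 130), ("8", 200, 131)]))
# [['Part', 'Qty'], ['Bolt M6', '8']]
```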
Customizable for Any Client Use Case – Unlike general-purpose OCR, our pipeline can be tailored to each client. We can adapt the object classes, reading logic, or output format, ensuring the system fits into existing workflows, such as converting extracted data into the Flex format.
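Since the Flex format is client-specific and not public, the sketch below only shows the shape of that customization: the structured data model stays fixed, while each client plugs in its own writer. The write_flex function here is a hypothetical stand-in, not the real serialization.

```python
import json
from typing import Callable

# Structured model produced by the detection and structuring stages.
document = {"elements": [
    {"type": "section_header", "text": "2.1 Pump Assembly"},
    {"type": "table", "rows": [["Part", "Qty"], ["Bolt M6", "8"]]},
]}

def write_json(doc: dict) -> str:
    return json.dumps(doc, indent=2)

def write_flex(doc: dict) -> str:
    # Hypothetical placeholder for the client's real Flex serialization.
    return "\n".join(f"<{e['type']}>{e.get('text', e.get('rows'))}</{e['type']}>"
                     for e in doc["elements"])

writers: dict[str, Callable[[dict], str]] = {"json": write_json, "flex": write_flex}
print(writers["flex"](document))
```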
Head-to-Head: Our Custom OCR Pipeline vs. Traditional OCR
To evaluate performance, we compared our customized pipeline with DeepSeek OCR, a well-known OCR-based extraction system.
| Feature | Our Custom Pipeline | DeepSeek OCR |
| --- | --- | --- |
| Process Flow | Detect objects → process → structure → convert to Flex | Recognizes text directly, with no layout-detection step |
| Reading Order | Reads top-to-bottom, left-to-right; adapts to multi-column layouts | Similar, but less adaptive |
| Table Extraction | Rebuilds tables, even without borders | Needs borders; limited accuracy |
| Customization | Fully customizable for client needs | Fixed output; limited flexibility |
[Figure: examples of how DeepSeek OCR extracts elements from PDF documents]
The results were clear:
Our pipeline extracted more complete document content.
It handled complex tables and multi-column layouts better.
And most importantly, it was customizable – tailored to client workflows and data formats.
Results and Observations of Custom OCR
After extensive testing, we found that our model delivers consistent improvements over OCR-based pipelines:
Improved table extraction, even for borderless or complex tables.
More complete document understanding, capturing captions, headers, and section structure.
Accurate reading order, even across mixed one-, two-, and three-column layouts.
Smooth integration into the client’s Flex format for automated processing.
Our custom approach matters as it changes how we think about document understanding. Traditional OCR tools see a document as a flat surface filled with text. As mentioned, they can recognize characters and words, but they often lose sight of how different pieces of information relate to one another. In contrast, our model treats a document as a structured composition, a collection of interconnected elements such as text blocks, tables, images, headers, and captions. This shift in perspective allows the system to preserve not only the content but also the organization and meaning behind it.
By combining object detection with multimodal learning models like CLIP, our pipeline doesn’t just read text; it understands the relationship between visual and textual elements. For example, it can identify an image, find its corresponding caption, and maintain that connection in the extracted data. This capability makes a significant difference in technical documents, where context and relationships are essential.
Another reason our approach stands out is its customizability. Instead of producing one fixed output format, our system adapts to each client’s needs. We can tailor the extraction process and output structure to match the client’s internal Flex format or other data models. This flexibility ensures that the information extracted from PDFs can flow directly into existing workflows without requiring extensive post-processing or manual correction.
Ultimately, our model delivers cleaner, richer, and more usable data. It bridges the gap between raw visual information and structured digital intelligence, turning static PDF documents into meaningful, machine-readable resources. In other words, it helps organizations move beyond simple text recognition toward true document understanding.
Learning from Docling: The Research Behind Smarter Extraction
While developing the custom model, the team studied various document AI systems, including Docling, an open-source toolkit that applies vision-language models (VLMs) to document analysis.
Docling’s use of VLMs allows it to:
Understand the relationship between images and captions.
Maintain correct text reading order, even with multiple columns.
Perform better at table detection than traditional OCR.
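For reference, a minimal Docling call looks like the sketch below (based on the project’s published Python API; the file name is made up):

```python
from docling.document_converter import DocumentConverter  # pip install docling

converter = DocumentConverter()
result = converter.convert("technical_manual.pdf")  # layout analysis + table structure
print(result.document.export_to_markdown())         # structured output, tables included
```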
However, we found major drawbacks. Docling requires significant hardware resources, which makes it hard to deploy at scale, and, like other complex AI models, it still needs fine-tuning for each client’s use case.
The Future of Document Intelligence
While our current pipeline already offers strong performance and flexibility, we see many opportunities to make it even more capable. One of our main focuses moving forward is improving how the model detects and reconstructs complex tables, especially those without clear borders or consistent layouts. Tables are among the most information-rich elements in technical documents, and refining their extraction accuracy will significantly enhance overall data quality.
We also plan to integrate more advanced vision-language models into our system. These models can reason about both images and text simultaneously, enabling the pipeline to better understand visual context, complex document hierarchies, and semantic relationships. By doing so, the system will become smarter at recognizing how different elements on a page work together to convey meaning.
Another area of development is performance optimization. As we work with increasingly large document batches, efficiency becomes crucial. We aim to reduce processing time and hardware demands without compromising accuracy, making the system scalable for enterprise-level use.
Finally, we’re continuing to expand the customization capabilities of our pipeline. Every client has unique document types and data structures, so we want our system to adapt even more seamlessly to new formats and specialized use cases.
In short, our next steps are all about deepening understanding, improving efficiency, and ensuring adaptability, building toward a future where intelligent document processing becomes effortless, precise, and universally accessible.
Final Thoughts
Traditional OCR has been a useful tool for decades, but it was never designed for the complexity of today’s technical documents.
By combining object detection, image processing, and multimodal understanding, our customized model fills those gaps, reading not just the text, but the structure and meaning behind every element.
Where OCR stops, our system continues, transforming complex PDFs into structured, intelligent data ready for the digital workflow.
Trinh Nguyen
I'm Trinh Nguyen, a passionate content writer at Neurond, a leading AI company in Vietnam. Fueled by a love of storytelling and technology, I craft engaging articles that demystify the world of AI and Data. With a keen eye for detail and a knack for SEO, I ensure my content is both informative and discoverable. When I'm not immersed in the latest AI trends, you can find me exploring new hobbies or binge-watching sci-fi.