Full-Text Search
Retrieving textual content from documents is a vital part of many PDF workflows. For various reasons, text extraction isn’t always straightforward, but here is how we make it easy for you with Foxit PDF SDK.
What is Full-Text Search?
PDF documents consist of many parts, with the textual content often being the most important. Text search and text extraction are two common tasks required by both PDF developers and the end users of PDF software. To search for text in a document, the text content must first be extracted from the PDF, which can be difficult without our SDK!
Using an index allows a document to be searched quickly, because the text extraction phase only needs to be completed once. The search operation can then be scaled up to cover large sets of documents.
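The indexing idea can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the index that Foxit PDF SDK builds: the class `SimpleTextIndex` and its methods are hypothetical names, and the text of each document is assumed to have been extracted already.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Foxit PDF SDK: a tiny in-memory inverted index.
final class SimpleTextIndex {
    // Maps each lower-cased word to the sorted set of document names that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index one document. The text is assumed to be already extracted from the PDF,
    // so this step runs once per document, not once per search.
    void addDocument(String docName, String extractedText) {
        String lower = extractedText.toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter or a decimal digit.
        for (String word : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Look up a single word without reading the original documents again.
    Set<String> search(String word) {
        Set<String> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.emptySet() : Collections.unmodifiableSet(hits);
    }
}
```

A lookup is a single map access regardless of how many documents were indexed, which is why the one-time extraction cost pays off for large collections.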
Simplifying the search process with PDF SDK
PDF SDK offers the fastest text search technology in the market. The biggest challenge with text extraction, or text searching, is that the PDF format allows any character to be displayed at any position on the page, in any size and with any rotation. Characters are not required to be stored in any particular order, whether within a page, on a line, or even within a word!
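To see why stored order matters, here is a minimal sketch of one common way to rebuild reading order from positioned characters. It is an illustration only, not the algorithm used by Foxit PDF SDK. The `ReadingOrder` class, the `Glyph` type, and the `lineTolerance` parameter are hypothetical, and the sketch assumes unrotated, left-to-right text in a single column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuild reading order from glyphs stored in arbitrary order.
final class ReadingOrder {
    // Assumed glyph model: one character plus its baseline origin in PDF user space.
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    // Sort glyphs top to bottom, group them into lines, then sort each line left to right.
    // Glyphs within lineTolerance of the first glyph of a line belong to that line.
    static String toText(List<Glyph> glyphs, double lineTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // In PDF user space y grows upward, so a larger y means an earlier line.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y));

        StringBuilder out = new StringBuilder();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : sorted) {
            if (!line.isEmpty() && Math.abs(line.get(0).y - g.y) > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(g);
        }
        appendLine(out, line);
        return out.toString();
    }

    // Emit one line of glyphs in left-to-right order, separated from earlier lines by '\n'.
    private static void appendLine(StringBuilder out, List<Glyph> line) {
        if (line.isEmpty()) {
            return;
        }
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (Glyph g : line) {
            out.append(g.ch);
        }
    }
}
```

Rotated text, multiple columns, and right-to-left scripts all break these assumptions, which is part of what makes general-purpose extraction hard.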
Words can be split (for example, with a hyphen at the end of a line) and certain characters can be combined into one (for example, the single ﬁ ligature instead of separate f and i). When searching for a phrase, the words of the phrase could be on different lines.
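These three problems can be handled on extracted text with a small normalization step before matching. The sketch below uses only the Java standard library. The class name `SearchTextNormalizer` is hypothetical, and the hyphen rule is deliberately naive: it also joins genuinely hyphenated compounds that happen to break at a line end.

```java
import java.text.Normalizer;

// Hypothetical helper: prepare extracted text so that phrase matching is not
// defeated by ligatures, end-of-line hyphens, or line breaks.
final class SearchTextNormalizer {
    static String normalize(String extracted) {
        // NFKC expands compatibility ligatures such as U+FB01 into the letters "fi".
        String s = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        // Rejoin a word that was split by a hyphen at the end of a line (naive rule).
        s = s.replaceAll("(\\p{L})-\\r?\\n\\s*(\\p{L})", "$1$2");
        // Collapse line breaks and runs of whitespace so a phrase can span lines.
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

Applying the same normalization to the query string keeps both sides of the comparison consistent.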
The visual representation of the characters (known as glyphs) can be sourced from any number of fonts stored within the PDF. In addition, many different font formats are allowed, e.g., TrueType or PostScript. Foxit PDF SDK has already accounted for all of this and, out of the box, helps you set up foolproof full-text search capabilities for your application.
ELEMENTS OF FULL-TEXT SEARCH

SEARCH A STRING OF TEXT ACROSS ANY/ALL DOCUMENTS

HIGHLIGHT ALL INSTANCES OF A STRING ON A DOCUMENT

NAVIGATE THROUGH PREVIOUS/NEXT SEARCH RESULTS

ABILITY TO SEARCH META INFORMATION

COMPLETE FILE SEARCH IN SECONDS

KEYWORD, STRING OR PHRASE SEARCH
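The previous/next navigation listed above can be pictured as a cursor over a list of match positions. The following is a minimal sketch over already extracted text, not the Foxit PDF SDK search API. The class `MatchNavigator` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect all case-insensitive matches of a query in
// extracted text and step through them with wrap-around.
final class MatchNavigator {
    private final List<Integer> offsets = new ArrayList<>();
    private int current = -1; // index into offsets; -1 means nothing selected yet

    MatchNavigator(String text, String query) {
        if (!query.isEmpty()) {
            for (int i = 0; i + query.length() <= text.length(); i++) {
                // regionMatches with ignoreCase=true keeps offsets valid for the original text.
                if (text.regionMatches(true, i, query, 0, query.length())) {
                    offsets.add(i);
                }
            }
        }
    }

    // Number of matches found; a viewer would highlight each of them.
    int count() {
        return offsets.size();
    }

    // Offset of the next match, wrapping to the first one; -1 if there are no matches.
    int next() {
        if (offsets.isEmpty()) {
            return -1;
        }
        current = (current + 1) % offsets.size();
        return offsets.get(current);
    }

    // Offset of the previous match, wrapping to the last one; -1 if there are no matches.
    int previous() {
        if (offsets.isEmpty()) {
            return -1;
        }
        current = (current <= 0 ? offsets.size() : current) - 1;
        return offsets.get(current);
    }
}
```

In a real viewer each offset would be mapped back to page rectangles so the match can be highlighted and scrolled into view.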
Tagging to help Full-text search
The PDF format offers full tagging support for blocks of text and other items in the page, which allows items to be identified, read, searched and rendered properly. Foxit PDF SDK offers full support for programmatic tagging of phrases, paragraphs, and all other PDF items, which serves a double purpose:
1. Faster, more streamlined PDF searching
2. Enhanced accessibility and compliance with many document accessibility standards
WHY USE FULL-TEXT SEARCH IN PDFS?
NEVER LOSE INFORMATION AGAIN
SEARCH COLLECTIONS OF DOCUMENTS IN SECONDS
SEARCH FULL PDFS, INCLUDING METADATA AND ANNOTATIONS
Full-Text Search and Metadata
When creating documents, information can be organized and managed so that full-text search is easy and logical. This involves editing document metadata to ensure it is complete and updating document tags to outline the topics discussed in each file. By using frequent topics or department names as tags, companies can search folders containing hundreds of files and find the information they seek quickly and accurately. Foxit PDF SDK allows you to set, remove, and edit all metadata in your documents programmatically, based on preset logic and workflows.
Full-Text Search and Redaction
In many industries, particularly after the approval of GDPR in Europe, searching for and deleting customer information (such as the name of a customer) across all documents in one or all of your document management systems has become a nightmare. Just think about all the information you hold right now on any one of your customers: contracts, support tickets, archived emails, financial records… Now imagine receiving a GDPR request for removal of all their information.
Foxit PDF SDK turns this manual, multi-hour task into a quick search and remove. You can search for all instances of any given string of text (such as the name of your customer) across all your records, select them, and securely redact them while maintaining the integrity of the original document.
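The search-then-redact workflow can be outlined on extracted text as follows. This is only a sketch of the control flow under stated assumptions: the class `TextRedactionSketch`, the `redactAll` method, and the `[REDACTED]` marker are hypothetical, and masking a string in extracted text is not secure PDF redaction, which has to remove the underlying content from the file itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: mask every occurrence of a term across a set of documents.
// It operates on extracted text only and does not modify any PDF file.
final class TextRedactionSketch {
    static final String MARKER = "[REDACTED]";

    // Returns a copy of each document's text with all case-insensitive matches masked.
    static Map<String, String> redactAll(Map<String, String> textByDocument, String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("term must not be empty");
        }
        // Pattern.quote treats the term as a literal string, not as a regular expression.
        Pattern pattern = Pattern.compile(
                Pattern.quote(term), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : textByDocument.entrySet()) {
            String masked = pattern.matcher(entry.getValue())
                    .replaceAll(Matcher.quoteReplacement(MARKER));
            result.put(entry.getKey(), masked);
        }
        return result;
    }
}
```

Returning new strings instead of editing in place mirrors the goal of leaving the original records untouched until the redaction has been reviewed.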
USE CASES

SECURELY SEARCHING LEGAL DOCUMENTS

ACHIEVE GDPR COMPLIANCE
GDPR allows your European customers to request all personally identifiable information you hold on them. Have you asked yourself how you would search all the information you hold on any given customer across all your document management systems if you were to get a request like that? With PDF SDK, you can search, select and redact all instances of your customers’ information across all documents quickly and securely.
Sample Code
Full-Text Search
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// PDFDoc, PDFPage, TextPage and PDFException come from Foxit PDF SDK.
// Common, mContext and mFilePath belong to the surrounding sample class.
public void doPdfToText() {
    String outputFilePath = "output.txt";
    StringBuilder text = new StringBuilder();
    PDFDoc doc = Common.loadPDFDoc(mContext, mFilePath, null);
    try {
        int pageCount = doc.getPageCount();
        // Traverse the pages and collect the text of each one.
        for (int i = 0; i < pageCount; i++) {
            PDFPage page = Common.loadPage(mContext, doc, i, PDFPage.e_ParsePageNormal);
            TextPage textPage = new TextPage(page, TextPage.e_ParseTextNormal);
            text.append(textPage.getChars(0, textPage.getCharCount())).append("\r\n");
            page.delete();
        }
    } catch (PDFException e) {
        // PDF to text error
        return;
    } finally {
        Common.releaseDoc(mContext, doc);
    }

    // Write the collected text to the output file.
    // try-with-resources closes the writer even if the write fails.
    try (FileWriter fileWriter = new FileWriter(new File(outputFilePath))) {
        fileWriter.write(text.toString());
    } catch (IOException e) {
        // Error writing the text file
    }
}