Indexing PDF Documents without file size creep

IP.com Disclosure Number: IPCOM000012579D
Original Publication Date: 2003-May-16
Included in the Prior Art Database: 2003-May-16
Document File: 8 page(s) / 74K

Publishing Venue

IBM

Abstract

Indexing Adobe Portable Document Format (PDF) documents for database archival and retrieval creates documents that are 10 to 20 times larger than the original Adobe PDF document supplied by the customer. The process used to index these Adobe PDF files is the cause of this growth. The process described here allows Adobe PDF documents to be indexed while keeping the original file size, or while producing indexed Adobe PDF documents that are smaller than the original. This disclosure assumes the reader has a working knowledge of the Adobe Acrobat PDF development library.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 25% of the total text.

The Current Indexing Process:

    The user selects the fields (words) in the document that they want to index by means of a Minimum Bounding Rectangle (MBR) that wholly contains the field or word. The selected fields, e.g. Account Number, are then used to break the large document into smaller individual documents. In the example case, the customer has a 10,000-page Adobe PDF document containing 1500 customer bills, which must be separated into 1500 individual Adobe PDF documents and then stored in one Binary Large Object (BLOB) in the database. The Adobe PDF file is indexed by means of a parameter file created from the MBRs of the selected fields:

FIELD1=ul(0.28,0.69),lr(3.6,1.03),0
FIELD2=ul(1.4,0.19),lr(2.07,0.51),0
FIELD3=ul(1.37,0.37),lr(2.1,0.65),0
INDEX1='name',FIELD1
INDEX2='sdate',FIELD2
INDEX3='acctnum',FIELD3

(So the name, date and account number index fields must fit within the MBRs described by FIELD1, FIELD2 and FIELD3.)
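
As an illustration of how a FIELD line could be read and used, the following is a minimal sketch in plain C. It does not use the Acrobat library. The Mbr type and the function names parseFieldLine and mbrContains are hypothetical, and the sketch assumes the y coordinate grows downward from the top of the page, which is consistent with the sample values but is not stated in the text. The meaning of the trailing integer on each FIELD line is also not stated, so it is stored without interpretation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical type: a Minimum Bounding Rectangle given by its upper-left
   and lower-right corners, in the units used by the parameter file. */
typedef struct {
    double ulx, uly;  /* upper-left corner  */
    double lrx, lry;  /* lower-right corner */
    int    extra;     /* trailing integer; its meaning is not stated in the text */
} Mbr;

/* Parse a line such as "FIELD1=ul(0.28,0.69),lr(3.6,1.03),0".
   Returns the field number (1, 2, ...) on success, or -1 on any other line. */
int parseFieldLine(const char *line, Mbr *out)
{
    int fieldNo = 0;

    assert(line != NULL && out != NULL);
    if (sscanf(line, "FIELD%d=ul(%lf,%lf),lr(%lf,%lf),%d",
               &fieldNo, &out->ulx, &out->uly,
               &out->lrx, &out->lry, &out->extra) != 6) {
        return -1;
    }
    return fieldNo;
}

/* Returns 1 when the inner rectangle lies wholly inside the outer one.
   Assumes y grows downward, so uly <= lry for every rectangle. */
int mbrContains(const Mbr *outer, const Mbr *inner)
{
    return inner->ulx >= outer->ulx && inner->uly >= outer->uly &&
           inner->lrx <= outer->lrx && inner->lry <= outer->lry;
}
```

A caller would parse each FIELD line once, then test the bounding box of every word on a page with mbrContains to decide whether that word belongs to the field.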

    The field descriptors are used to build a text index file that is stored in the database along with the BLOB of Adobe PDF data. The text located at the field position for each index is extracted, and a new Adobe PDF document is built each time a new account number, name or date is found. Each new Adobe PDF document is concatenated onto the BLOB, and the text index file records the byte offset of that document within the BLOB so that it can be retrieved.
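
The rule that a new document starts whenever an index value changes can be sketched as follows. This is an illustration only, in plain C without the Acrobat library. The PageKeys type and the function findGroupStarts are hypothetical, and the sketch assumes the index text has already been extracted for every page of the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: the index text found on one page of the input. */
typedef struct {
    const char *name;
    const char *sdate;
    const char *acctnum;
} PageKeys;

/* Returns 1 when any of the three index values differs between two pages. */
static int keysDiffer(const PageKeys *a, const PageKeys *b)
{
    return strcmp(a->name, b->name) != 0 ||
           strcmp(a->sdate, b->sdate) != 0 ||
           strcmp(a->acctnum, b->acctnum) != 0;
}

/* Stores the zero-based first page of each group in firstPage[] and returns
   the number of groups found. Stops early if the caller's array fills up. */
size_t findGroupStarts(const PageKeys *pages, size_t nPages,
                       size_t *firstPage, size_t maxGroups)
{
    size_t nGroups = 0;
    size_t i;

    assert(pages != NULL && firstPage != NULL);
    for (i = 0; i < nPages; i++) {
        /* The first page always opens a group; later pages open one
           only when an index value changed from the previous page. */
        if (i == 0 || keysDiffer(&pages[i - 1], &pages[i])) {
            if (nGroups == maxGroups) {
                break;
            }
            firstPage[nGroups++] = i;
        }
    }
    return nGroups;
}
```

Each pair of consecutive entries in firstPage then gives the page range of one individual document to extract.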

Here is an example of this index file:

COMMENT:
COMMENT: Generic Indexer Format
COMMENT:
COMMENT:
COMMENT: Code Page of the Index Data
CODEPAGE:5348
COMMENT: Index Field(s)
GROUP_FIELD_NAME:name
GROUP_FIELD_VALUE:Chucks Trust
GROUP_FIELD_NAME:sdate
GROUP_FIELD_VALUE:12/31/99
GROUP_FIELD_NAME:acctnum
GROUP_FIELD_VALUE:5005645
COMMENT: Index Offsets and Length
GROUP_OFFSET:0
GROUP_LENGTH:747364
GROUP_FILENAME:fred.out
COMMENT: Index Field(s)
GROUP_FIELD_NAME:name
GROUP_FIELD_VALUE:Johns Trust
GROUP_FIELD_NAME:sdate
GROUP_FIELD_VALUE:12/31/99
GROUP_FIELD_NAME:acctnum
GROUP_FIELD_VALUE:5012089
COMMENT: Index Offsets and Length
GROUP_OFFSET:747364
GROUP_LENGTH:747200
GROUP_FILENAME:fred.out
COMMENT: Index Field(s)
GROUP_FIELD_NAME:name
GROUP_FIELD_VALUE:Gregs Trust
GROUP_FIELD_NAME:sdate
GROUP_FIELD_VALUE:12/31/99
GROUP_FIELD_NAME:acctnum
GROUP_FIELD_VALUE:5005806
COMMENT: Index Offsets and Length
GROUP_OFFSET:1494564
GROUP_LENGTH:747306
GROUP_FILENAME:fred.out
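
In the listing above, each GROUP_OFFSET equals the previous group's offset plus its length (0 + 747364 = 747364, and 747364 + 747200 = 1494564). A minimal sketch of writing one such entry while carrying the running offset forward might look like this. The function name writeGroupEntry is hypothetical, the three field names are taken from the sample parameter file, and only the standard C library is used.

```c
#include <assert.h>
#include <stdio.h>

/* Appends one group's entry to an open index file in the layout shown
   above and returns the byte offset at which the next group will start. */
long writeGroupEntry(FILE *idx,
                     const char *name, const char *sdate, const char *acctnum,
                     long offset,   /* where this document starts in the flat file */
                     long length,   /* bytes written for this document */
                     const char *flatFileName)
{
    assert(idx != NULL && offset >= 0 && length >= 0);

    fprintf(idx, "COMMENT: Index Field(s)\n");
    fprintf(idx, "GROUP_FIELD_NAME:name\nGROUP_FIELD_VALUE:%s\n", name);
    fprintf(idx, "GROUP_FIELD_NAME:sdate\nGROUP_FIELD_VALUE:%s\n", sdate);
    fprintf(idx, "GROUP_FIELD_NAME:acctnum\nGROUP_FIELD_VALUE:%s\n", acctnum);
    fprintf(idx, "COMMENT: Index Offsets and Length\n");
    fprintf(idx, "GROUP_OFFSET:%ld\nGROUP_LENGTH:%ld\nGROUP_FILENAME:%s\n",
            offset, length, flatFileName);

    /* The next document is concatenated directly after this one. */
    return offset + length;
}
```

Calling the function for each document in turn, and passing the returned value as the next offset, reproduces the offsets in the sample index file when given the sample lengths.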

    Here is an example of code that could be used to extract the pages and build a flat Adobe PDF file to load into a BLOB. Note that the number of bytes written is returned to the calling function so that the byte offset of each document can be written into the index file.

long extractPages (
    PDDoc docP,          /* Doc handle to the original customer PDF document */
    char* tempName,      /* Temporary storage name for extracted PDF pages */
    char* flatFileName,  /* Name of the flat file to contain the 'stored' PDF docs */
    Int32 pgNumBeg,      /* Beginn...