Apparatus for extracting Eastern European special characters from a PDF file using the Adobe Acrobat PDF Library

IP.com Disclosure Number: IPCOM000016549D
Original Publication Date: 2003-Jun-27
Included in the Prior Art Database: 2003-Jun-27
Document File: 3 page(s) / 47K

Publishing Venue

IBM

Abstract

Apparatus for extracting Eastern European special characters to ASCII text from an Adobe PDF file using the Adobe Acrobat PDF Library. This disclosure assumes the user has a working knowledge of the Adobe Acrobat PDF Library development environment.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 54% of the total text.

This disclosure provides a procedure for extracting Eastern European text containing special characters from an Adobe PDF file using the native Adobe Acrobat PDF Library. Without this method, special characters cannot be extracted to ASCII (ISO8859-2) text using the standard Adobe Acrobat library function calls PDDocCreateWordFinder and PDWordGetString. The PDDocCreateWordFinder function assumes all PDF text to be in WinAnsiEncoding unless specifically stated otherwise. If PDDocCreateWordFinder is used on an Eastern European document and the encoding is not explicitly specified with the font, the following characters cannot be extracted to ASCII (ISO8859-2) text, because they are not represented by the same code points in the ISO8859-2 code page and in the Adobe PDF document:

1. T Caron
2. Z Acute
3. S Caron
4. E Caron
5. C Caron
6. R Caron
7. U Ring
8. D Caron
9. N Caron
10. L Acute
11. L Caron

To process the characters listed above correctly, one must use the Adobe Acrobat library function calls PDDocCreateWordFinderUCS and PDWordGetString, along with some additional source code, to extract these characters to ASCII (ISO8859-2) text. The PDDocCreateWordFinderUCS function call converts the Adobe PDF text within the document to the Unicode (UCS-2) character set. After the word finder is created with PDDocCreateWordFinderUCS and the text is retrieved from the Adobe PDF document with PDWordGetString, the Unicode character string is passed to the standard ANSI C function wcstombs, which converts the string from the wide character set (UCS-2) to the platform-specific multibyte character set; in other words, from UCS-2 (Unicode) to the multibyte code page in use on the running operating system.

When the software runs in the Windows/DOS environment, the 11 characters listed above must first be converted from the standard ISO8859-2 code page to Windows code page 1250 before wcstombs is called, because the code points for these characters differ between Windows code page 1250 and ISO8859-2. Our example test case runs in the DOS/Windows environment, so the text string returned by the Adobe PDF library function PDWordGetString() must also be byte swapped before it is passed to wcstombs. On Windows/DOS, this byte swap must take place after the special characters are converted from ISO8859-2 to Windows code page 1250, because the string is stored in the Adobe PDF document in Motorola processor format, better known as big-endian format. Intel, DOS/Windo...
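
The byte-order and wcstombs steps described above can be sketched in C as follows. This is a minimal illustration, not the disclosure's own source code: the helper names ucs2be_to_wcs and ucs2be_to_mbs are hypothetical, the input buffer stands in for the bytes returned by PDWordGetString (the Adobe library calls themselves are not shown), and the conversion of the 11 special characters to code page 1250 is omitted. Instead of swapping bytes in place, the sketch assembles each 16-bit unit from its high and low bytes, which gives the same result as a byte swap on Intel hardware and also works where wchar_t is wider than 16 bits.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode n_units big-endian UCS-2 code units from src
 * into dst as host-order wide characters, then null-terminate dst.
 * dst must have room for n_units + 1 elements. */
void ucs2be_to_wcs(const unsigned char *src, size_t n_units, wchar_t *dst)
{
    for (size_t i = 0; i < n_units; i++) {
        /* In big-endian (Motorola) order the high byte comes first. */
        dst[i] = (wchar_t)(((unsigned)src[2 * i] << 8) | src[2 * i + 1]);
    }
    dst[n_units] = L'\0';
}

/* Hypothetical helper: convert big-endian UCS-2 bytes to a multibyte string
 * in the code page of the current C locale. Returns the number of bytes
 * written (excluding the terminator), or (size_t)-1 on failure. */
size_t ucs2be_to_mbs(const unsigned char *src, size_t n_units,
                     char *out, size_t out_size)
{
    wchar_t wide[256];  /* assumption: a single word fits in 255 units */

    if (out_size == 0 || n_units >= sizeof wide / sizeof wide[0])
        return (size_t)-1;

    ucs2be_to_wcs(src, n_units, wide);

    /* wcstombs fails if a character has no representation in the locale. */
    size_t n = wcstombs(out, wide, out_size - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = '\0';  /* wcstombs does not terminate when the buffer fills */
    return n;
}
```

A caller would typically select the target code page first, for example with setlocale(LC_CTYPE, ""), and then pass each word's bytes and length in 16-bit units to ucs2be_to_mbs. Which locale names map to ISO8859-2 or code page 1250 depends on the platform.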