Method of optimising parsing by using multiple threads

IP.com Disclosure Number: IPCOM000249254D
Publication Date: 2017-Feb-14
Document File: 4 page(s) / 72K

Publishing Venue

The IP.com Prior Art Database

Abstract

This article presents a multi-threaded solution to document parsing, enabling parser performance to increase as CPU power is horizontally scaled. This is beneficial to the parsing of large documents.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 50% of the total text.

Software or hardware based parsers parse a document in a format such as XML, JSON, CSV, or HL7 (or any other binary or text format) into a structure suitable for consumption by a software program. Such parsers can also validate the structure and data inside the document according to a schema, such as XML Schema, JSON Schema, or DFDL Schema.

Parsers are traditionally single-threaded. With current trends favouring the horizontal scaling of CPU power (adding more cores/threads) over vertical scaling (increasing clock speeds), the performance of a single-threaded parser may not continue to increase.

It is possible to distribute a parser's work across multiple threads so that parser performance can increase as CPU power is horizontally scaled. This multi-threaded approach is of most benefit when parsing large documents. The operational flow of a typical parser is described first, as it forms the basis of the multi-threaded parser pipeline:

1. Scan the document for markup that indicates the location of values inside the document:

For example, curly brace { and } characters indicate the start and end of a JSON object. Double quote characters indicate the start and end of a JSON string.

2. Extract the values in between the markup:

Collect everything in between the double quote characters as the value of the JSON string.

3. Convert the values into a format suitable for consumption by the application:

Many applications store parsed values in a general purpose encoding such as UTF-16, so the JSON string may be converted from EBCDIC to UTF-16 at this point.

4. Validate the value in accordance with a schema:

A JSON Schema might state that the value of the string is an enumeration, and must have the value CUSTOMER, ORDER, or OPPORTUNITY.

5. Pass the value to the consuming application:

Many parsers implement a streaming (SAX) style interface where values are passed as events to the consuming application.

At this point the consuming application may create application specific objects to store the values.

Since this is traditionally done by the thread doing the parsing, it blocks further parsing until the consuming application has finished.
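
The five steps above can be sketched as a traditional single-threaded parser. This is a minimal illustration only, not the disclosure's implementation: the function names (scan, extract, convert, validate, parse) are hypothetical, the input is assumed to be a flat JSON array of strings encoded in EBCDIC code page 037, and escaped quote characters are ignored.

```python
# Enumeration taken from the schema example above, pre-encoded as UTF-16.
ALLOWED = {s.encode("utf-16-le") for s in ("CUSTOMER", "ORDER", "OPPORTUNITY")}


def scan(data: bytes, quote: int) -> list:
    """Step 1: find the (start, end) offsets of each pair of quote characters."""
    positions = [i for i, b in enumerate(data) if b == quote]
    # Pair opening and closing quotes; escaped quotes are ignored in this sketch.
    return list(zip(positions[0::2], positions[1::2]))


def extract(data: bytes, span: tuple) -> bytes:
    """Step 2: collect everything in between the two quote characters."""
    start, end = span
    return data[start + 1:end]


def convert(raw: bytes, source_encoding: str) -> bytes:
    """Step 3: convert from the document encoding to UTF-16."""
    return raw.decode(source_encoding).encode("utf-16-le")


def validate(value: bytes) -> bytes:
    """Step 4: check the value against the enumeration."""
    if value not in ALLOWED:
        raise ValueError(f"value not in enumeration: {value.decode('utf-16-le')!r}")
    return value


def parse(data: bytes, on_value, source_encoding: str = "cp037") -> None:
    """Run all five steps on the calling thread, one value at a time."""
    quote = '"'.encode(source_encoding)[0]
    for span in scan(data, quote):
        value = validate(convert(extract(data, span), source_encoding))
        # Step 5: parsing cannot continue until this callback returns.
        on_value(value)


events = []
parse('["CUSTOMER", "ORDER"]'.encode("cp037"), events.append)
print([e.decode("utf-16-le") for e in events])
```

Because parse calls on_value on the same thread that scans the document, any time spent in the consuming application delays the scanning of the next value, which is the blocking behaviour described in step 5.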

The requirement is to convert these steps into a multi-threaded pipeline, where each thread is responsible for one or more steps (but not all of them). The threads pass work between the different stages of the pipeline using in-memory queues. Since performance is the focus of this idea, a lock-free queue such as an SPSC (single-producer single-consumer) queue would be desirable.
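
One possible shape for such a pipeline is sketched below, with one thread per step and one queue between each pair of adjacent stages. Several things here are assumptions made for illustration: the names run_pipeline, stage and DONE are hypothetical, the input is a flat JSON array of strings in EBCDIC code page 037, and the validator reports failures as a flag rather than raising an error. Python's queue.Queue is lock-based, so it only stands in for the lock-free SPSC queue suggested above, and CPython's global interpreter lock limits how much of this work actually runs in parallel. The sketch shows the structure of the pipeline, not its performance.

```python
import queue
import threading

DONE = object()  # sentinel passed down the pipeline to mark end of document


def stage(func, inbox, outbox):
    """Generic middle stage: apply func to each item until DONE arrives."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # forward the sentinel so later stages stop too
            return
        outbox.put(func(item))


def run_pipeline(data: bytes, allowed: set, encoding: str = "cp037") -> list:
    # Each queue has exactly one producer and one consumer (SPSC topology).
    q_spans, q_raw, q_utf16, q_checked = (queue.Queue(maxsize=64) for _ in range(4))
    quote = '"'.encode(encoding)[0]
    results = []

    def scanner():
        # Step 1: emit (start, end) offsets for each pair of quote characters.
        start = None
        for i, b in enumerate(data):
            if b == quote:
                if start is None:
                    start = i
                else:
                    q_spans.put((start, i))
                    start = None
        q_spans.put(DONE)

    def extract(span):
        # Step 2: slice out the bytes in between the markup.
        return data[span[0] + 1:span[1]]

    def convert(raw):
        # Step 3: convert from the document encoding to UTF-16.
        return raw.decode(encoding).encode("utf-16-le")

    def validate(value):
        # Step 4: flag whether the value is in the enumeration.
        text = value.decode("utf-16-le")
        return (text, text in allowed)

    def consumer():
        # Step 5: the consuming application stores each value as it arrives.
        while True:
            item = q_checked.get()
            if item is DONE:
                return
            results.append(item)

    threads = [
        threading.Thread(target=scanner),
        threading.Thread(target=stage, args=(extract, q_spans, q_raw)),
        threading.Thread(target=stage, args=(convert, q_raw, q_utf16)),
        threading.Thread(target=stage, args=(validate, q_utf16, q_checked)),
        threading.Thread(target=consumer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, run_pipeline('["CUSTOMER", "INVOICE"]'.encode("cp037"), {"CUSTOMER", "ORDER", "OPPORTUNITY"}) returns the values in document order, each paired with its validation flag. Order is preserved because every stage is a single thread reading from a FIFO queue. The bounded queue size (64 here, an arbitrary choice) provides back-pressure, so a slow consumer eventually stalls the scanner rather than letting queues grow without limit.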

By way of example, consider scanning the following basic JSON document:

{ "element1": "value1", "element2": "value2", "element3": "value3", "element4": "value4", "element5": "value5" }

The steps would look as follows:

Time | Thread 1 (scanner) | Thread 2 (extractor) | Thread 3 (converter) | Thread 4 (validator) | Thread 5 (consumer)

0 Identifies markup surroundin...