GPU Accelerated Tokenization In Natural Language Processing Systems

IP.com Disclosure Number: IPCOM000239417D
Publication Date: 2014-Nov-05

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed are a system and method for Graphics Processing Unit (GPU) accelerated tokenization in Natural Language Processing (NLP) systems. The system and method stream in multiple character streams representing raw textual documents and simultaneously stream out result vectors that encode the corresponding tokenized streams according to a given set of input delimiter rules.

Systems that perform unstructured information retrieval over a textual document corpus include an early processing step that takes a document from the corpus, treats it as a "character stream", and converts it into a "token stream". This step is known as tokenization.

Tokenization may include early, upstream processing steps in which primitive conversions transform the raw input from one form into another. This conversion operation is highly input/output (I/O) intensive; for example, the byte stream may be pulled/streamed in from a disk, across a network, or both simultaneously. Processing usually follows a document-centric, single-threaded approach, which consumes a significant number of Central Processing Unit (CPU) cycles on the host.

One way of reducing these cycles and obtaining multi-document parallelism, as in a Single Instruction, Multiple Data (SIMD) processing model, is to stream multiple documents into a general-purpose Graphics Processing Unit (GPU) and complete tokenization there. This enables the user to obtain a stream of tokenized forms from the GPU, thereby offloading significant amounts of processing from the CPU/host to the GPU.

A system and method are disclosed herein for GPU-accelerated tokenization in Natural Language Processing (NLP) systems. The system and method stream in multiple byte streams representing raw textual documents and simultaneously stream out result vectors that encode the corresponding token streams according to a set of delimiter rules. Taking Compute Unified Device Architecture (CUDA) hardware as one embodiment, the GPU streams in the form of multiple Streaming Multiprocessors (SMs) processing warps over byte-array representations of multiple documents simultaneously, with minimal or very short divergence, and the GPU side has appropriate kernels for the processing.
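
The following is a minimal, illustrative CUDA sketch of such a kernel, assuming one thread per input byte and a simple byte-valued delimiter lookup table. The names (mark_delimiters, is_delim) and the flag-vector encoding are assumptions introduced for illustration, not taken from the disclosure.

#include <cstdio>
#include <cuda_runtime.h>

// flags[i] = 1 if byte i is a delimiter, else 0. A later prefix-scan
// pass over the flag vector can turn the flags into token boundary
// offsets (one possible "special form" of result vector).
__global__ void mark_delimiters(const char *bytes, int n,
                                const unsigned char *is_delim,
                                unsigned char *flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Branch-free table lookup: every thread in a warp executes the
        // same instruction, which keeps divergence minimal.
        flags[i] = is_delim[(unsigned char)bytes[i]];
    }
}

int main()
{
    const char text[] = "stream in multiple documents";
    const int n = sizeof(text) - 1;

    // Delimiter rule table: whitespace only, purely for illustration.
    unsigned char is_delim[256] = {0};
    is_delim[' '] = is_delim['\t'] = is_delim['\n'] = 1;

    char *d_bytes; unsigned char *d_table, *d_flags;
    cudaMalloc(&d_bytes, n);
    cudaMalloc(&d_table, 256);
    cudaMalloc(&d_flags, n);
    cudaMemcpy(d_bytes, text, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_table, is_delim, 256, cudaMemcpyHostToDevice);

    mark_delimiters<<<(n + 255) / 256, 256>>>(d_bytes, n, d_table, d_flags);

    unsigned char flags[sizeof(text)];
    cudaMemcpy(flags, d_flags, n, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%c:%d ", text[i], flags[i]);
    printf("\n");

    cudaFree(d_bytes); cudaFree(d_table); cudaFree(d_flags);
    return 0;
}

In the same spirit, several documents could be packed into one byte array so that a single kernel launch processes them all in SIMD fashion, which is the multi-document parallelism described above.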

It is advantageous that embodiments of the present invention provide GPU offload with full SIMD processing of multiple documents in parallel during tokenization. This makes the CPU host available to the rest of the document processing system as a whole for downstream information retrieval tasks that are not conducive to SIMD processing forms.

For added parallelism, document byte streams can be arbitrarily chunked/chopped, with appropriate trailing-overlap processing, to achieve higher GPU thread-level parallelization. The CPU/host side then performs the final token resolution at the chop points. This may be done by examining the last token and first token of adjacent chunks, which have already been tokenized by the GPU. The chunk boundary might or might not fall on a delimiter; if it does not, then the process merges the last token from the left adjacent chunk with the first token from the right adjacent chunk, resulting in...
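
A minimal host-side (C++) sketch of the boundary resolution just described is given below, assuming each GPU-tokenized chunk yields an ordered list of token strings. The Chunk structure, the ends_on_delimiter flag, and merge_boundary are hypothetical names introduced for illustration.

#include <string>
#include <vector>

struct Chunk {
    std::vector<std::string> tokens;  // tokens produced by the GPU for this chunk
    bool ends_on_delimiter;           // true if the chop point fell on a delimiter
};

// If the chop point split a token in two, glue the last token of the
// left chunk to the first token of the right chunk.
void merge_boundary(Chunk &left, Chunk &right)
{
    if (!left.ends_on_delimiter &&
        !left.tokens.empty() && !right.tokens.empty()) {
        left.tokens.back() += right.tokens.front();
        right.tokens.erase(right.tokens.begin());
    }
}

Applying merge_boundary to each pair of adjacent chunks, in order, would reassemble the per-chunk token lists into one token stream per document.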