
Method and System for Implementing a Wildcard Search Engine
Disclosure Number: IPCOM000257036D
Publication Date: 2019-Jan-12
Document File: 5 page(s) / 216K

Publishing Venue

The Prior Art Database

This is the abbreviated version, containing approximately 34% of the total text.

Method and System for Implementing a Wildcard Search Engine

Disclosed is a method and system for indexing terms that are broken down into character sequences for wildcard search, with minimal memory overhead and with no extra filtering step to remove false-positive intermediate results. The method and system ranks terms derived from tokenized text strings for use with an index of character sequences. It collects a set of terms from a document in a corpus of documents and, for each term in the set, collects a set of corresponding character sequences. It then determines a ranking associated with each character sequence in the set. Based on that determination, it stores, for each character sequence, a reference to one or more terms and to the document. The ranking associated with the term is determined based on one or more of: the placement of the character sequence in the term, the number of occurrences of the character sequence in the term, and the ranking of an item associated with the term in the corpus of documents.

In accordance with an embodiment, the method and system indexes a set of documents such as, but not limited to, web pages, using the sorted inverted index strategy of conventional search engines. An index of complete search terms (that is, space-delimited tokens) is stored, for example, in a search tree, skip list, or other structure organized for quick term lookup. The term index is sorted in alphabetical order of the terms. An index of references to character sequences within the terms (that is, sequences made up of non-space characters) is also stored, in a structure similar to that of the term index. The character sequence index is likewise sorted in alphabetical order of the character sequences it references. In an example, the character sequences in the character sequence index can be limited to four characters in length, for scalability.
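The two indexes described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names, the use of plain dictionaries in place of a search tree or skip list, and the lowercasing of terms are all assumptions made for illustration. Only the space-delimited tokenization, the alphabetical ordering, and the four-character limit come from the text.

```python
from collections import defaultdict

MAX_SEQ_LEN = 4  # example limit on character sequence length given in the text


def tokenize(text):
    """Split text into space-delimited terms (lowercasing is an assumption)."""
    return text.lower().split()


def char_sequences(term, max_len=MAX_SEQ_LEN):
    """Return every distinct substring of `term` of length 1..max_len."""
    return {
        term[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(term) - n + 1)
    }


def build_indexes(corpus):
    """Build a term index and a character sequence index.

    corpus: dict mapping document id -> document text.
    Returns (term_index, seq_index), both ordered alphabetically by key:
      term_index: term -> set of ids of documents containing the term
      seq_index:  character sequence -> set of terms containing the sequence
    """
    term_index = defaultdict(set)
    seq_index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            term_index[term].add(doc_id)
            for seq in char_sequences(term):
                seq_index[seq].add(term)
    # Plain dicts keep insertion order, so inserting in sorted order yields
    # alphabetically ordered indexes (a stand-in for a sorted tree or skip list).
    return dict(sorted(term_index.items())), dict(sorted(seq_index.items()))
```

In this sketch each character sequence refers to terms, and the terms refer to documents. Pointing each sequence directly at documents instead would only change what is stored in `seq_index`.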
Entries in the term index, and/or references to those entries, are ranked according to relevance to entries in the character sequence index. The number of references per character sequence is limited to the highest-ranked references, for scalability. A search query that includes wildcards then returns the highest-ranked results associated with the relevant entry (or entries) in the character sequence index. When a wildcard search is performed using the character sequence index and the term index, the method and system performs the search in one of the ways described in the following embodiments. Each entry in the character sequence index either refers directly to the document(s) containing the corresponding sequence or refers to one or more entries in the term index.

FIG. 1 illustrates direct sequence inverse indexing for documents, based on a document cache, in accordance with an embodiment of the method and system.
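The ranking, the per-sequence reference cap, and the wildcard lookup can be sketched in Python as follows. The scoring formula, its weights, the cap `MAX_REFS`, and the restriction to queries of the form `*seq*` are hypothetical choices for illustration; the text names only the signals (placement and number of occurrences) and the idea of keeping the highest-ranked references.

```python
from collections import defaultdict

MAX_SEQ_LEN = 4  # example sequence-length limit from the text
MAX_REFS = 3     # hypothetical cap on references kept per character sequence


def score(seq, term):
    """Hypothetical relevance of `term` for `seq`.

    Combines two signals named in the text: the number of occurrences of
    the sequence in the term, and its placement (earlier scores higher).
    The way they are combined is an illustrative assumption.
    """
    occurrences = sum(
        1 for i in range(len(term) - len(seq) + 1) if term.startswith(seq, i)
    )
    placement = 1.0 / (1 + term.find(seq))
    return occurrences + placement


def build_ranked_seq_index(term_index, max_refs=MAX_REFS):
    """term_index: term -> set of document ids.

    Returns: sequence -> list of (term, sorted document ids), best first,
    truncated to the `max_refs` highest-ranked terms.
    """
    candidates = defaultdict(list)
    for term in term_index:
        seqs = {
            term[i:i + n]
            for n in range(1, MAX_SEQ_LEN + 1)
            for i in range(len(term) - n + 1)
        }
        for seq in seqs:
            candidates[seq].append((score(seq, term), term))
    ranked = {}
    for seq, scored in candidates.items():
        # Highest score first; ties are broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        ranked[seq] = [
            (term, sorted(term_index[term])) for _, term in scored[:max_refs]
        ]
    return ranked


def wildcard_search(pattern, ranked_index):
    """Answer a query such as '*an*' from the ranked sequence index."""
    seq = pattern.strip("*").lower()
    if not seq or "*" in seq or len(seq) > MAX_SEQ_LEN:
        raise ValueError("this sketch only handles '*seq*' with 1 to 4 characters")
    return ranked_index.get(seq, [])
```

Every term stored under a sequence actually contains that sequence, so the lookup returns its entry as-is, without a pass to discard false positives. The trade-off is that terms ranked below the cap are not reachable through that sequence.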

FIG. 2 illustrates direct...