Browse Prior Art Database

Publish Content for deletion and bulk updates through constraints while crawling and indexing by a Search Engine
Disclosure Number: IPCOM000236890D
Publication Date: 2014-May-21
Document File: 3 page(s) / 26K

Publishing Venue

The Prior Art Database


The size and number of content systems in the enterprise is growing, and it is mandatory that all content be searchable. This disclosure improves upon prior search engine crawling by supporting bulk operations for deletion, access control or visibility, and other updates, especially at a library or personal level.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 38% of the total text.

Different content systems organize their content management differently; in most systems, however, the content is organized in one way or another, and we can refer to this organization as libraries, folders, or categories, which may also be hierarchical. The content systems must publish their content to the enterprise search engine in some format, and they must also report deleted documents. When a specific item is deleted, it is easy to track that event and publish it to the search engine; when a full library, folder, or category is deleted with all its content, however, it is much harder to track down the full list of affected content items. Sometimes the list of items in a library can be huge, and tracking the whole list and sending it to the crawler in the next crawling session can be very expensive.

This innovation focuses on solving the problem of deleting full libraries from content systems and on how those deletions can be reflected to the enterprise search engine in the most effective manner. "Library" is used here as a general term that can also refer to categories and folders. The assumption is that a central enterprise search engine needs to crawl many content systems, so each content system must reply with the list of relevant content each time the crawler approaches it.

The obvious solution is to track all the deleted libraries and their content items in the content management system and to store them for some period of time, until the crawler requests those updates for the search index (in this case, deletion events). This obvious solution is very expensive in terms of resource consumption, as you need to store the full list of content items and publish them to the search engine when requested. In addition, this obvious solution is tricky to implement in some systems: you may get a notification of the library deletion, but all the content items may be deleted right away, so you may not have the opportunity to extract the full list of content in the library.

The proposed solution in this invention is to track just the IDs of the deleted libraries; when the crawler approaches the content system, it returns just the list of deleted libraries and instructs the search engine to delete all content items associated with those libraries. The obvious advantage of this solution is that only minimal information needs to be tracked. The information passed to the search engine is minimal, and the search engine can also optimize its code for deleting all the relevant content items. This approach can be referred to as "Deletion by Filter", as the search engine deletes all the documents that match a specific list of fields/filters. We can abbreviate it to "DbF".
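The DbF flow above can be sketched end to end. This is a simplified illustration under assumed names (an in-memory index where each document records its library ID), not the disclosure's actual implementation.

```python
# Sketch of "Deletion by Filter" (DbF): the content system tracks only the
# IDs of deleted libraries, and the search engine deletes every indexed
# document whose library field matches. All names are illustrative.

class SearchIndex:
    def __init__(self):
        self.docs = {}  # doc_id -> {"library_id": ..., "body": ...}

    def index(self, doc_id, library_id, body):
        self.docs[doc_id] = {"library_id": library_id, "body": body}

    def delete_by_filter(self, field, value):
        """Delete every document whose `field` equals `value`; return count."""
        matching = [d for d, meta in self.docs.items() if meta.get(field) == value]
        for doc_id in matching:
            del self.docs[doc_id]
        return len(matching)


class ContentSystem:
    """Tracks only deleted library IDs, never the per-item lists."""

    def __init__(self):
        self.deleted_library_ids = []

    def delete_library(self, library_id):
        self.deleted_library_ids.append(library_id)

    def crawl_response(self):
        # Hand the crawler the deletion filters, then clear the tracked state.
        deleted, self.deleted_library_ids = self.deleted_library_ids, []
        return deleted


# The crawler applies each deleted-library ID as a deletion filter:
index = SearchIndex()
index.index("doc1", "libA", "...")
index.index("doc2", "libA", "...")
index.index("doc3", "libB", "...")

cs = ContentSystem()
cs.delete_library("libA")

for lib_id in cs.crawl_response():
    index.delete_by_filter("library_id", lib_id)

remaining = sorted(index.docs)
print(remaining)  # ['doc3']
```

Note that the content system transmits one ID per deleted library regardless of how many items the library held; the per-item work is pushed to the search engine, where it can be optimized.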

Another example of a scenario where DbF can be very effective is a publishing content system that supports the nota...