Method on detect duplicate files based on different encoding

IP.com Disclosure Number: IPCOM000236501D
Publication Date: 2014-Apr-30
Document File: 2 page(s) / 71K

Publishing Venue

The IP.com Prior Art Database

Abstract

Data can be stored in different encoding schemes, such as Unicode, EBCDIC, and ASCII. If the same data is stored in a database under different encoding schemes, storage is duplicated and, more importantly, search performance and result accuracy suffer. This duplication can easily be avoided by checking the encoding scheme of all data before storing it in the database.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 61% of the total text.

http://office.microsoft.com/en-sg/access-help/find-and-remove-duplicate-data-HA010341696.aspx

http://download.cnet.com/Auslogics-Duplicate-File-Finder/3000-2248_4-10964299.html

http://www.pcworld.com/article/2032515/how-to-find-and-remove-duplicate-files.html

http://www.google.com.hk/patents/WO2011062387A2?cl=en&dq=file+duplicate&hl=zh-CN&sa=X&ei=iHjGUfurO_CyiQePhIC4CA&ved=0CDYQ6AEwAA

not same method

2) http://www.google.com.hk/patents/US6938083?dq=file+duplicate&hl=zh-CN&sa=X&ei=iHjGUfurO_CyiQePhIC4CA&ved=0CFoQ6AEwBA

duplication

of the files in a database. With the new mechanism, the database administrator can get rid of much more unwanted data simply by converting the data to Unicode before the regular comparison.
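As a rough illustration of converting to Unicode before the regular comparison, the following Python sketch decodes each file's bytes and hashes the resulting text. It is only a sketch under stated assumptions: the function names `unicode_fingerprint` and `is_duplicate` are hypothetical, the encoding of each file is assumed to be already known (detecting it is a separate problem), `cp037` stands in for one EBCDIC variant, and the NFC normalization step is an optional addition not described in the disclosure.

```python
import hashlib
import unicodedata


def unicode_fingerprint(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given codec and hash the Unicode text."""
    text = raw.decode(encoding)
    # Assumption (not in the disclosure): normalize to NFC so that
    # canonically equivalent character sequences produce the same hash.
    text = unicodedata.normalize("NFC", text)
    # Hash one canonical byte form (UTF-8) of the decoded text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_duplicate(raw: bytes, encoding: str, known_fingerprints: set) -> bool:
    """Return True if the decoded content was seen before, else record it."""
    fingerprint = unicode_fingerprint(raw, encoding)
    if fingerprint in known_fingerprints:
        return True
    known_fingerprints.add(fingerprint)
    return False


# The same text stored under three encodings: the bytes differ,
# but the decoded Unicode content is identical.
sample = "Hello, database"
as_ascii = sample.encode("ascii")
as_ebcdic = sample.encode("cp037")   # one EBCDIC code page
as_utf16 = sample.encode("utf-16")
```

With an initially empty set of fingerprints, storing `as_ascii` first and then checking `as_ebcdic` or `as_utf16` would report a duplicate, even though a plain byte-for-byte comparison would treat the three as distinct files.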

The flow would be like this.

In this age of data explosion, the amount of data stored in databases grows exponentially year by year. However, many of the stored files are essentially the "same": their binary contents differ, but once the content is converted to Unicode, they may be exactly the same or 90% the same. This is especially common in databases behind websites that offer a reward mechanism encouraging subscribers to upload more data. This situation not only wastes a lot of storage but also affects users who search the files in those databases. With this new method, we can check the content of a file, decode it to Unicode, and then compare it with the existing files. If the d...
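The "90% same" case can be illustrated with a similarity ratio computed on the decoded text. The disclosure does not say how the percentage is measured, so this sketch assumes the ratio from Python's standard `difflib.SequenceMatcher`; the function names, the `threshold` parameter, and its default of 0.9 are hypothetical, and each file's encoding is again assumed to be known.

```python
from difflib import SequenceMatcher


def unicode_similarity(raw_a: bytes, enc_a: str, raw_b: bytes, enc_b: str) -> float:
    """Decode both byte strings to Unicode and return a ratio in [0, 1]."""
    text_a = raw_a.decode(enc_a)
    text_b = raw_b.decode(enc_b)
    # ratio() is 2 * matches / (len(a) + len(b)); 1.0 means identical text.
    return SequenceMatcher(None, text_a, text_b).ratio()


def is_near_duplicate(raw_a: bytes, enc_a: str, raw_b: bytes, enc_b: str,
                      threshold: float = 0.9) -> bool:
    """Assumed rule: treat files as near-duplicates at or above the threshold."""
    return unicode_similarity(raw_a, enc_a, raw_b, enc_b) >= threshold


# Two files in different encodings whose text differs only in the last word.
file_a = "The quick brown fox jumps over the lazy dog".encode("cp037")
file_b = "The quick brown fox jumps over the lazy cat".encode("utf-16")
```

Comparing character by character is quadratic in the worst case, so for large files a real system would more likely compare hashes of chunks or shingles; the sketch only shows where the Unicode conversion fits before the comparison.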