SPS2010: different documents are indexed as duplicates

In search results, different documents are sometimes treated as duplicates of each other. If the Remove Duplicate Results option is enabled in the Search Core Results Web Part, some of that content will be missing from the results.

Cause:

This is expected behavior in some scenarios.

Explanation:

The algorithm used to tell documents apart is shingling: http://en.wikipedia.org/wiki/W-shingling
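To make the idea concrete, here is a minimal sketch of w-shingling as described in the Wikipedia article, not the actual SharePoint implementation. The function names, the word-based tokenization, and the window size w=4 are assumptions chosen for illustration.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard resemblance of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Two different documents built from the same template (hypothetical text).
doc1 = "quarterly report template for the sales team region north"
doc2 = "quarterly report template for the sales team region south"
score = resemblance(doc1, doc2)  # 5 shared shingles out of 7 distinct ones
```

Documents that differ only in a small portion share most of their shingles, which is why distinct files created from the same template can end up looking like near duplicates.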

In SharePoint 2010 the MSSDocSdids table contains the DuplicateHashes column.

The DuplicateHashes column stores a Duplicate Identifier Block, which is used to identify a portion of an item.

Duplication identifiers are used for duplicate result removal only if their value is not zero. If two items have the same nonzero duplication identifier, there is a high probability that the documents are similar.
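The rule above can be sketched as follows. This is only an illustration of the described behavior, assuming each result carries a single integer identifier; the function name, the tuple layout, and the sample values are hypothetical and do not reflect how SharePoint stores or compares the Duplicate Identifier Block internally.

```python
def remove_duplicates(results):
    """Keep the first result per nonzero duplicate id; always keep id 0.

    results is a list of (title, dup_id) tuples (a hypothetical layout).
    """
    seen = set()
    kept = []
    for title, dup_id in results:
        if dup_id != 0:
            if dup_id in seen:
                # Same nonzero identifier as an earlier result: drop it.
                continue
            seen.add(dup_id)
        # A zero identifier is never used for duplicate removal.
        kept.append(title)
    return kept


results = [
    ("a.docx", 17),
    ("b.docx", 17),  # different file, same nonzero identifier as a.docx
    ("c.docx", 0),
    ("d.docx", 0),   # zero identifiers are never collapsed
    ("e.docx", 42),
]
visible = remove_duplicates(results)
```

In this sketch b.docx is hidden even though it is a different file, which mirrors the symptom described at the top of this post.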

Workaround:

Uncheck Remove Duplicate Results in the Search Core Results Web Part.

For SharePoint 2013 scenarios, the following article has some interesting information: https://blogs.realdolmen.com/experts/2015/04/09/sharepoint-deep-dive-exploration-explaining-duplicate-detection-in-sharepoint-server-2013/

Sources:

http://blogs.technet.com/b/harikumh/archive/2008/11/14/some-interesting-facts-about-sharepoint-2007-search.aspx

http://blogs.technet.com/b/jpradeep/archive/2010/09/29/moss-2007-duplicate-search-results.aspx
