Massively scalable near duplicate detection in streams of documents using MDSH

Paul Logasa Bogen, Christopher T. Symons, Amber McKenzie, Robert M. Patton, Robert E. Gillen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In a world where large-scale text collections are not only becoming ubiquitous but are also growing at increasing rates, near-duplicate documents are a growing concern with the potential to hinder many different information filtering tasks. While others have tried to address this problem, prior techniques have been applied only to limited collection sizes and static cases. We briefly describe the problem in the context of Open Source analysis, along with our additional performance constraints. In this work we propose two variations on Multi-dimensional Spectral Hash (MDSH) tailored for extremely large, growing sets of text documents. We analyze the memory and runtime characteristics of our techniques and provide an informal analysis of the quality of the near-duplicate clusters they produce.

Original language: English
Title of host publication: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013
Publisher: IEEE Computer Society
Pages: 480-486
Number of pages: 7
ISBN (Print): 9781479912926
State: Published - 2013
Event: 2013 IEEE International Conference on Big Data, Big Data 2013 - Santa Clara, CA, United States
Duration: Oct 6 2013 to Oct 9 2013

Keywords

  • Big Data
  • MDSH
  • Near Duplicate Detection
  • Open Source Intelligence
  • Streaming Text
