Abstract
The rapid growth of documents across the web has necessitated finding means of discarding redundant documents and retaining novel ones. Capturing redundancy is challenging, as it may require analysis at a deep semantic level. Techniques for detecting such semantic redundancy at the document level are scarce. In this work we propose a deep Convolutional Neural Network (CNN) based model to classify a document as novel or redundant with respect to a set of relevant documents already seen by the system. The system is simple and does not require manual feature engineering. Our novel scheme encodes relevant and relative information from both source and target texts to generate an intermediate representation, for which we coin the name Relative Document Vector (RDV). The proposed method outperforms the existing benchmark on two document-level novelty detection datasets by a margin of ∼5% in terms of accuracy. We further demonstrate the effectiveness of our approach on a standard paraphrase detection dataset, where the paraphrased passages closely resemble semantically redundant documents.
Original language | English |
---|---|
Title of host publication | COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings |
Editors | Emily M. Bender, Leon Derczynski, Pierre Isabelle |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 2802-2813 |
Number of pages | 12 |
ISBN (Electronic) | 9781948087506 |
State | Published - 2018 |
Externally published | Yes |
Event | 27th International Conference on Computational Linguistics, COLING 2018 - Santa Fe, United States; Duration: Aug 20 2018 → Aug 26 2018 |
Publication series
Name | COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings |
---|---|
Conference
Conference | 27th International Conference on Computational Linguistics, COLING 2018 |
---|---|
Country/Territory | United States |
City | Santa Fe |
Period | 08/20/18 → 08/26/18 |
Funding
The first author, Tirthankar Ghosal, acknowledges the Visvesvaraya PhD Scheme for Electronics and IT, an initiative of the Ministry of Electronics and Information Technology (MeitY), Government of India, for fellowship support. Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya PhD Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia). We thank the anonymous reviewers for their valuable feedback and Prof. Donia Scott, University of Sussex, for her advice in the Writing Mentoring Program as part of COLING 2018. We also thank the Elsevier Center of Excellence for Natural Language Processing, Indian Institute of Technology Patna, for its help and support in carrying out this research.