Abstract
Detecting the novelty of an entire document is a frontier problem in Artificial Intelligence (AI). It is important to a wide range of Natural Language Processing (NLP) applications, from extractive document summarization to tracking the development of news events to predicting the impact of scholarly articles. Although the problem is highly relevant given the present exponential growth of duplicated data, we are unaware of any document-level dataset that correctly addresses the evaluation of automatic novelty detection techniques in a classification framework. To bridge this gap, we present a resource for benchmarking document-level novelty detection techniques. We create the resource by periodically crawling topic-specific news documents across several domains. We release the annotated corpus with the necessary statistics and demonstrate its use with a system developed for this problem.
Original language | English |
---|---|
Title of host publication | LREC 2018 - 11th International Conference on Language Resources and Evaluation |
Editors | Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga |
Publisher | European Language Resources Association (ELRA) |
Pages | 3541-3547 |
Number of pages | 7 |
ISBN (Electronic) | 9791095546009 |
State | Published - 2019 |
Externally published | Yes |
Event | 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan Duration: May 7 2018 → May 12 2018 |
Publication series
Name | LREC 2018 - 11th International Conference on Language Resources and Evaluation |
---|
Conference
Conference | 11th International Conference on Language Resources and Evaluation, LREC 2018 |
---|---|
Country/Territory | Japan |
City | Miyazaki |
Period | 05/7/18 → 05/12/18 |
Funding
Tirthankar Ghosal and Asif Ekbal gratefully acknowledge Visvesvaraya PhD scheme for Electronics and Information Technology, an initiative of Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia). Tirthankar is a PhD scholar under Visvesvaraya PhD scheme and Asif is the recipient of Sir Visvesvaraya Young Faculty Research Fellow Award. The work is also generously supported by Elsevier Centre of Excellence for Natural Language Processing, Department of Computer Science and Engineering, Indian Institute of Technology Patna. We would also like to thank the anonymous reviewers for their valuable suggestions in improving this work.
Keywords
- Classification
- Corpus
- Document level novelty detection
- Web crawling