TY - JOUR
T1 - SciND
T2 - a new triplet-based dataset for scientific novelty detection via knowledge graphs
AU - Gupta, Komal
AU - Ahmad, Ammaar
AU - Ghosal, Tirthankar
AU - Ekbal, Asif
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2024
Y1 - 2024
N2 - Detecting texts that contain semantic-level new information is not straightforward. The problem becomes more challenging for research articles. Over the years, many datasets and techniques have been developed to attempt automatic novelty detection. However, the majority of the existing textual novelty detection investigations are targeted toward general domains like newswire. A comprehensive dataset for scientific novelty detection is not available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triples (i) triplet for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from the paper published in the year 2021. For the non-novel articles, we use blog post summaries of the research articles. Our knowledge graph is domain-specific. We build the knowledge graph for seven NLP domains. We further use a feature-based novelty detection scheme from the research articles as a baseline. Moreover, we show the applicability of our proposed dataset using our baseline novelty detection algorithm. Our algorithm yields a baseline F1 score of 72%. We show analysis and discuss the future scope using our proposed dataset. To the best of our knowledge, this is the very first dataset for scientific novelty detection via a knowledge graph. We make our codes and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection .
AB - Detecting texts that contain semantic-level new information is not straightforward. The problem becomes more challenging for research articles. Over the years, many datasets and techniques have been developed to attempt automatic novelty detection. However, the majority of the existing textual novelty detection investigations are targeted toward general domains like newswire. A comprehensive dataset for scientific novelty detection is not available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triples (i) triplet for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from the paper published in the year 2021. For the non-novel articles, we use blog post summaries of the research articles. Our knowledge graph is domain-specific. We build the knowledge graph for seven NLP domains. We further use a feature-based novelty detection scheme from the research articles as a baseline. Moreover, we show the applicability of our proposed dataset using our baseline novelty detection algorithm. Our algorithm yields a baseline F1 score of 72%. We show analysis and discuss the future scope using our proposed dataset. To the best of our knowledge, this is the very first dataset for scientific novelty detection via a knowledge graph. We make our codes and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection .
KW - Data preparation
KW - Information extraction
KW - Novelty detection
KW - Scientific knowledge graph
UR - http://www.scopus.com/inward/record.url?scp=85181720672&partnerID=8YFLogxK
U2 - 10.1007/s00799-023-00386-x
DO - 10.1007/s00799-023-00386-x
M3 - Article
AN - SCOPUS:85181720672
SN - 1432-5012
JO - International Journal on Digital Libraries
JF - International Journal on Digital Libraries
ER -