DataQuest: An Approach to Automatically Extract Dataset Mentions from Scientific Papers

Sandeep Kumar, Tirthankar Ghosal, Asif Ekbal

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

The rapid growth of scientific literature presents several challenges for the search and discovery of research artifacts. Datasets are the backbone of scientific experiments. Locating the datasets used or generated by previous research is crucial, as building suitable datasets is costly in terms of time, money, and human labor. Hence, automated mechanisms for searching and discovering datasets in scientific publications can aid the reproducibility and reusability of these valuable scientific artifacts. In this work, utilizing the next sentence prediction capability of language models, we show that a BERT-based entity recognition model with POS-aware embedding can be used effectively to address this problem. Our investigation shows that first identifying the sentences that contain dataset mentions proves critical to the task. Our method outperforms earlier ones and achieves an F1 score of 56.2 in extracting dataset mentions from research papers on a popular corpus of social science publications. We make our code available at https://github.com/sandeep82945/data_discovery.
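The two-stage idea in the abstract (first find sentences likely to contain a dataset mention, then extract the mention) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper reports a BERT-based model, whereas the sketch swaps in a keyword filter and a regular expression purely to show the control flow. The names `CUE_WORDS`, `has_dataset_cue`, `extract_mentions`, `pipeline`, and `f1_score` are hypothetical, and the exact-match, set-level F1 is an assumption, since the abstract does not state the matching criterion.

```python
import re
from typing import List, Set

# Hypothetical cue words: a stand-in for a learned sentence classifier.
CUE_WORDS = {"dataset", "survey", "corpus", "study"}

# Stand-in for an entity recognizer: a run of capitalized tokens
# that ends in a dataset-like head noun.
MENTION_RE = re.compile(r"(?:[A-Z][\w-]*\s+)+(?:Survey|Dataset|Corpus|Study)")


def has_dataset_cue(sentence: str) -> bool:
    """Stage 1: keep only sentences that contain at least one cue word."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(tokens & CUE_WORDS)


def extract_mentions(sentence: str) -> List[str]:
    """Stage 2: pull candidate dataset mentions out of a sentence."""
    return [m.group(0) for m in MENTION_RE.finditer(sentence)]


def pipeline(sentences: List[str]) -> List[str]:
    """Run stage 2 only on sentences that pass stage 1."""
    mentions: List[str] = []
    for sentence in sentences:
        if has_dataset_cue(sentence):
            mentions.extend(extract_mentions(sentence))
    return mentions


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Exact-match F1 over sets of mention strings (assumed criterion)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


sentences = [
    "We use the National Health Interview Survey for our analysis.",
    "The results are shown in Table 2.",
]
found = pipeline(sentences)
```

Filtering sentences before extraction keeps the extractor from firing on text that has no dataset context at all, which is the intuition behind the abstract's remark that sentence identification is critical to the task.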

Original language: English
Title of host publication: Towards Open and Trustworthy Digital Societies - 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Proceedings
Editors: Hao-Ren Ke, Chei Sian Lee, Kazunari Sugiyama
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 43-53
Number of pages: 11
ISBN (Print): 9783030916688
DOIs
State: Published - 2021
Externally published: Yes
Event: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021 - Virtual, Online
Duration: Dec 1 2021 to Dec 3 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13133 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021
City: Virtual, Online
Period: 12/1/21 to 12/3/21

Funding

Acknowledgement. Sandeep Kumar acknowledges the Prime Minister Research Fellowship (PMRF) program of the Government of India for its support. Asif Ekbal is a recipient of the Visvesvaraya Young Faculty Award and acknowledges Digital India Corporation, Ministry of Electronics and Information Technology, Government of India for supporting this research.

Funders | Funder number
Digital India Corporation
Ministry of Electronics and Information Technology, Government of India
Prime Minister Research Fellowship

Keywords

• Dataset discovery
• Dataset mention extraction
• Deep learning
• Publication mining
