Towards Automatic Dataset Discovery From Scientific Publications

Sandeep Kumar, Tirthankar Ghosal, Asif Ekbal

Research output: Contribution to journal > Article > peer-review

Abstract

Datasets are a crucial artifact in research, and there is always a high demand for good datasets. Due to rapid scientific progress and exponential growth in the scientific literature, there has been a proportionate increase in the number of datasets. However, many datasets and research studies are left unexplored and under-utilized because they are not easily discoverable, leading to duplicated effort. Building good datasets is costly in terms of time, money, and human effort. Hence, automated tools that facilitate the search and discovery of datasets are crucial to the scientific community. In this work, we investigate a deep neural network-based architecture to automate dataset discovery from scientific publications. We address two tasks: dataset mention extraction and entity linking. Our method outperforms earlier approaches and achieves an F1 score of 56.24 in extracting dataset mentions from research papers on a popular corpus of social science publications. Our approach also outperforms prior work and achieves a precision score of 88.63 in linking research papers to a dataset knowledge base on another popular corpus of social science publications. We hope that this system will further promote data sharing, reduce researchers' workload in identifying the right dataset, and increase the reusability of datasets.

Original language: English
Journal: IEEE Access
State: Accepted/In press - 2025

Keywords

  • Discovering Datasets
  • Extracting Dataset References
  • Mining Publications
  • Natural Language Processing
