TY - JOUR
T1 - Towards Automatic Dataset Discovery From Scientific Publications
AU - Kumar, Sandeep
AU - Ghosal, Tirthankar
AU - Ekbal, Asif
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2025
Y1 - 2025
AB - Datasets are a crucial artifact in research, and good datasets are always in high demand. Due to rapid scientific progress and exponential growth in the scientific literature, there has been a proportionate increase in the number of datasets. However, many datasets and research studies remain unexplored and under-utilized because they are not easily discoverable, leading to duplicated effort. Building good datasets is costly in terms of time, money, and human effort. Hence, automated tools to facilitate the search and discovery of datasets are crucial to the scientific community. In this work, we investigate a deep neural network-based architecture to automate dataset discovery from scientific publications. We perform two tasks, namely dataset mention extraction and entity linking. Our method outperforms earlier ones and achieves an F1 score of 56.24 in extracting dataset mentions from research papers on a popular corpus of social science publications. Our approach also outperforms prior research and achieves a precision score of 88.63 in linking research papers to a dataset knowledge base for another popular corpus of social science publications. We hope that this system will further promote data sharing, offset researchers' workload in identifying the right dataset, and increase the reusability of datasets.
KW - Discovering Datasets
KW - Extracting Dataset References
KW - Mining Publications
KW - Natural Language Processing
UR - http://www.scopus.com/inward/record.url?scp=85216309933&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2025.3532767
DO - 10.1109/ACCESS.2025.3532767
M3 - Article
AN - SCOPUS:85216309933
SN - 2169-3536
JO - IEEE Access
JF - IEEE Access
ER -