Abstract
The rapid growth of scientific literature poses several challenges for the search and discovery of research artifacts. Datasets are the backbone of scientific experiments, and building suitable ones is costly in time, money, and human labor, so it is crucial to locate the datasets used or generated by previous research. Automated mechanisms for finding datasets in scientific publications can therefore improve the reproducibility and reusability of these valuable artifacts. In this work, utilizing the next sentence prediction capability of language models, we show that a BERT-based entity recognition model with POS-aware embeddings can be used effectively to address this problem. Our investigation shows that first identifying the sentences that contain dataset mentions is critical to the task. Our method outperforms earlier ones and achieves an F1 score of 56.2 in extracting dataset mentions from research papers on a popular corpus of social science publications. Our code is available at https://github.com/sandeep82945/data_discovery.
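As a rough illustration of how an F1 score for dataset mention extraction can be computed, the sketch below decodes B/I/O token tags into mention spans and scores predictions against gold spans by exact match. The tag scheme, the exact-match criterion, and the function names `bio_to_spans` and `precision_recall_f1` are assumptions made for this example; the abstract does not describe the evaluation protocol, and this is not taken from the authors' repository.

```python
from typing import List, Set, Tuple

# (sentence index, start token, end token exclusive); layout is illustrative
Span = Tuple[int, int, int]


def bio_to_spans(tags: List[str], sent_idx: int = 0) -> Set[Span]:
    """Collect mention spans from one sentence's B/I/O tag sequence."""
    spans: Set[Span] = set()
    start = None
    # A trailing "O" sentinel closes a mention that runs to the sentence end.
    for i, tag in enumerate(tags + ["O"]):
        if tag != "I" and start is not None:
            spans.add((sent_idx, start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match span scoring (an assumed criterion, not the paper's stated one)."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: two gold mentions, of which the tagger recovers one.
gold_spans = bio_to_spans(["O", "B", "I", "O", "B"])
pred_spans = bio_to_spans(["O", "B", "I", "O", "O"])
precision, recall, f1 = precision_recall_f1(gold_spans, pred_spans)
```

On this toy input the tagger finds one of the two gold mentions and predicts nothing spurious, so precision is 1.0 and recall is 0.5. The 56.2 reported in the abstract is presumably on a 0 to 100 scale, whereas this function returns values between 0 and 1.
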
Original language | English |
---|---|
Title of host publication | Towards Open and Trustworthy Digital Societies - 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Proceedings |
Editors | Hao-Ren Ke, Chei Sian Lee, Kazunari Sugiyama |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 43-53 |
Number of pages | 11 |
ISBN (Print) | 9783030916688 |
State | Published - 2021 |
Event | 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021 - Virtual, Online. Duration: Dec 1 2021 → Dec 3 2021 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 13133 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021 |
---|---|
City | Virtual, Online |
Period | 12/1/21 → 12/3/21 |
Funding
Acknowledgement. Sandeep Kumar acknowledges the Prime Minister Research Fellowship (PMRF) program of the Government of India for its support. Asif Ekbal is a recipient of the Visvesvaraya Young Faculty Award and acknowledges Digital India Corporation, Ministry of Electronics and Information Technology, Government of India for supporting this research.
Keywords
- Dataset discovery
- Dataset mention extraction
- Deep learning
- Publication mining