Abstract
Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are available only for a limited number of subject domains. In this work, we investigate whether weakly supervised data generation can be used to develop relevance models. We generate (document, query, label) triples automatically and use them as the training set for a classification model. Published documents are sampled from an open access repository, and the concepts appearing in these documents are used as queries. The location at which each query concept occurs within a document determines its relevance label. We find that a classification model trained on this synthetic data learns to tag documents according to their relevance to a query surprisingly well, providing an 11% F-score improvement over a model trained on ground truth data.
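The triple-generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which document locations were used or how they map to labels, so the section names (`title`, `abstract`, `body`), the three-level label scale, and the helper names `label_for` and `make_triples` are all assumptions made for illustration and are not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical mapping from document location to relevance label,
# ordered from most to least prominent. The source does not specify
# the actual locations or label values.
SECTION_LABELS = [("title", 2), ("abstract", 1), ("body", 0)]


@dataclass
class Triple:
    doc_id: str
    query: str
    label: int


def label_for(concept, sections):
    """Return the label of the most prominent section that mentions
    the concept, or None if the concept does not occur at all."""
    needle = concept.lower()
    for name, label in SECTION_LABELS:
        if needle in sections.get(name, "").lower():
            return label
    return None


def make_triples(doc_id, sections, concepts):
    """Build weakly labeled (document, query, label) triples for one document."""
    triples = []
    for concept in concepts:
        label = label_for(concept, sections)
        if label is not None:  # skip concepts that never occur
            triples.append(Triple(doc_id, concept, label))
    return triples


doc = {
    "title": "Weak supervision for relevance tagging",
    "abstract": "We train a classification model.",
    "body": "Documents come from an open access repository.",
}
triples = make_triples(
    "d1", doc, ["weak supervision", "classification", "open access", "genomics"]
)
```

In this sketch a concept found in the title receives the highest label, and a concept that never occurs yields no triple. Simple substring matching stands in for whatever concept extraction the authors used.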
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 |
| Editors | J. Stephen Downie, Dana McKay, Hussein Suleman, David M. Nichols, Faryaneh Poursardar |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 338-339 |
| Number of pages | 2 |
| ISBN (Electronic) | 9781665417709 |
| State | Published - 2021 |
| Event | 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 - Virtual, Online, United States; Duration: Sep 27 2021 → Sep 30 2021 |
Publication series
| Name | Proceedings of the ACM/IEEE Joint Conference on Digital Libraries |
|---|---|
| Volume | 2021-September |
| ISSN (Print) | 1552-5996 |
Conference
| Conference | 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 |
|---|---|
| Country/Territory | United States |
| City | Virtual, Online |
| Period | 09/27/21 → 09/30/21 |
Funding
Support for this research was provided by an Interagency Agreement with the National Institute of Environmental Health Sciences (AES 16002-001) and the U.S. Department of Energy at Oak Ridge National Laboratory.
Keywords
- classification
- natural language processing
- relevance tagging
- scholarly communication
- weak supervision