TY - GEN
T1 - Weak Supervision for Scientific Document Relevance Tagging
AU - Herrmannova, Drahomira
AU - Gunaratne, Chathika
AU - Walker, Vickie
AU - Rooney, Andrew
AU - Patton, Robert
AU - Wolfe, Mary
AU - Schmitt, Charles
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are only available for limited subject domains. In this work, we investigate the possibility of weakly supervised data generation for developing relevance models. We approach this by generating document, query, and label triples in an automated manner and by using this data to create a training set for a classification model. Published documents were sampled from an open access repository, and the concepts appearing in these documents were used as queries. We use the location of occurrence of each query concept within a document to determine the relevance label. We find that a classification model trained on this synthetic data can learn to tag documents according to their relevance to a query surprisingly well, providing an 11% f-score improvement over a model trained on ground truth data.
KW - classification
KW - natural language processing
KW - relevance tagging
KW - scholarly communication
KW - weak supervision
UR - http://www.scopus.com/inward/record.url?scp=85124191459&partnerID=8YFLogxK
U2 - 10.1109/JCDL52503.2021.00060
DO - 10.1109/JCDL52503.2021.00060
M3 - Conference contribution
AN - SCOPUS:85124191459
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 338
EP - 339
BT - Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
A2 - Downie, J. Stephen
A2 - McKay, Dana
A2 - Suleman, Hussein
A2 - Nichols, David M.
A2 - Poursardar, Faryaneh
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Y2 - 27 September 2021 through 30 September 2021
ER -