Weak Supervision for Scientific Document Relevance Tagging

Drahomira Herrmannova, Chathika Gunaratne, Vickie Walker, Andrew Rooney, Robert Patton, Mary Wolfe, Charles Schmitt

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are only available for limited subject domains. In this work, we investigate the possibility of weakly supervised data generation for developing relevance models. We approach this by generating document, query, and label triples in an automated manner and by using this data to create a training set for a classification model. Published documents were sampled from an open access repository, and the concepts appearing in these documents were used as queries. We use the location of occurrence of each query concept within a document to determine the relevance label. We find that a classification model trained on this synthetic data can learn to tag documents according to their relevance to a query surprisingly well, providing an 11% f-score improvement over a model trained on ground truth data.
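The triple-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which locations map to which labels, so the `LOCATION_LABELS` mapping, the label names, the `Document` fields, and the plain substring match are all assumptions made for this example.

```python
from dataclasses import dataclass

# ASSUMPTION: the abstract only says the location of a concept determines the
# label. This ordered mapping (title > abstract > body) is hypothetical.
LOCATION_LABELS = [
    ("title", "high"),
    ("abstract", "medium"),
    ("body", "low"),
]


@dataclass(frozen=True)
class Document:
    """Hypothetical document record with three text locations."""
    title: str
    abstract: str
    body: str


def weak_label(doc: Document, concept: str) -> str:
    """Return the label of the first location in which the concept occurs.

    Matching is a naive case-insensitive substring test, used here only to
    keep the sketch short; it will also match inside longer words.
    """
    needle = concept.lower()
    for field, label in LOCATION_LABELS:
        if needle in getattr(doc, field).lower():
            return label
    return "none"


def make_triples(docs, concepts):
    """Build (document, query, label) triples for every document/concept pair."""
    return [(doc, concept, weak_label(doc, concept))
            for doc in docs
            for concept in concepts]


if __name__ == "__main__":
    example = Document(
        title="Arsenic exposure in drinking water",
        abstract="We study health outcomes of contaminated wells.",
        body="Lead was also measured in some samples.",
    )
    for _, query, label in make_triples([example], ["arsenic", "lead", "benzene"]):
        print(query, label)
```

Triples produced this way could then serve as the training set for a classifier that takes a document and a query as input and predicts the label. How the resulting labels compare with expert judgments depends entirely on the mapping chosen, which is why it is flagged as an assumption here.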

Original language: English
Title of host publication: Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Editors: J. Stephen Downie, Dana McKay, Hussein Suleman, David M. Nichols, Faryaneh Poursardar
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 338-339
Number of pages: 2
ISBN (Electronic): 9781665417709
DOIs
State: Published - 2021
Event: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 - Virtual, Online, United States
Duration: Sep 27 2021 to Sep 30 2021

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2021-September
ISSN (Print): 1552-5996

Conference

Conference: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Country/Territory: United States
City: Virtual, Online
Period: 09/27/21 to 09/30/21

Funding

Support for this research was provided by an Interagency Agreement with the National Institute of Environmental Health Sciences (AES 16002-001) and the U.S. Department of Energy at Oak Ridge National Laboratory.

Funders (funder number in parentheses where given):
U.S. Department of Energy
National Institute of Environmental Health Sciences (AES 16002-001)
Oak Ridge National Laboratory

Keywords

• classification
• natural language processing
• relevance tagging
• scholarly communication
• weak supervision
