TY - JOUR
T1 - Automating Data Extraction From Scientific Literature and General PDF Files Using Large Language Models and KNIME
T2 - An Application in Toxicology
AU - Moreira-Filho, José Teófilo
AU - Ranganath, Dhruv
AU - Tieghi, Ricardo S.
AU - Patton, Robert
AU - Sutherland, Vicki
AU - Schmitt, Charles
AU - Rooney, Andrew A.
AU - Fostel, Jennifer
AU - Walker, Vickie R.
AU - Saddler, Trey
AU - Reif, David
AU - Mansouri, Kamel
AU - Kleinstreuer, Nicole
N1 - Publisher Copyright: © 2025 The Author(s). WIREs Computational Molecular Science published by Wiley Periodicals LLC.
PY - 2025/9/1
Y1 - 2025/9/1
AB - The large and steadily increasing volume of scientific publications presents a challenge in accessing and utilizing data due to their unstructured nature. Toxicology, in particular, depends on structured data from diverse study types for study evaluation, weight-of-evidence chemical assessments, and validation of new approach methodologies (NAMs). Manual data extraction is time- and labor-intensive. This work presents an automated data extraction workflow using large language models (LLMs) within the KNIME platform. The workflow integrates document parsing tools with LLMs to extract variables from scientific publications and general PDF files. Two execution modes are available: text mode and image mode. Text mode applies tools for extracting text and tables, while image mode uses multimodal LLMs to process non-linear layouts and graphical content. The workflow achieves 81.14% accuracy in text mode for scientific publications and up to 98.54% in image mode for general PDF files. The KNIME platform ensures accessibility through a user-friendly interface, allowing non-experts to use advanced data extraction methods. This automated approach facilitates toxicological research by improving the retrieval of structured data. By democratizing access to LLM-powered workflows, this approach paves the way for significant advancements in knowledge synthesis to support biomedical research. This article is categorized under: Data Science > Artificial Intelligence/Machine Learning; Data Science > Computer Algorithms and Programming; Data Science > Databases and Expert Systems.
KW - KNIME
KW - LLMs
KW - generative artificial intelligence
UR - https://www.scopus.com/pages/publications/105016457748
U2 - 10.1002/wcms.70047
DO - 10.1002/wcms.70047
M3 - Article
AN - SCOPUS:105016457748
SN - 1759-0876
VL - 15
JO - Wiley Interdisciplinary Reviews: Computational Molecular Science
JF - Wiley Interdisciplinary Reviews: Computational Molecular Science
IS - 5
M1 - e70047
ER -