TY - GEN
T1 - SciTrust
T2 - 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024
AU - Herron, Emily
AU - Yin, Junqi
AU - Wang, Feiyi
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - This work presents SciTrust, a comprehensive framework for assessing the trustworthiness of large language models (LLMs) in scientific contexts, with a focus on truthfulness, accuracy, hallucination, and sycophancy. The framework introduces four novel open-ended benchmarks in Computer Science, Chemistry, Biology, and Physics, and employs a multi-faceted evaluation approach combining traditional metrics with LLM-based evaluation. SciTrust was applied to five LLMs, including one general-purpose and four scientific models, revealing nuanced strengths and weaknesses across different models and benchmarks. The study also evaluated SciTrust's performance and scalability on high-performance computing systems. Results showed varying performance across models, with Llama3-70B-Instruct performing strongly overall, while Galactica-120B and SciGLM-6B excelled among scientific models. SciTrust aims to advance the development of trustworthy AI in scientific applications and establish a foundation for future research on model robustness, safety, and ethics in scientific contexts. We have open-sourced our framework, including all associated scripts and datasets, at https://github.com/herronej/SciTrust.
AB - This work presents SciTrust, a comprehensive framework for assessing the trustworthiness of large language models (LLMs) in scientific contexts, with a focus on truthfulness, accuracy, hallucination, and sycophancy. The framework introduces four novel open-ended benchmarks in Computer Science, Chemistry, Biology, and Physics, and employs a multi-faceted evaluation approach combining traditional metrics with LLM-based evaluation. SciTrust was applied to five LLMs, including one general-purpose and four scientific models, revealing nuanced strengths and weaknesses across different models and benchmarks. The study also evaluated SciTrust's performance and scalability on high-performance computing systems. Results showed varying performance across models, with Llama3-70B-Instruct performing strongly overall, while Galactica-120B and SciGLM-6B excelled among scientific models. SciTrust aims to advance the development of trustworthy AI in scientific applications and establish a foundation for future research on model robustness, safety, and ethics in scientific contexts. We have open-sourced our framework, including all associated scripts and datasets, at https://github.com/herronej/SciTrust.
KW - High Performance Computing
KW - Large Language Models for Science
KW - Trustworthy AI
UR - http://www.scopus.com/inward/record.url?scp=85217172758&partnerID=8YFLogxK
U2 - 10.1109/SCW63240.2024.00017
DO - 10.1109/SCW63240.2024.00017
M3 - Conference contribution
AN - SCOPUS:85217172758
T3 - Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 72
EP - 78
BT - Proceedings of SC 2024-W
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 November 2024 through 22 November 2024
ER -