Abstract
Understanding large-scale scientific software is a significant challenge because of its diverse codebase, extensive code length, and variety of target computing architectures. The emergence of generative AI, specifically large language models (LLMs), provides novel pathways for understanding such complex scientific codes. This paper presents S3LLM, an LLM-based framework designed to enable the examination of source code, code metadata, and summarized information in conjunction with textual technical reports in an interactive, conversational manner through a user-friendly interface. S3LLM leverages open-source LLaMA-2 models to enhance code analysis through the automatic transformation of natural language queries into domain-specific language (DSL) queries. In addition, S3LLM is equipped to handle diverse metadata types, including DOT, SQL, and customized formats. Furthermore, S3LLM incorporates retrieval-augmented generation (RAG) and LangChain technologies to directly query extensive documents. S3LLM demonstrates the potential of using locally deployed open-source LLMs for the rapid understanding of large-scale scientific computing software, eliminating the need for extensive coding expertise and thereby making the process more efficient and effective. S3LLM is available at https://github.com/ResponsibleAILab/s3llm.
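As a rough illustration of the retrieval step behind retrieval-augmented generation, the sketch below ranks document chunks against a question and assembles a prompt from the best match. It is not S3LLM's implementation: it uses plain bag-of-words cosine similarity in place of an embedding model, it does not call LangChain or LLaMA-2, and the names `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers introduced only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (assumed scoring scheme)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, chunks, k=1):
    """Place the retrieved chunks ahead of the question, ready to send to an LLM."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In a real deployment the string returned by `build_prompt` would be passed to a locally hosted model, and the similarity scoring would typically come from vector embeddings rather than word counts.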
Original language | English |
---|---|
Title of host publication | Computational Science – ICCS 2024 - 24th International Conference, 2024, Proceedings |
Editors | Leonardo Franco, Clélia de Mulatier, Maciej Paszynski, Valeria V. Krzhizhanovskaya, Jack J. Dongarra, Peter M. A. Sloot |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 222-230 |
Number of pages | 9 |
ISBN (Print) | 9783031637582 |
State | Published - 2024 |
Event | 24th International Conference on Computational Science, ICCS 2024 - Malaga, Spain; Duration: Jul 2 2024 → Jul 4 2024 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 14834 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 24th International Conference on Computational Science, ICCS 2024 |
---|---|
Country/Territory | Spain |
City | Malaga |
Period | 07/2/24 → 07/4/24 |
Funding
This manuscript has been authored by UT-Battelle, LLC under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- ChatGPT
- E3SM Land Model
- Large-Scale Scientific Software
- LLaMA
- LLM
- Research Software Analysis
- Retrieval Augmented Generation (RAG)