Benchmarking Automated Theorem Proving with Large Language Models

Vanessa Lama, Catherine Ma, Tirthankar Ghosal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Theorem proving presents a significant challenge for large language models (LLMs) because formal proofs must be rigorously checked by proof assistants such as Lean, leaving no margin for error or hallucination. While existing LLM-based theorem provers attempt to operate autonomously, they often struggle with novel and complex theorems where human insight is essential. Lean Copilot is a novel framework that integrates LLM inference into the Lean proof assistant environment. In this work, we benchmark the performance of several LLMs, including general-purpose and math-specific models, for theorem proving using the Lean Copilot framework. Our initial investigation suggests that a general-purpose large model such as LLaMa-70B still has an edge over smaller math-specific models for the task under consideration. We provide useful insights into the performance of the different LLMs we chose for the task.
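
For readers unfamiliar with the framework, Lean Copilot exposes LLM inference as ordinary Lean 4 tactics, so model output appears inside the proof assistant itself rather than in a separate tool. Below is a minimal sketch of that workflow, assuming the LeanCopilot package is installed as a Lake dependency; the tactic names follow the project's public documentation, and the toy theorem is purely illustrative:

    import LeanCopilot

    -- `suggest_tactics` queries the underlying LLM for next-tactic
    -- suggestions at the current goal; suggestions are shown in the
    -- infoview, and the user replaces the call with a chosen tactic.
    example (a b : Nat) : a + b = b + a := by
      suggest_tactics

    -- `search_proof` combines LLM-generated tactic candidates with
    -- best-first search to try to close the goal fully automatically.
    example (a b : Nat) : a + b = b + a := by
      search_proof

Because every suggested tactic is re-checked by the Lean kernel before it is accepted, hallucinated steps are rejected rather than silently admitted, which is the property that makes this setting a demanding benchmark for LLMs.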

Original language: English
Title of host publication: NLP4Science 2024 - 1st Workshop on NLP for Science, Proceedings of the Workshop
Editors: Lotem Peled-Cohen, Nitay Calderon, Shir Lissak, Roi Reichart
Publisher: Association for Computational Linguistics (ACL)
Pages: 208-218
Number of pages: 11
ISBN (Electronic): 9798891761858
State: Published - 2024
Event: 1st Workshop on NLP for Science, NLP4Science 2024 - Miami, United States
Duration: Nov 16 2024 → …

Publication series

Name: NLP4Science 2024 - 1st Workshop on NLP for Science, Proceedings of the Workshop

Conference

Conference: 1st Workshop on NLP for Science, NLP4Science 2024
Country/Territory: United States
City: Miami
Period: 11/16/24 → …
