Using LLMs to identify vulnerabilities in smart contracts

João Fiadeiro · 2023-07-28

LLMs can be powerful tools in the arsenal of security researchers.

The case for better security tooling

Web3 has seen over $4.7 billion lost to hacks since 2020, causing major harm to users, protocols, and the industry's reputation. Comprehensive auditing could prevent many issues, but remains prohibitively expensive at up to $100k per audit. Even audited protocols get breached as bugs slip through initial reviews or code changes later introduce new vulnerabilities.

Unlike traditional web2, web3 poses distinct security risks due to its immutable, exposed, and financially-motivated nature. Smart contracts are public and cannot be easily patched once deployed, allowing vulnerabilities to persist indefinitely. Further, the prominence of cryptocurrencies creates strong financial incentives for hackers, who can steal funds directly rather than target user data. With billions lost to bugs and scant resources for comprehensive auditing, the industry urgently needs tools to analyze contracts, generate test cases, and assist overburdened auditors. LLMs present an opportunity to amplify human efforts in securing web3 protocols and protecting users.

While (currently) unable to independently identify novel exploits, LLMs can help in several ways:

  • Parsing outputs from security tools: Static analyzers, fuzzers, and test generators produce massive outputs riddled with false positives. LLMs can help surface potentially problematic findings for human review. Their expanding context windows, now up to 100k tokens on models like Claude 2, allow analyzing entire codebases and potentially catching subtle bugs from distant interactions (a minimal triage sketch follows this list).
  • Accelerating test generation: LLMs can rapidly generate unit tests to achieve higher coverage and find edge cases, complementing existing fuzzing techniques. Tailored fine-tuning on domain data can further improve performance: for instance, Smart Contract VulnDB provides 26,000 examples of vulnerabilities in a relatively clean dataset. The industry standard is to publish audit reports openly, which means we have a lot of data at our disposal. Web3 protocols are open-source by default: we have the code and the audit findings ready to use to train new models.
  • Summarizing findings: LLMs can digest results from multiple tools and produce integrated summaries, reducing manual reporting needs.
  • Modular detection models: Recent multi-task learning models show promise detecting and classifying different bug types. Open models like Llama 2 enable community-led efforts to fine-tune specialized detectors on web3 data that’s crowd-sourced and maintained.
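
For instance, here is a minimal sketch (not part of Robocop) of using Claude 2 through Langchain to triage a static analyzer's JSON output; the file name and prompt wording are assumptions:

import json

from langchain.llms import Anthropic
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Hypothetical input: findings.json produced with Slither's --json flag.
with open("findings.json") as f:
    findings = json.load(f)

triage_prompt = PromptTemplate(
    input_variables=["findings"],
    template=(
        "Human: You are a smart contract auditor. Below is raw static-analysis "
        "output. List only the findings that plausibly indicate a real, "
        "exploitable vulnerability, with a one-sentence justification for each.\n\n"
        "<findings>\n{findings}\n</findings>\n\nAssistant:"
    ),
)

# Requires ANTHROPIC_API_KEY in the environment; truncated crudely to stay
# within the context window.
chain = LLMChain(llm=Anthropic(model="claude-2"), prompt=triage_prompt)
print(chain.run({"findings": json.dumps(findings)[:200_000]}))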

To demonstrate the potential of LLMs in auditing, I built a proof-of-concept tool called Robocop using Claude 2. It analyzes contracts, surfaces suspicious areas, generates unit tests, and summarizes its findings. While basic, this prototype achieved promising results detecting seeded vulnerabilities, showcasing how LLMs could enhance audits today and unlock further innovation through open models and fine-tuning.

With hackers constantly finding new ways to exploit contracts, the industry urgently needs more efficient tools to secure protocols. This article will highlight how LLMs are poised to accelerate audits and unit test generation, reducing risks for users and making web3 safer. Although human auditors cannot yet be replaced, LLMs present an opportunity to amplify their efforts. By open sourcing Robocop, I hope to move the community forward in adopting LLMs to better secure smart contracts.

Literature Review

Recent papers demonstrate that machine learning models show increasing promise for assisting with smart contract auditing, but their findings still require significant human review to filter false positives. Models have improved vulnerability detection accuracy, but manual oversight remains necessary for the foreseeable future.

One line of work focuses on better knowledge generation for vulnerability detection. Models struggle with complex contract semantics like global variables, but new techniques incorporate global variable analysis to improve results. Another approach is multimodal learning combining code and graph features, which shows state-of-the-art accuracy. This allows tailored strategies for each vulnerability type, outperforming black-box methods.

Additionally, heterogeneous graph learning methods extract semantic features between nodes, achieving strong results for line-level vulnerability detection. Multi-task learning jointly detects vulnerabilities and identifies types, improving over single-task models in accuracy, efficiency, and scalability.

In summary, combining semantic analysis, multimodal learning, graph methods, and multi-task training pushes the boundaries of ML-assisted auditing. However, human auditors are still essential to validate findings. Future directions could explore AI agents to explain results or adapt models to new smart contract languages beyond Solidity like Rust and Move. As blockchain ecosystems grow, developing language-agnostic techniques will be critical.

The key is striking the right balance between automation and human oversight. While ML models are unlikely to fully replace auditors soon, they present opportunities to accelerate and strengthen the process if thoughtfully integrated. Recent advances lay the foundation, but more research and tools like Robocop are needed to scale up adoption in practice.

Competitive Landscape

AuditWizard

AuditWizard provides an integrated auditing platform aimed at streamlining the workflow for auditors, developers, and security engineers. It combines multiple capabilities including visualization, manual review, and ML-powered insights into a single web-based IDE. Key features include auto-generated audit reports, integration with bug bounty/contest platforms like Code4rena and Hats Finance, and developer-focused pre-audit capabilities.

AuditWizard consolidates necessary security tools and leverages AI to surface findings for human validation. For example, automated report generation allows auditors to easily compile results by adding notes throughout review. The tool also enables developers to self-audit code before deployment by making security analysis more accessible. Overall, AuditWizard demonstrates the potential for LLMs to enhance productivity by integrating insights across a unified auditing workflow.

Metatrust

Metatrust utilizes a hybrid approach combining LLMs with program analysis for automated vulnerability detection in smart contracts. Their proprietary threat model covers 12 known vulnerability categories including issues in DeFi, access control, cryptography, and compilers.

The Metatrust static analyzer tool applies over 100 predefined rules to surface potential bugs in code for manual review. This showcases the ability of LLMs to ingest patterns and learn to flag suspicious constructs. However, human auditors are still needed to validate, interpret, and contextualize the findings. Metatrust exemplifies how LLMs can be blended with traditional program analysis techniques to augment detection capabilities.

Olympix

Olympix embeds LLM guidance directly into developer workflows by scanning code as it is written and suggesting security fixes. This enables developers to shore up vulnerabilities early before deployment. Olympix also offers an advanced anomaly detector combining statistical methods, static analysis, and LLM insights to uncover subtle bugs that may be missed by standard audits.

Early results suggest Olympix could have prevented major exploits such as the $200 million Euler Finance hack. By surfacing issues in real-time during development, Olympix demonstrates how thoughtfully integrating LLM assistants could strengthen prevention and make security fundamentally more proactive.

Salus

Salus is building LLM-powered tools focused on automated pre-audit code analysis and end-to-end security reviews. The company recently received funding to scale its AI assistant technology for smart contract auditing. While details are sparse, Salus seems to be pursuing a full platform approach leveraging LLMs for automated vulnerability detection, exploit generation, test case creation, and report summarization.

The level of true automation versus human augmentation remains to be seen. But Salus shows the appetite for commercial LLM security products and the range of capabilities being explored, including pre-audit, full review, and post-audit assistance. Their efforts will provide further validation of how impactful LLMs can be in enhancing real-world auditing workflows.

Building Robocop

Based on numerous conversations with security researchers, the excitement about LLMs centers on automating the parts of the workflow that lend themselves to it. Much of a security engineer's job is to understand the code, run a robust set of tools (e.g. Slither or Semgrep for static analysis), and then analyze the findings. Tools like static analysis engines tend to have a low signal-to-noise ratio: a lot of the engineers' valuable time goes into parsing the output and finding needles in the haystack. For that reason, I think the approach taken by the companies above is the right one: integrate security analysis into Dev(Sec)Ops workflows and make it a continuous process.

To be clear, I don’t think the goal should be to create a magic tool that produces an audit report. Security audits are high-stakes and we need to have a human-in-the-loop to verify the outputs, assess the impact, and refine the mitigation strategies.

But… wouldn't it be fun to try?

I wanted to get my hands dirty with LLMs and Langchain, so I figured I might as well give it a shot.

Components

UI - Streamlit

Robocop's user interface is built using Streamlit, an open-source Python framework for creating performant web apps for machine learning and data science. Streamlit allows for rapidly building an intuitive frontend to collect parameters and display outputs from the underlying LLM engine. With Streamlit, the Robocop UI could be constructed with just a few lines of simple Python code rather than weeks of traditional web development. This enabled quick iterative prototyping and a focus on the core LLM capabilities rather than UI implementation details. Overall, Streamlit provided an ideal framework to create a polished, usable interface with minimal effort.
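
To give a sense of how little code that takes, here is an illustrative sketch of the kind of Streamlit front end involved (run_robocop is a hypothetical entry point, not Robocop's actual function):

import streamlit as st

st.title("Robocop")
repo_url = st.text_input("GitHub repository URL")

if st.button("Analyze"):
    with st.spinner("Fetching contracts and running the analysis..."):
        report = run_robocop(repo_url)  # hypothetical: wraps the LLM pipeline described below
    st.markdown(report)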

Dealing with LLMs - Langchain

Robocop leverages Langchain, an open source library for easily integrating large language models into Python applications. Langchain provides a simple API for prompting and parsing responses from LLMs like Claude 2 with minimal boilerplate. Key benefits are simplified prompting with auto-conversion to text, built-in error handling, support for conversational and batch modes, and seamless integration into any Python workflow. Langchain allowed the core LLM logic in Robocop to be written cleanly without reinventing the wheel on foundational components like prompting and response handling.

The secret sauce - Claude 2

Robocop utilizes Anthropic's Claude 2 as the underlying LLM engine. Claude 2 has a large context window, allowing it to ingest entire codebases to deeply understand context. This enables detecting subtle vulnerabilities arising from distant interactions. Claude 2 can also generate code based on instructions, facilitating test case and snippet generation. Most importantly: it’s free (right now).

Generating an “Audit Report” automatically

Let’s look at the Generate Report tool in Robocop. This tool takes one or more Solidity files and generates an “audit report” for the selected vulnerability types.

Step 1: Load the code from Github

Given a URL, Robocop fetches the code by crawling through a repo and selecting all Solidity files.

def load_text(clone_url, project_name):
    # `project_option`, `tmpdirname`, and `commit_branch` come from the
    # Streamlit UI state defined elsewhere in the app.
    if project_option != "New project":
        project_name = project_option
    # Reuse a previously saved dump of the repo if one exists, otherwise
    # clone it and keep only the Solidity files.
    exists = check_if_dump_exists(project_name)
    if exists:
        data = get_github_files(project_name)
    else:
        loader = GitLoader(
            clone_url=clone_url,
            repo_path=tmpdirname,
            branch=commit_branch,
            file_filter=lambda file_path: file_path.endswith(".sol")
        )
        data = loader.load()
        save_github_files(data)
    st.session_state["raw_code"] = data
    return data

The tool saves the cloned repo to the cloud with Supabase so that we don't have to load the same files over and over again (and don't get rate-limited!). To make things easier and avoid annoying issues with iterating over folders, I just take the whole folder and pickle it.
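
For illustration, the save/load helpers referenced in load_text() might look roughly like this with supabase-py and pickle; the bucket name and exact signatures are assumptions, not Robocop's actual code:

import os
import pickle

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
BUCKET = "robocop-dumps"  # hypothetical storage bucket

def save_github_files(data, project_name):
    # Pickle the loaded documents and upload them as a single blob.
    supabase.storage.from_(BUCKET).upload(f"{project_name}.pkl", pickle.dumps(data))

def check_if_dump_exists(project_name):
    files = supabase.storage.from_(BUCKET).list()
    return any(f["name"] == f"{project_name}.pkl" for f in files)

def get_github_files(project_name):
    return pickle.loads(supabase.storage.from_(BUCKET).download(f"{project_name}.pkl"))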

Step 2: Have the user select the target contracts and vulnerabilities

In this paper, researchers provide a really useful model to classify vulnerabilities, including a nice list of “bugs that can be detected using simple and general oracles and do not require an in-depth understanding of the code semantics”:

L1: Reentrancy.
L2: Rounding issues or precision loss.
L3: Bugs that are caused by using uninitialized variables.
L4: Bugs that are caused by exceeding the gas limitation.
L5: Storage collision and confusion between proxy and implementation.
L6: Arbitrary external function call.
L7: Integer overflow and underflow.
L8: Revert issues caused by low-level calls or external libraries.
L9: Bugs that are caused by writing to memory that does not apply to the storage.
LA: Cryptographic issues.
LB: Using tx.origin.
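
Robocop exposes this taxonomy as a selection widget. A sketch of how that might look in Streamlit (BUG_CLASSES is an illustrative name, not Robocop's actual data structure):

import streamlit as st

BUG_CLASSES = {
    "L1": "Reentrancy",
    "L2": "Rounding issues or precision loss",
    "L3": "Uninitialized variables",
    "L4": "Exceeding the gas limitation",
    "L5": "Storage collision / proxy vs. implementation confusion",
    "L6": "Arbitrary external function call",
    "L7": "Integer overflow and underflow",
    "L8": "Revert issues from low-level calls or external libraries",
    "L9": "Writing to memory instead of storage",
    "LA": "Cryptographic issues",
    "LB": "Using tx.origin",
}

selected = st.multiselect(
    "Vulnerability classes to scan for",
    options=list(BUG_CLASSES),
    format_func=lambda key: f"{key}: {BUG_CLASSES[key]}",
)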

Step 3: Prompt engineering

Now that we have some target code and the user has selected the vulnerabilities to look for, we need to build our prompts for our LLMs. This was by far the most fun part of this project.

The final prompt is constructed from a combination of hard-coded instructions and the inputs below, some of which are themselves built with LLMs:

{context}
{rules}
{severity criteria}
{Input 1: code summary} // Uses LLM
{Input 2: code}
{task}
    {instruction}
    {Input 3: vulnerability examples} // Uses LLM
    {format}
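
For concreteness, here is a rough sketch of what the assembled template might look like as a Langchain PromptTemplate; the real one lives in prompts.py as USER_TEMPLATE_WITH_SUMMARY (used in Step 4 below), and the wording of the hard-coded sections here is an assumption:

from langchain.prompts import PromptTemplate

USER_TEMPLATE_WITH_SUMMARY = PromptTemplate(
    input_variables=["smart_contract_name", "summary", "code", "task"],
    template="""Human: You are an expert security auditor reviewing {smart_contract_name}.
<rules>... hard-coded analysis rules ...</rules>
<severity_criteria>... hard-coded severity framework ...</severity_criteria>

<summary>
{summary}
</summary>

<code>
{code}
</code>

{task}

Assistant:""",
)

The inputs referenced in the skeleton are built as follows: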

  • Input 1: Code summary - given some code, produce a summary of it so that we can provide it in the final prompt.
    Human: Summarize the code below (enclosed in the <code> tags) 
    and explain in bullet points what it does. 
    Write the response in markdown format starting with 
    `## Summary`
    
    Code to be summarized:
    <code>
    {code}
    </code>
    
    Assistant:
  • Input 2: Code - I look up the code itself that Robocop just fetched from Github.
  • Input 3: Vulnerability Examples
    • The task prompt includes the name of the vulnerability and some examples. For each vulnerability, I manually generated a set of examples which can be found in examples.py. It’s just an array of key-value pairs containing an example of “flawed” code first, and an example of “fixed” code. For example:
    {
        "flawed": """contract IntegerOverflowMinimal {
            uint public count = 1;
    
            function run(uint256 input) public {
                count -= input;
            }
        }
        """,
        "fixed" : """contract IntegerOverflowMinimal {
            uint public count = 1;
    
            function run(uint256 input) public {
                count = sub(count,input);
            }
    
            //from SafeMath
            function sub(uint256 a, uint256 b) internal pure returns (uint256) {
                require(b <= a);//SafeMath uses assert here
                return a - b;
            }
        }
        """
    } 
    • We then construct a string of examples using generateExamples() (a minimal sketch of this helper appears at the end of this step) and put everything in a dictionary that contains several examples per vulnerability type:
    VULNERABILITIES = {
        "reentrancy": {
            "category": "L1",
            "description": """One of the major dangers of calling
                external contracts is that they can take over the control
                flow. In the reentrancy attack (a.k.a. recursive call
                attack), a malicious contract calls back into the
                calling contract before the first invocation of the
                function is finished. This may cause the different invocations
                of the function to interact in undesirable ways.""",
            "examples": generateExamples(prompts.examples.REENTRANCY_EXAMPLES)
        },
        ...
    }

    Now the task string is constructed using a hard-coded prompt and the example vulnerabilities. We include the specific instruction for the LLM (Analyze the code for {type} and find ALL vulnerabilities, no matter how small. Minimize false positives. Only report vulnerabilities you are sure about.), the examples, and instructions on how to format the response. I tell the LLM to output the response in this XML-like syntax so I can parse it more easily.

    Each vulnerability should follow the structure in <report></report>:
    <report>
        <vulnerability>
            <description>Description of the vulnerability. Reference a code snippet containing the vulnerability.</description>
            <severity>Refer to the severity framework in <severity_criteria></severity_criteria> and determine the severity score for the vulnerability identified.</severity>
            <impact>Describe the impact of this vulnerability and explain the attack vector. Provide a comprehensive assessment with code examples.</impact>
            <recommendation>Provide a solution to this vulnerability and how to mitigate it. Provide a fix in the code. Use backticks for any code blocks.</recommendation>
        </vulnerability>
    </report>
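
For reference, here is a minimal sketch of what the generateExamples() helper mentioned above might do; the exact formatting in Robocop's examples.py may differ:

def generateExamples(examples):
    # Concatenate each flawed/fixed pair into one string the task prompt can embed.
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"<example_{i}>\n"
            f"<flawed>\n{ex['flawed']}\n</flawed>\n"
            f"<fixed>\n{ex['fixed']}\n</fixed>\n"
            f"</example_{i}>"
        )
    return "\n".join(blocks)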

Step 3-and-a-half:

You’ll notice in the code that there’s another LLMChain that I didn’t mention above. As I built this project, I realized that one way to catch false positives is to create another “agent” that verifies the outputs. This generator-vs.-discriminator setup is a common pattern in ML (e.g. GANs such as StyleGAN).

For each potential vulnerability identified, I pass the output into another prompt:

You are an expert security researcher with deep expertise in 
Solidity contracts. Given a potential vulnerability and code, 
it is your job to evaluate whether there is in fact a vulnerability 
that is exploitable. Be as skeptical as possible and do not make 
assumptions of correctness.
                
Vulnerability Report:
{report}

Associated code:
{code}

Review of the vulnerability report:
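
Wiring that up is another small LLMChain. A rough sketch (REVIEW_TEMPLATE and candidate_findings are assumed names; llm and code come from the surrounding pipeline):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# REVIEW_TEMPLATE holds the skeptical-reviewer prompt above, with {report} and
# {code} as placeholders.
review_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(input_variables=["report", "code"], template=REVIEW_TEMPLATE),
)

for finding in candidate_findings:  # the parsed <vulnerability> entries from Step 4
    finding["review"] = review_chain.run({"report": finding["description"], "code": code})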

Step 4: Inference and parsing the responses

Now that we’ve constructed these prompts, we have to actually run the models. Langchain makes this super easy. Just instantiate an LLMChain and run it:

formatted_task = prompts.USER_TEMPLATE_TASK.format(
    type=bug_type,
    description=prompts.VULNERABILITIES[bug_type]["description"],
    examples=prompts.VULNERABILITIES[bug_type]["examples"])

chain = LLMChain(llm=llm, prompt=prompts.USER_TEMPLATE_WITH_SUMMARY)
response = chain.run({
    "smart_contract_name": report,
    "summary": summary,
    "code": code,
    "task": formatted_task
})

Now I use a handy parser to extract the relevant data from my structured response.

from lxml import etree

parser = etree.XMLParser(recover=True)
resp_parsed = etree.fromstring(response.strip(), parser=parser)
# `vulnerability` is one of the <vulnerability> elements iterated out of resp_parsed
vulnerability_instance["description"] = vulnerability[0].text

Step 5: Generating a report

As we extract data from the response, I put everything into an output_txt string along with some markdown formatting. Streamlit offers a convenient st.download_button() that we can use to allow the user to download the report.
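
For example, something along these lines (label and file name are arbitrary):

st.download_button(
    label="Download audit report",
    data=output_txt,
    file_name="robocop_report.md",
    mime="text/markdown",
)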

Improving code exploration with LLMs

Robocop also offers users the ability to interactively explore the target codebase with a chat interface. To achieve this, Robocop uses some helpful components of Langchain:

  • Data Connectors: used to retrieve snippets of code from a vector database.
  • Memory: used to ensure the chatbot has conversation history it can refer to.
  • Conversational RetrieverChain: a retrieval-focused system that interacts with the data stored in a VectorStore

By leveraging VectorStores, Conversational RetrieverChain, and an LLM, Langchain can answer questions in the context of an entire GitHub repository or generate new code. The workflow looks like this:

  • Index the codebase:
    • First, we clone the target GitHub repository containing the codebase we want our assistant to understand. This provides the raw materials for comprehension. We load all the files, break them into digestible chunks, and kick off the indexing process so our LLM can later search across this corpus.
    • This is done using GitLoader.
  • Embed and store snippets:
    • Next we embed these code snippets using a code-aware embedding model that understands programming semantics. These embedded representations capture structural and contextual information. We store the embeddings and snippets in a vector database for efficient retrieval.
    • After chunking the text using CharacterTextSplitter, we compute the embeddings using OpenAIEmbeddings then upload them to Activeloop’s Deep Lake.
  • Understand the query:
    • When a user asks a question, our LLM assistant first processes the query to grasp the context and extract key details that will help find relevant code. It picks up on nuanced intent and any terminology used.
  • Retrieve relevant snippets:
    • Leveraging the indexed embeddings, a conversational retriever searches through the vector store to identify code snippets most relevant to the question. Like a code search engine, it returns pertinent results.
    • We can play around here with different retrieval strategies, including the distance metric used (e.g. cosine similarity) and how many results to fetch (k).
  • Refine the model
    • We can customize retriever settings for our specific codebase and define any filters as needed to improve retrieval quality. The model gets tailored to the project.
  • Ask away!
    • Now we can have a conversation with our LLM assistant about the codebase, asking followup questions and driving an interactive dialogue. The assistant provides context-aware answers generated from the relevant snippets it identified, creating a smooth user experience.
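
Put together, a condensed sketch of that workflow with 2023-era Langchain APIs might look like this; the repository URL, Deep Lake dataset path, and model choices are placeholders and assumptions:

from langchain.document_loaders import GitLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatAnthropic
from langchain.chains import ConversationalRetrievalChain

# 1. Index the codebase (Solidity files only).
docs = GitLoader(
    clone_url="https://github.com/<org>/<repo>.git",
    repo_path="/tmp/robocop_repo",
    file_filter=lambda p: p.endswith(".sol"),
).load()
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(docs)

# 2. Embed and store snippets in Deep Lake (needs OPENAI_API_KEY and an Activeloop token).
db = DeepLake(dataset_path="hub://<org>/robocop-code", embedding_function=OpenAIEmbeddings())
db.add_documents(chunks)

# 3-5. Retrieval with a tunable distance metric and number of results.
retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["k"] = 10

# 6. Ask away.
qa = ConversationalRetrievalChain.from_llm(ChatAnthropic(model="claude-2"), retriever=retriever)
result = qa({"question": "Where are external calls made before state updates?", "chat_history": []})
print(result["answer"])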

So… Does it work?

Sometimes. False positives are a problem. On closer examination, the reported false positives are either outright hallucinations or, more interestingly, genuinely insecure code for which the exploit would be infeasible in practice. LLMs can hallucinate and are sometimes very confidently incorrect - that's a problem because it takes a long time for a reviewer to verify that a finding is in fact a false positive.

However, Robocop was actually able to surface valid vulnerabilities. Across a test set of 50 bugs drawn from the Web3Bugs repo, Robocop identified 9. These results are far from rigorous, but they suggest there may be something here. As I explained above, I don't think the goal should be to produce a tool that magically generates a bullet-proof audit report, but I do think this kind of work can help improve security engineers' workflows so they can do their work smarter and faster.

