João Fiadeiro • 2023-07-28
LLMs can be powerful tools in the arsenal of security researchers.
Web3 has seen over $4.7 billion lost to hacks since 2020, causing major harm to users, protocols, and the industry's reputation. Comprehensive auditing could prevent many issues, but remains prohibitively expensive at up to $100k per audit. Even audited protocols get breached as bugs slip through initial reviews or code changes later introduce new vulnerabilities.
Unlike traditional web2, web3 poses distinct security risks due to its immutable, exposed, and financially-motivated nature. Smart contracts are public and cannot be easily patched once deployed, allowing vulnerabilities to persist indefinitely. Further, the prominence of cryptocurrencies creates strong financial incentives for hackers, who can steal funds directly rather than target user data. With billions lost to bugs and scant resources for comprehensive auditing, the industry urgently needs tools to analyze contracts, generate test cases, and assist overburdened auditors. LLMs present an opportunity to amplify human efforts in securing web3 protocols and protecting users.
While (currently) unable to independently identify novel exploits, LLMs can already help in several ways: analyzing and summarizing contracts, surfacing suspicious areas for human review, and generating unit tests.
To demonstrate the potential of LLMs in auditing, I built a proof-of-concept tool called Robocop using Claude 2. It analyzes contracts, surfaces suspicious areas, generates unit tests, and summarizes its findings. While basic, this prototype achieved promising results detecting seeded vulnerabilities, showcasing how LLMs could enhance audits today and unlock further innovation through open models and fine-tuning.
With hackers constantly finding new ways to exploit contracts, the industry urgently needs more efficient tools to secure protocols. This article will highlight how LLMs are poised to accelerate audits and unit test generation, reducing risks for users and making web3 safer. Although human auditors cannot yet be replaced, LLMs present an opportunity to amplify their efforts. By open sourcing Robocop, I hope to move the community forward in adopting LLMs to better secure smart contracts.
Recent papers demonstrate that machine learning models show increasing promise for assisting with smart contract auditing, but they still require significant human review to filter false positives. Models have improved vulnerability detection accuracy, yet manual oversight remains necessary for the foreseeable future.
One line of work focuses on better knowledge generation for vulnerability detection. Models struggle with complex contract semantics like global variables, but new techniques incorporate global variable analysis to improve results. Another approach is multimodal learning combining code and graph features, which shows state-of-the-art accuracy. This allows tailored strategies for each vulnerability type, outperforming black-box methods.
Additionally, heterogeneous graph learning methods extract semantic features between nodes, achieving strong results for line-level vulnerability detection. Multi-task learning jointly detects vulnerabilities and identifies types, improving over single-task models in accuracy, efficiency, and scalability.
In summary, combining semantic analysis, multimodal learning, graph methods, and multi-task training pushes the boundaries of ML-assisted auditing. However, human auditors are still essential to validate findings. Future directions could explore AI agents to explain results or adapt models to new smart contract languages beyond Solidity like Rust and Move. As blockchain ecosystems grow, developing language-agnostic techniques will be critical.
The key is striking the right balance between automation and human oversight. While ML models are unlikely to fully replace auditors soon, they present opportunities to accelerate and strengthen the process if thoughtfully integrated. Recent advances lay the foundation, but more research and tools like Robocop are needed to scale up adoption in practice.
AuditWizard
AuditWizard provides an integrated auditing platform aimed at streamlining the workflow for auditors, developers, and security engineers. It combines multiple capabilities including visualization, manual review, and ML-powered insights into a single web-based IDE. Key features include auto-generated audit reports, integration with bug bounty/contest platforms like Code4rena and Hats Finance, and developer-focused pre-audit capabilities.
AuditWizard consolidates necessary security tools and leverages AI to surface findings for human validation. For example, automated report generation allows auditors to easily compile results by adding notes throughout review. The tool also enables developers to self-audit code before deployment by making security analysis more accessible. Overall, AuditWizard demonstrates the potential for LLMs to enhance productivity by integrating insights across a unified auditing workflow.
Metatrust
Metatrust utilizes a hybrid approach combining LLMs with program analysis for automated vulnerability detection in smart contracts. Their proprietary threat model covers 12 known vulnerability categories including issues in DeFi, access control, cryptography, and compilers.
The Metatrust static analyzer tool applies over 100 predefined rules to surface potential bugs in code for manual review. This showcases the ability of LLMs to ingest patterns and learn to flag suspicious constructs. However, human auditors are still needed to validate, interpret, and contextualize the findings. Metatrust exemplifies how LLMs can be blended with traditional program analysis techniques to augment detection capabilities.
Olympix
Olympix embeds LLM guidance directly into developer workflows by scanning code as it is written and suggesting security fixes. This enables developers to shore up vulnerabilities early before deployment. Olympix also offers an advanced anomaly detector combining statistical methods, static analysis, and LLM insights to uncover subtle bugs that may be missed by standard audits.
Early results suggest Olympix could have prevented major exploits such as the $200 million Euler Finance hack. By surfacing issues in real-time during development, Olympix demonstrates how thoughtfully integrating LLM assistants could strengthen prevention and make security fundamentally more proactive.
Salus
Salus is building LLM-powered tools focused on automated pre-audit code analysis and end-to-end security reviews. The company recently received funding to scale its AI assistant technology for smart contract auditing. While details are sparse, Salus seems to be pursuing a full platform approach leveraging LLMs for automated vulnerability detection, exploit generation, test case creation, and report summarization.
The level of true automation versus human augmentation remains to be seen. But Salus shows the appetite for commercial LLM security products and the range of capabilities being explored, including pre-audit, full review, and post-audit assistance. Their efforts will provide further validation of how impactful LLMs can be in enhancing real-world auditing workflows.
Based on numerous conversations with security researchers, the excitement about LLMs centers on automating the parts of the workflow that lend themselves to it. Much of a security engineer's job is to understand the code, run a robust set of tools (e.g., Slither or Semgrep for static analysis), and then analyze the findings. Tools like static analysis engines tend to have a low signal-to-noise ratio: a lot of the engineers' valuable time goes into parsing the output and finding needles in the haystack. For that reason, I think the approach taken by the companies above is the right one: integrate security analysis into the Dev(Sec)Ops workflow and make it a continuous process.
To be clear, I don’t think the goal should be to create a magic tool that produces an audit report. Security audits are high-stakes and we need to have a human-in-the-loop to verify the outputs, assess the impact, and refine the mitigation strategies.
But… wouldn't it be fun to try?
I wanted to get my hands dirty with LLMs and Langchain, so I figured I might as well give it a shot.
UI - Streamlit
Robocop's user interface is built using Streamlit, an open-source Python framework for creating beautiful, performant web apps for machine learning and data science. Streamlit allows for rapidly building an intuitive frontend to collect parameters and display outputs from the underlying LLM engine. With Streamlit, the Robocop UI could be constructed with just a few lines of simple Python code rather than weeks of traditional web development. This enabled quick iterative prototyping and a focus on the core LLM capabilities rather than UI implementation details. Overall, Streamlit provided an ideal framework to create a polished, usable interface with minimal effort.
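To give a sense of how little code this takes, here is a minimal sketch of what such a Streamlit page can look like (not Robocop's actual UI; the widget labels and the run_audit() helper are illustrative placeholders for the pipeline described below):

import streamlit as st

st.title("Robocop")

# Collect audit parameters from the user.
repo_url = st.text_input("GitHub repository URL")
bug_types = st.multiselect("Vulnerability types", ["reentrancy", "overflow/underflow"])

if st.button("Generate report") and repo_url:
    with st.spinner("Analyzing contracts..."):
        # run_audit() is a hypothetical stand-in for the LLM pipeline described below.
        report_md = run_audit(repo_url, bug_types)
    st.markdown(report_md)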
Dealing with LLMs - Langchain
Robocop leverages Langchain, an open-source library for easily integrating large language models into Python applications. Langchain provides a simple, production-ready API for prompting and parsing responses from LLMs like Claude 2 without hand-rolling request handling and other boilerplate. Key benefits are simplified prompting with automatic conversion to text, built-in error handling, support for conversational and batch modes, and seamless integration into any Python workflow. Langchain allowed the core LLM logic in Robocop to be written cleanly without reinventing the wheel on foundational components like prompting and response handling.
The secret sauce - Claude 2
Robocop utilizes Anthropic's Claude 2 as the underlying LLM engine. Claude 2 has a large context window, allowing it to ingest entire codebases to deeply understand context. This enables detecting subtle vulnerabilities arising from distant interactions. Claude 2 can also generate code based on instructions, facilitating test case and snippet generation. Most importantly: it’s free (right now).
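As a rough sketch of how these pieces fit together (the model name, temperature, and prompt are illustrative, and an ANTHROPIC_API_KEY is assumed to be set in the environment), calling Claude 2 through Langchain looks roughly like this:

from langchain.chat_models import ChatAnthropic
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Claude 2 via Langchain's Anthropic chat wrapper.
llm = ChatAnthropic(model="claude-2", temperature=0)

prompt = PromptTemplate(
    input_variables=["code"],
    template="List potential security issues in this Solidity code:\n<code>\n{code}\n</code>",
)

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(code="contract Foo { function bar() public {} }"))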
Let’s look at the Generate Report tool in Robocop. This tool takes one or more Solidity files and generates an “audit report” for the selected vulnerability types.
Step 1: Load the code from GitHub
Given a URL, Robocop fetches the code by crawling through a repo and selecting all Solidity files.
import tempfile

import streamlit as st
from langchain.document_loaders import GitLoader


def load_text(clone_url, project_name):
    # project_option and commit_branch are set by the Streamlit UI widgets.
    if project_option != "New project":
        project_name = project_option
    exists = check_if_dump_exists(project_name)
    if exists:
        # Reuse the pickled dump stored in Supabase instead of re-cloning the repo.
        data = get_github_files(project_name)
    else:
        with tempfile.TemporaryDirectory() as tmpdirname:
            loader = GitLoader(
                clone_url=clone_url,
                repo_path=tmpdirname,
                branch=commit_branch,
                file_filter=lambda file_path: file_path.endswith(".sol"),  # Solidity only
            )
            data = loader.load()
        save_github_files(data)
    st.session_state["raw_code"] = data
    return data
The tool saves the cloned repo to the cloud with Supabase so that we don't have to load the same files over and over again (and don't get rate-limited!). To keep things simple and avoid annoying issues with iterating over folders, I just take the whole folder and pickle it.
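The save/load helpers reduce to something like the sketch below. The pickling is the interesting part; upload_bytes() and download_bytes() are hypothetical stand-ins for the Supabase storage calls, and the signatures are simplified:

import pickle

def save_github_files(data, project_name):
    # Serialize the whole list of loaded documents as one blob to sidestep folder handling.
    blob = pickle.dumps(data)
    upload_bytes(f"{project_name}.pkl", blob)  # hypothetical Supabase storage upload

def get_github_files(project_name):
    blob = download_bytes(f"{project_name}.pkl")  # hypothetical Supabase storage download
    return pickle.loads(blob)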
Step 2: Have the user select the target contracts and vulnerabilities
In this paper, researchers provide a really useful model for classifying vulnerabilities, including a list of "bugs that can be detected using simple and general oracles and do not require an in-depth understanding of the code semantics":
L1: Reentrancy.
L2: Rounding issues or precision loss.
L3: Bugs that are caused by using uninitialized variables.
L4: Bugs that are caused by exceeding the gas limitation.
L5: Storage collision and confusion between proxy and implementation.
L6: Arbitrary external function call.
L7: Integer overflow and underflow.
L8: Revert issues caused by low-level calls or external libraries.
L9: Bugs that are caused by writing to memory that does not apply to the storage.
LA: Cryptographic issues.
LB: Using tx.origin.
Step 3: Prompt engineering
Now that we have some target code and the user has selected the vulnerabilities to look for, we need to build our prompts for our LLMs. This was by far the most fun part of this project.
The final prompt is actually constructed using a combination of hard-coded instructions and the inputs below, which themselves are constructed with LLMs as seen below:
{context}
{rules}
{severity criteria}
{Input 1: code summary} // Uses LLM
{Input 2: code}
{task}
{instruction}
{Input 3: vulnerability examples} // Uses LLM
{format}
Human: Summarize the code below (enclosed in the <code> tags)
and explain in bullet points what it does.
Write the response in markdown format starting with
`## Summary`
Code to be summarized:
<code>
{code}
</code>
Assistant:
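Wrapped in Langchain, the summary step ("Input 1" above) is just another chain. A minimal sketch, assuming the prompt above is stored in a string called SUMMARY_TEMPLATE, llm is the Claude 2 instance, and solidity_source holds the contract being audited:

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# SUMMARY_TEMPLATE is the Human/Assistant prompt shown above, with {code} as its only slot.
summary_prompt = PromptTemplate(input_variables=["code"], template=SUMMARY_TEMPLATE)
summary_chain = LLMChain(llm=llm, prompt=summary_prompt)

# The markdown summary gets interpolated into the main audit prompt later on.
summary = summary_chain.run(code=solidity_source)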
The examples are defined in examples.py. It's just an array of key-value pairs containing an example of "flawed" code first and an example of "fixed" code. For example:
{
"flawed": """contract IntegerOverflowMinimal {
uint public count = 1;
function run(uint256 input) public {
count -= input;
}
}
""",
"fixed" : """contract IntegerOverflowMinimal {
uint public count = 1;
function run(uint256 input) public {
count = sub(count,input);
}
//from SafeMath
function sub(uint256 a, uint256 b) internal pure returns (uint256) {
require(b <= a);//SafeMath uses assert here
return a - b;
}
}
"""
}
I then call generateExamples() on these pairs and put everything in a dictionary that contains several examples per vulnerability type:
VULNERABILITIES = {
"reentrancy" : {
"category" : "L1",
"description": """One of the major dangers of calling
external contracts is that they can take over the control
flow. In the reentrancy attack (a.k.a. recursive call
attack), a malicious contract calls back into the
calling contract before the first invocation of the
function is finished. This may cause the different invocations
of the function to interact in undesirable ways.""",
"examples" : generateExamples(prompts.examples.REENTRANCY_EXAMPLES)
},
...
}
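generateExamples() itself isn't shown here; a minimal sketch of what it might do is to render each flawed/fixed pair into a text block the LLM can use as a few-shot example (an illustrative implementation, not Robocop's exact code):

def generateExamples(example_pairs):
    # Render each {"flawed": ..., "fixed": ...} pair into a text block for the prompt.
    blocks = []
    for i, pair in enumerate(example_pairs, start=1):
        blocks.append(
            f"Example {i}\n"
            f"<flawed>\n{pair['flawed']}\n</flawed>\n"
            f"<fixed>\n{pair['fixed']}\n</fixed>"
        )
    return "\n\n".join(blocks)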
The task string is then constructed from a hard-coded prompt and the example vulnerabilities. We include the specific instruction for the LLM ("Analyze the code for {type} and find ALL vulnerabilities, no matter how small. Minimize false positives. Only report vulnerabilities you are sure about."), the examples, and instructions on how to format the response. I tell the LLM to output the response in the XML-like syntax below so I can parse it more easily.
Each vulnerability should follow the structure in <report></report>:
<report>
<vulnerability>
<description>Description of the vulnerability. Reference a code snippet containing the vulnerability.</description>
<severity>Refer to the severity framework in <severity_criteria></severity_criteria> and determine the severity score for the vulnerability identified.</severity>
<impact>Describe the impact of this vulnerability and explain the attack vector. Provide a comprehensive assessment with code examples.</impact>
<recommendation>Provide a solution to this vulnerability and how to mitigate it. Provide a fix in the code. Use backticks for any code blocks.</recommendation>
</vulnerability>
</report>
Step 3-and-a-half: Catching false positives with a second LLM
You'll notice in the code that there's another LLMChain I didn't mention above. As I built this project, I realized that one way to catch false positives is to create another "agent" that verifies the outputs. This generator-vs.-discriminator setup is a common pattern in ML (e.g., StyleGAN).
For each potential vulnerability identified, I pass the output into another prompt:
You are an expert security researcher with deep expertise in
Solidity contracts. Given a potential vulnerability and code,
it is your job to evaluate whether there is in fact a vulnerability
that is exploitable. Be as skeptical as possible and do not make
assumptions of correctness.
Vulnerability Report:
{report}
Associated code:
{code}
Review of the vulnerability report:
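Running this second pass is just another LLMChain. A sketch, assuming the reviewer prompt above is stored in a string called REVIEW_TEMPLATE and vulnerability_report is one of the findings from the first pass:

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

review_prompt = PromptTemplate(
    input_variables=["report", "code"],
    template=REVIEW_TEMPLATE,  # the skeptical-reviewer prompt shown above
)
review_chain = LLMChain(llm=llm, prompt=review_prompt)

# Ask the "discriminator" to second-guess each finding from the first pass.
review = review_chain.run({"report": vulnerability_report, "code": code})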
Step 4: Inference and parsing the responses
Now that we’ve constructed these prompts, we have to actually run the models. Langchain makes this super easy. Just instantiate an LLMChain and run it:
from langchain.chains import LLMChain

# Fill in the per-vulnerability task template with its description and few-shot examples.
formatted_task = prompts.USER_TEMPLATE_TASK.format(
    type=bug_type,
    description=prompts.VULNERABILITIES[bug_type]["description"],
    examples=prompts.VULNERABILITIES[bug_type]["examples"])

# Run the main audit prompt against the selected contract.
chain = LLMChain(llm=llm, prompt=prompts.USER_TEMPLATE_WITH_SUMMARY)
response = chain.run({
    "smart_contract_name": report,
    "summary": summary,
    "code": code,
    "task": formatted_task
})
Now I use a handy parser to extract the relevant data from my structured response.
from lxml import etree

# recover=True lets lxml tolerate slightly malformed XML in the LLM's response.
parser = etree.XMLParser(recover=True)
resp_parsed = etree.fromstring(response.strip(), parser=parser)
vulnerability_instance["description"] = vulnerability[0].text
Step 5: Generating a report
As we extract data from the response, I put everything into an output_txt string along with some markdown formatting. Streamlit offers a convenient st.download_button() that we can use to let the user download the report.
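Hooking the report up to a download is a one-liner (the label and filename are illustrative; output_txt is the report string built above, and st is the streamlit module imported earlier):

st.download_button(
    label="Download audit report",
    data=output_txt,
    file_name="robocop_report.md",
    mime="text/markdown",
)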
Robocop also offers users the ability to interactively explore the target codebase with a chat interface. To achieve this, Robocop uses some helpful components of Langchain:
By leveraging VectorStores, a ConversationalRetrievalChain, and an LLM, Langchain can answer questions in the context of an entire GitHub repository or generate new code. The workflow looks like this:
1. Load the repository with GitLoader (the same loader used for report generation).
2. Split the Solidity files into chunks with CharacterTextSplitter and compute embeddings for each chunk using OpenAIEmbeddings.
3. Upload the embeddings to Activeloop's Deep Lake, which acts as the vector store.
4. Answer the user's questions with a ConversationalRetrievalChain that pulls the most relevant chunks into the LLM's context.
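Put together, the indexing and chat steps look roughly like this (the dataset path is a placeholder, data is the list of documents loaded by GitLoader, llm is the Claude 2 instance, and OpenAI/Activeloop credentials are assumed to be configured):

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chains import ConversationalRetrievalChain

# Split the Solidity files loaded by GitLoader into chunks suitable for embedding.
docs = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(data)

# Embed the chunks and push them to a Deep Lake dataset.
db = DeepLake.from_documents(docs, OpenAIEmbeddings(), dataset_path="hub://<org>/robocop")

# Answer questions about the codebase with retrieved chunks as context.
qa = ConversationalRetrievalChain.from_llm(llm, retriever=db.as_retriever())
answer = qa({"question": "Where does the contract make external calls?", "chat_history": []})["answer"]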
So, does it work? Sometimes. False positives are a problem: on closer examination, the false positives reported are either straight-up hallucinations or, more interestingly, genuinely insecure code for which the exploit would be infeasible in practice. LLMs can hallucinate and are sometimes very confidently incorrect, which is a problem because it takes the reviewer a long time to verify that a finding is in fact a false positive.
However, Robocop was able to produce valid findings: across a test set of 50 bugs identified from the Web3Bugs repo, Robocop caught 9. These results are far from rigorous, but they suggest there may be something here. As I explained above, I don't think the goal should be a tool that magically generates a bullet-proof audit report, but I do think this kind of work can help security engineers do their work smarter and faster.
Interested in collaborating? Reach out!
Copyright © 2023 João Fiadeiro. All rights reserved.