Background
When we set out to build DryRun Security, we had no idea we would use LLMs in any capacity, let alone put them at the heart of everything we do. The work itself, keeping software secure, is exactly why testing matters so much to us: the results of our product need to be accurate, consistent, and precise.
We need to be able to demonstrate that using an LLM-backed approach is more robust than the legacy Static Application Security Testing (SAST) tools.
The SAST tools you see today have been performing static analysis the same way for at least 26 years. While both the security and developer communities generally accept them as noisy and inaccurate, they set out to do their analysis in a deterministic way. In other words, legacy SAST tools are tremendously limited in what they can do, but they do it in a very controlled and repeatable way.
Today, although we believe this will change, that deterministic behavior earns existing SAST solutions more trust from the security community than LLMs, which use probabilistic methods to make their assertions and at times can appear to “make things up”.
We will be the first to admit that without proper testing and loads of work, LLMs often produce inconsistent and inaccurate answers. This is where proper testing really shines.
Testing our LLMs’ ability to produce consistent and accurate answers means we do not have to trust the system’s answers; we can verify them.
DryRun Security Components
Before we get into the meat of how we perform testing, it helps to understand the components that make up our product and what, specifically, we are testing in this post.
When GitHub pull requests are opened or updated, DryRun Security simultaneously runs a suite of analyzers that each perform some sort of evaluation of either pull request code OR other data points such as authorship, intent, or behavior. We call this Contextual Security Analysis or “CSA” and you can read more about it here.
In this article, we focus specifically on the testing behind our code analyzers. These analyzers work by asking a set of code-specific questions (we call this “Code Inquiry”) about the code that is changing; the responses then help our tool evaluate whether the code changes introduce a security vulnerability.
The Importance of Testing
However, we can do NONE of this without proper benchmarking and testing. A product that purports to leapfrog not only the competition but an entire class of tools needs to be AT LEAST as useful as those tools and then show additional value. It needs to be able to demonstrate this through comparisons and repeatable testing.
LLMs themselves are difficult to make repeatable by the very nature of the technology: they are designed to produce slightly different responses each time, largely to avoid sounding robotic. If you are curious to learn more, we recommend reading “What Is ChatGPT Doing … and Why Does It Work?” by Stephen Wolfram. This is why LLMs appear to fall down quickly when connected to your code base and asked to repeat the same evaluation multiple times. The first request might be great, the second okay, and by the third and fourth the LLM has already gone off the rails and responds with completely inaccurate results.
We are convinced that LLMs producing varying answers and factually incorrect information is a large part of why there is still hesitance amongst security technologists to embrace the technology. We get that, and of course it makes sense.
It took incredible amounts of experimentation and research to make LLMs behave in a consistent and accurate manner. It took so much work that we know most people securing software are unlikely to ever have the same time to invest in figuring out how to tame this technology so that it can be used for software security.
Without that significant investment of time and effort, the value just isn’t obvious. Luckily, we had the resources and the desire, and we found that not only could we make it work, it could do things that legacy SAST tools cannot.
However, once we had consistent and repeatable results, we needed to ensure those results did not change as we made improvements and modifications to these analyzers. When building around LLMs at production scale, one needs to account for the variances in results that come from changing even a couple of words in a system prompt, or from trivially small changes to few-shot prompt examples, RAG-backed context, variables, and metadata. Any slight change, especially at the scale and level of complexity we are talking about, can cause massive variances, and we need to be able to uncover those issues before shipping to customers.
Hopefully, dear reader, we’ve convinced you how important testing is to us and how important it should be to any organization building around LLMs.
Testing Overview
As previously mentioned, for the purposes of this article we will focus on testing the analyzers that perform code analysis. Each time one of these analyzers believes it has discovered a valid vulnerability in code, it anonymizes the data and sends it to an ephemeral data store in what we call a “code hunk” format.
When an analyzer needs to use these code hunks to perform testing, we pull them down locally into a git repository, sort them into their respective buckets, and point the analyzer to this local git repository via a configuration file.
When we run the analyzer’s tests, it uses the code hunks to determine whether it is performing correctly. If even one test fails, we know we’ve got work to do. We run these tests not once but multiple times to ensure consistency.
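As a rough illustration, a consistency check like this can be as simple as re-invoking the test suite several times and treating any disagreement as a failure. The script below is a minimal sketch under assumed names, not our actual tooling; the run count and the plain pytest invocation are assumptions.

```python
# Minimal sketch (not DryRun Security's actual tooling): re-run the analyzer's
# pytest suite several times and fail fast if any run disagrees.
import subprocess
import sys

RUNS = 5  # assumed number of repetitions

for run in range(1, RUNS + 1):
    result = subprocess.run([sys.executable, "-m", "pytest", "-q"])
    status = "passed" if result.returncode == 0 else "FAILED"
    print(f"consistency run {run}/{RUNS}: {status}")
    if result.returncode != 0:
        sys.exit(result.returncode)  # even one failing run means we have work to do
```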
When we say “sort into buckets” we mean that any given analyzer may ask several questions, and each question is expected to produce a boolean answer. For example, imagine a command injection analyzer. At a very basic level, it needs to know a few things:
- Is user supplied input present?
- Is that user input being placed into a system call?
- Is that system invocation vulnerable or being used in a vulnerable way?
This leads us to asking several questions about the code and backfilling the LLM with the information it needs in order to answer them. Each answer must come back as either true or false (a boolean).
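To make the shape of this concrete, here is a minimal sketch of how boolean code questions might be posed and parsed. The `ask_llm` callable, the question wording, and the parsing are illustrative placeholders, not our actual prompts or analyzer code.

```python
# Illustrative sketch of asking boolean "Code Inquiry"-style questions about a
# code hunk. `ask_llm` is a hypothetical stand-in for a real LLM call.
from typing import Callable

QUESTIONS = {
    "detect_user_input": "Is user-supplied input present in this change?",
    "analyze_system_call": "Is that user input placed into a system call?",
    "analyze_vuln": "Is the system invocation used in a vulnerable way?",
}

def evaluate_code_hunk(code_hunk: str, ask_llm: Callable[[str], str]) -> dict[str, bool]:
    """Ask each question about the hunk and coerce the answer to a boolean."""
    answers = {}
    for action, question in QUESTIONS.items():
        prompt = (
            f"{question}\n\nCode under review:\n{code_hunk}\n\n"
            "Answer strictly 'true' or 'false'."
        )
        raw = ask_llm(prompt).strip().lower()
        answers[action] = raw.startswith("true")
    return answers

# Example usage with a canned responder standing in for a real model:
if __name__ == "__main__":
    fake_llm = lambda prompt: "true"
    print(evaluate_code_hunk("os.system(request.args['cmd'])", fake_llm))
```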
In the above scenario, because we are asking three distinct questions, our test cases contain three separate folders, all listed under a specific vulnerability type. Imagine the following folder structure:
- command_injection
  - language
    - framework/library
      - detect_user_input
        - inbox
        - true
        - false
      - analyze_system_call
        - inbox
        - true
        - false
      - analyze_vuln
        - inbox
        - true
        - false
When we run our synchronize script, the anonymized code hunks are placed into the inbox folder under their respective vulnerability type and “action” (the type of action taken, e.g., detect_user_input).
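As a rough sketch of what that synchronization step could look like (the paths, file naming, and hunk metadata below are assumptions, not our actual script; intermediate language/framework folders are omitted for brevity):

```python
# Hypothetical sketch of a synchronize step: drop each anonymized code hunk
# into <vulnerability_type>/<action>/inbox inside the local test-cases repo.
from pathlib import Path

def sync_code_hunks(hunks: list[dict], test_cases_repo: Path) -> None:
    """Each hunk is assumed to carry its vulnerability type, action, and content."""
    for hunk in hunks:
        inbox = test_cases_repo / hunk["vulnerability_type"] / hunk["action"] / "inbox"
        inbox.mkdir(parents=True, exist_ok=True)
        (inbox / f"{hunk['id']}.hunk").write_text(hunk["content"])

# Example usage with a single made-up hunk:
sync_code_hunks(
    [{"id": "abc123", "vulnerability_type": "command_injection",
      "action": "detect_user_input", "content": "os.system(user_cmd)"}],
    Path("~/src/analyzer-test-cases").expanduser(),
)
```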
Once the code hunks have been placed into the inbox folder, we humans review where they belong and sort them into the correct true or false folder. We then run our tests from the analyzer’s code base, so in this case we would run them from the command injection analyzer’s code base. Each analyzer is built on our analyzer framework, so a developer only needs to point to the location of the test cases repo on their local machine and define the vulnerability category type (which directly corresponds to the vulnerability category folder name).
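For illustration, such a configuration might be no more than a repo path and a category name. The file layout and keys below are assumptions about the shape of the idea, not our actual configuration format.

```python
# Hypothetical analyzer test configuration: point at the local test-cases repo
# and name the vulnerability category folder to pull cases from.
from pathlib import Path
import yaml  # assumes PyYAML is available

EXAMPLE_CONFIG = """
test_cases_repo: ~/src/analyzer-test-cases
vulnerability_category: command_injection
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
test_cases_root = Path(config["test_cases_repo"]).expanduser() / config["vulnerability_category"]
print(f"Loading test cases from {test_cases_root}")
```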
When we’re confident in the work we did to sort things correctly in the local test cases git repository, we’ll submit a pull request to add relevant code hunks to our analyzer tests repo.
Because all of our code-specific analyzers are written in Python, we opted to use the pytest framework. We primarily use unit tests to ensure our knowledge base, which provides the LLM with all of the relevant information it needs, is in the correct format and in working order. We use integration tests to ensure the LLM responds correctly to each code hunk it evaluates.
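To give a feel for how such integration tests might look with pytest, here is a minimal sketch. The folder layout mirrors the true/false buckets described above, but the paths, the `.hunk` file extension, and the `analyzer` fixture are illustrative assumptions rather than our real framework.

```python
# Hypothetical pytest integration test: every code hunk sorted into a `true` or
# `false` folder must come back from the analyzer with the matching answer.
# Intermediate language/framework folders are omitted for brevity.
from pathlib import Path
import pytest

TEST_CASES = Path("~/src/analyzer-test-cases/command_injection").expanduser()

def collect_cases():
    """Yield (action, expected_bool, hunk_path) for every sorted code hunk."""
    if not TEST_CASES.exists():
        return
    for action_dir in sorted(p for p in TEST_CASES.iterdir() if p.is_dir()):
        for expected in ("true", "false"):
            for hunk in sorted((action_dir / expected).glob("*.hunk")):
                yield action_dir.name, expected == "true", hunk

@pytest.fixture
def analyzer():
    # Stand-in for the real LLM-backed analyzer; the analyzer framework would
    # supply the actual object under test.
    class FakeAnalyzer:
        def answer(self, action: str, code: str) -> bool:
            return True  # placeholder answer
    return FakeAnalyzer()

@pytest.mark.parametrize("action,expected,hunk_path", list(collect_cases()))
def test_analyzer_matches_sorted_bucket(action, expected, hunk_path, analyzer):
    assert analyzer.answer(action, hunk_path.read_text()) is expected
```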
Other relevant details to provide are:
- We have a tool to recreate pull requests from OSS repos so that we can test with live data in our staging environment
- We have the ability to run in production on silent mode so that when we make changes we can observe the effect without impacting customers
- We have multiple facets of observability in place to ensure that our tool is behaving as intended
In conclusion, while LLMs are potent tools, their default configurations often fall short of the accuracy and consistency required for reliable, critical analysis. Developers building around LLMs must not only invest in rigorous initial tuning to reach that standard but also establish comprehensive testing to maintain it over time.
Special thanks to Joshua on our engineering team for his work on designing the first version of our testing framework.
Try DryRun Security Yourself
Ready to get more out of your secure code review? Request a demo or try DryRun Security today free and see how our suite of analyzers can help you secure your code with confidence.
Explore more at DryRun Security and download our free Contextual Security Analysis Guide to learn more about our innovative approach to application security.