AI in AppSec
October 23, 2024

One Year of Using LLMs for Application Security: What We Learned

Hi, I’m Ken Johnson, Co-founder and CTO at DryRun Security. If you are unfamiliar with DryRun Security, our product finds the needle in the haystack of code changes, so AppSec teams spot unknown risks before they start.

I wanted to take a moment to reflect on our journey with using LLMs for application security. There have been plenty of skeptics who doubted whether AI and LLMs could reliably detect vulnerabilities in source code, and unsurprisingly, many of those voices came from traditional SAST vendors. As with any emerging technology, our experience hasn’t been without its challenges, and navigating the bumps along the way has been part of the process.

Over the past year, we at DryRun Security have immersed ourselves in leveraging large language models (LLMs) for application security. Our exploration of how best to use LLMs to assess risk in software development has been a challenging, yet enlightening, journey. In this post, I’ll share key insights from our experience, including the obstacles we encountered and the lessons we learned.

Why LLMs for Application Security?

Traditionally, identifying vulnerabilities in software has relied heavily on code scanning techniques. These methods typically involve parsing source code into an Abstract Syntax Tree (AST), building call graphs to map function interactions, and searching for specific patterns or naming conventions. 
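
For concreteness, here’s a minimal sketch of the kind of exact, name-based rule this approach ultimately boils down to. It uses Python’s ast module; the rule set and function name are illustrative, not any particular scanner’s implementation.

```python
import ast

# An illustrative rule: flag calls to a fixed set of risky function names.
# This is the kind of exact pattern match traditional scanners are built on.
SUSPICIOUS_CALLS = {"eval", "exec", "system"}

def find_suspicious_calls(source: str) -> list[tuple[int, str]]:
    """Parse source into an AST and report calls matching the fixed pattern."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            # Handle both bare names (eval(...)) and attributes (os.system(...)).
            name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name))
    return findings

print(find_suspicious_calls("import os\nos.system('ls ' + user_input)\n"))  # [(2, 'system')]
```

A rule like this only fires on names and shapes it already encodes, no matter how sophisticated the call graph built around it may be.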

Therein lies the crux of our issues. 

At the end of the day, these tools are still looking for fairly exact patterns. More nuanced flaws, the kind that take some level of intelligence to spot, simply fall outside what yesterday’s approach can detect.

Some of the common complaints with legacy tools include:

  1. High noise-to-signal ratio
  2. Difficulty adapting to new technology stacks
  3. Inability to detect complex or nuanced issues
  4. Lack of context for changes in code
  5. Designed primarily for security experts, not developers
  6. Slow performance

Our hypothesis was that LLMs, known for their proficiency in text summarization and apparent understanding of code intent and behavior, could help overcome these limitations. 

With LLMs, we believed we could go beyond exact pattern matching and detect a wider range of nuanced security issues in code. 

We also believed we could build a product that spoke to developers the way humans do, one that helped them learn while keeping their code safe.

How Did It Go?

Initially, our results were incredibly underwhelming. We encountered several challenges, including:

  1. Inconsistent quality across LLMs: Not all LLMs are equally effective at code analysis. Many performed poorly, delivering unreliable results.
  2. Inconsistent outputs: Even LLMs that excel at analyzing code often struggled to produce consistent results, which is critical for security assessments.
  3. Privacy and security concerns: Balancing the need for secure analysis against privacy concerns and the cost of running LLMs required a delicate tradeoff. We explored various options: hosting OSS models was prohibitively expensive, OpenAI was a non-starter for many of the organizations we talked to, and any solution we purchased needed to come with strong privacy and security guarantees.
  4. Training LLMs isn’t a silver bullet: While training models on specific datasets seemed like a potential solution, it came with drawbacks, including lock-in to a particular LLM, the complexity of maintaining the training process, and limits on how quickly we could add support for new technology stacks.

A crucial realization was that success with LLMs hinges on how they are used. 

In an ideal world, we’d send code changes to a well-trained LLM and receive accurate, consistent insights about potential issues—complete with suggestions on how to fix them. 

Unfortunately, the reality is more complicated. LLMs can’t simply be fed code and expected to identify vulnerabilities on their own. Instead, they require a shift in approach and careful integration with existing tools and processes.

Key Lessons We Learned

While we’re still learning, here are some of the most important lessons we’ve discovered about using LLMs for application security:

  1. Choose the right LLM for the task. Different LLMs excel at different things. Sometimes you need a model specialized in embeddings; other times, you need one that can perform written tasks well or understand code deeply. Matching the right LLM to the specific job is critical.
  2. Ask the right questions. Treat LLMs like human code reviewers—broad or vague questions will yield unreliable answers. For example, asking “Does this code have SQL injection?” might result in an uncertain or incomplete response. Instead, the questions need to be specific, concrete, and backed by context. Don’t expect LLMs to solve complex issues on their own—break down your queries and use tuning techniques to guide the LLM effectively.
  3. LLMs don’t have all the answers, but they can learn. While training an LLM can help provide the necessary answers, it comes with the risk of vendor lock-in and a labor-intensive process. A more flexible approach is Retrieval-Augmented Generation (RAG), where you can quickly build a knowledge base without being tied to a specific LLM. This method also allows for more dynamic and scalable solutions. (A minimal retrieval sketch follows this list.)
  4. Robust testing is essential. Anytime you modify code, update your knowledge base, or switch to a different LLM, you need thorough testing in place. Without strong tests, you risk compromising the security insights you’ve worked so hard to generate. (Check out our detailed article on how we test LLMs here).
  5. LLMs excel at summarizing behavior. One area where LLMs truly shine is their ability to summarize the behavior of code. With the right setup, they can provide a clear, high-level understanding of what code is doing, which can be incredibly useful for spotting behavioral anomalies.
  6. Combining deterministic and probabilistic methods works best. While LLMs excel in certain areas, we found that accuracy and speed improved significantly when we combined deterministic and probabilistic methods. For example, using deterministic techniques to identify whether a specific library is present in a codebase provided useful context for the LLM. This context, when fed into the LLM for probabilistic analysis, helped the model perform more effectively. By blending both methods, we were able to leverage the strengths of each and reduce uncertainty. (A small example of this pairing appears after this list.)
  7. Agent-based execution enhances LLM performance. One of our biggest breakthroughs was realizing the effectiveness of using LLMs with an agent-based execution model. Instead of relying solely on single-shot question-and-answer interactions, we enabled the LLM to follow a series of steps, essentially mimicking a chain of thought. By giving the LLM access to external tools and documentation, as well as a structured process for gathering the information it needed, we saw a dramatic improvement in outcomes. This approach allowed the LLM to function more like a human code reviewer, providing deeper insights and more accurate analysis. (A stripped-down version of this loop is sketched after this list.)
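
To make lesson 3 a bit more concrete, here’s a minimal sketch of the retrieval step in a RAG setup: embed a small knowledge base of security guidance, pull the snippets most relevant to a code change, and prepend them to the prompt. The embed() and call_llm() helpers are placeholders for whichever embedding model and LLM you choose; none of the names here reflect our actual pipeline.

```python
import math

# Hypothetical knowledge base of security guidance snippets.
KNOWLEDGE_BASE = [
    "Parameterize SQL queries; never build them by string concatenation.",
    "Validate and canonicalize file paths before passing them to open().",
    "Treat any value from request headers or query params as untrusted input.",
]

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: call whichever LLM your pipeline uses."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank knowledge-base snippets by similarity to the code change.
    (A real system would embed the knowledge base once and store the vectors.)"""
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def review(diff: str) -> str:
    context = "\n".join(retrieve(diff))
    prompt = (
        "You are reviewing a code change for security issues.\n"
        f"Relevant guidance:\n{context}\n\n"
        f"Code change:\n{diff}\n\n"
        "List any concrete risks introduced by this change."
    )
    return call_llm(prompt)
```

The point is that the knowledge base can grow or change without retraining anything, which is what makes the approach portable across LLMs.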
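
For lesson 6, here’s a sketch of one way to pair a deterministic check with probabilistic analysis: a plain dependency-file lookup establishes which ORM the project declares, and that fact is handed to the LLM as context for the question we actually care about. The helper names and the call_llm() stub are illustrative assumptions, not our production code.

```python
from pathlib import Path

# Deterministic step: a simple, exact check for which ORM the project declares.
KNOWN_ORMS = ("sqlalchemy", "django", "peewee")

def detect_orm(repo_root: str) -> str | None:
    """Return the first known ORM listed in requirements.txt, if any."""
    req = Path(repo_root) / "requirements.txt"
    if not req.exists():
        return None
    text = req.read_text().lower()
    return next((orm for orm in KNOWN_ORMS if orm in text), None)

def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM backend you use."""
    raise NotImplementedError

# Probabilistic step: the deterministic fact becomes context in the prompt.
def analyze_change(repo_root: str, diff: str) -> str:
    orm = detect_orm(repo_root) or "no known ORM"
    prompt = (
        f"This project uses: {orm}.\n"
        "Given that, does the following change introduce raw SQL that bypasses "
        "the ORM's parameterization? Answer with specific lines and reasoning.\n\n"
        f"{diff}"
    )
    return call_llm(prompt)
```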
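
And for lesson 7, a stripped-down sketch of an agent-style loop: the model can ask for a tool (read a file, search guidance), we run it, feed the result back, and repeat until it gives a final answer. The JSON tool-call convention and the call_llm() stub are assumptions for illustration, not any specific framework’s API.

```python
import json
from pathlib import Path

def call_llm(messages: list[dict]) -> str:
    """Placeholder: send the conversation to your LLM and return its reply."""
    raise NotImplementedError

# Tools the agent may use while reviewing a change.
def read_file(path: str) -> str:
    return Path(path).read_text()

def search_docs(query: str) -> str:
    # Illustrative stub: look up internal security guidance.
    return f"(guidance snippets matching {query!r})"

TOOLS = {"read_file": read_file, "search_docs": search_docs}

def review_with_agent(diff: str, max_steps: int = 5) -> str:
    messages = [{
        "role": "user",
        "content": (
            "Review this change for security risk. If you need more information, "
            'reply with JSON like {"tool": "read_file", "arg": "app/models.py"}. '
            "Otherwise reply with your final assessment.\n\n" + diff
        ),
    }]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            request = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # Not a tool call, so treat it as the final answer.
        if not isinstance(request, dict) or "tool" not in request:
            return reply
        tool = TOOLS.get(request["tool"])
        result = tool(request["arg"]) if tool else "unknown tool"
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Tool result:\n{result}"})
    return "No conclusion reached within the step limit."
```

In practice the tool set, the stop conditions, and the prompt structure matter a great deal; the loop above is only the skeleton.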

Problematic LLMs We Encountered

While many LLMs show promise, we also encountered several that presented too many issues for us to adopt them in our workflow. Below is a list of some of the problematic LLMs we tested, along with the key shortcomings:

  1. CodeLlama: While CodeLlama showed promise in certain areas, it struggled with consistent code analysis, particularly in distinguishing between different input types such as few-shot prompts, context, and the code itself. Additionally, using CodeLlama with cloud services like SageMaker proved challenging, requiring significant effort in prompt formatting and modifying libraries to ensure embeddings were processed correctly.
  2. LLaMA: At the time of this writing, LLaMA 3.2 has been released, but we have only tested versions 2 and 3. These earlier versions demonstrated notable limitations in understanding and processing more complex code, making them less effective for in-depth analysis. However, LLaMA excelled in other areas, particularly in writing and summarizing text, which can be useful for generating high-level overviews.
  3. Mixtral: With Mixtral, we encountered problems with specific programming languages, and its results were highly inconsistent.
  4. Mistral: Mistral generated overly broad or irrelevant responses during security assessments.

This journey with LLMs has been filled with challenges, but the potential benefits for application security are immense. We’re excited to continue pushing the boundaries of what LLMs can do and to share our progress with you along the way.

Summary

Our journey with LLMs for application security has been both challenging and rewarding. While traditional static analysis tools have limitations, LLMs offer a new approach to code analysis that allows for greater nuance and insight into security risks. 

Through trial and error, we learned that success with LLMs requires the right combination of deterministic and probabilistic methods, agent-based execution, and robust testing frameworks. Although not all LLMs are well-suited for this task, those that excel can summarize code behavior and improve security outcomes when used correctly.

In case you were wondering, here’s an article on how we addressed privacy and security issues for our customers: How We Keep Your Code Safe at DryRun Security

As we continue to push the boundaries of what’s possible with AI in security, we’re excited to share more insights on how we scale and refine these methods in our next post. Stay tuned and follow us on LinkedIn to receive more updates.

Thanks for reading this far! If you’re interested in seeing how we leverage LLMs to find risk before it gets merged, then I’d recommend checking out our 3-min demo video or setting up a 1:1 personalized demo with our team.