
Open source AI models may be appealing to developers, but there are still plenty of complex risks to assess.

Many generative AI models claim to be open source when they really aren’t. One Cornell paper describes the confusion around AI model restrictions as “open washing.” The US government is scratching its head, weighing the risks of open LLMs. Adding to the murkiness are the many lawsuits over alleged copyright infringement by AI model owners, still working their way through various courts.

“Many ‘open’ models don’t disclose what’s in the training set, and they may be using factually wrong, biased, or copyright-restricted data,” says Mike Lieberman, the cofounder and CTO of software supply chain security company Kusari. “If that’s the case, the vendor may bear the liability for the material, but in a truly open model, the consumer may be liable.”

The open source AI dilemma is more than academic – the definition and inherent risks could have ripple effects on any developers using the latest AI technologies.

Defining open source AI

Defining open source AI is more complex than it seems. The IP isn’t just in the model’s source code – it’s also the training data, processing, weights, configurations, and more. This has caused widespread confusion around what actually constitutes open source in the AI era. 

Nathan Lambert, a machine learning scientist at the Allen Institute for AI (AI2), acknowledges the complexity of defining open source large language models (LLMs) and the confusion in the market. Although work is in progress, the industry doesn’t yet have a universally accepted definition of ‘open source’ AI.

That hasn’t stopped the Open Source Initiative (OSI) from attempting to reach a universal definition of open source AI. The latest draft-in-progress, v. 0.0.8, says open source AI must make the source code, model parameters, and information about the training data free to use, study, modify, and share. Interestingly, the current OSI definition leaves the actual training data sets as an optional component.

“For me, everything has to be included, that includes the data, for it to be called open source,” says Marcus Edel, machine learning lead at Collabora. And ideally, the datasets and integrations behind the AI should be safe to use as well. “For true open-source AI, the LLM and the platform must be freely available and must consist of only content that is open source, creative commons, expressly approved or liberally licensed material,” adds Paul Harrison, senior security engineering lead at Mattermost.

For others, the definition should go beyond source code to consider not only the testing processes and training data, but also the openness around community participation. “To me, it means collaborating,” says Roman Shaposhnik, cofounder of Ainekko and the IP Paperclip Project. For him, true open source communities accept a wide range of contributors, as opposed to being supported by a single entity.

Many purportedly open source foundation models are fundamentally at odds with traditional OSS practices, given they use a closed development process. “They are developed centrally, typically within a single organization, using their proprietary data and compute power,” write Nathan Benaich and Alex Chalmers from the venture capital firm Air Street. “The core contributors are limited to a select group.”

Is it risky?

Engineering leaders should proceed with caution with any AI model that claims to be open source. While both open and proprietary models can suffer from hallucination, at least open models can be inspected. 

The difficulty comes with purportedly open source LLMs that aren’t fully transparent, where training data and weights are often unavailable. For instance, one study ranked over 40 fine-tuned LLMs and found that very few projects are truly open source. “Similar to traditional software, an AI project can claim to be open source while not meeting the accepted definition,” says Lieberman.

Stefano Maffulli, executive director of OSI, points to Eleuther.ai as the golden example of a purely open source AI initiative, “with an intention to progress the science without a business model behind it,” he says. Less open models often have hidden provisions in their legal licenses that inhibit certain use cases. For instance, the license for Meta’s Llama 2 restricts usage by any organization with 700 million or more monthly active users. Other licenses explicitly prohibit using AI for illegal activities, which can vary widely country by country.

The gray area between proprietary and truly open source AI is where the risk lies, says Shaposhnik. “When push comes to shove, the risks are not that high on your side, it’s higher on the side of the vendor supplying the AI,” he says. “Vendors will protect you from lawsuits.” To his point, major AI lawsuits so far tend to be against companies supplying AI, not the consuming developers. Yet, you open yourself up to the same risks if you’re fine-tuning your own AI or using open components without clear data provenance, he says.

According to AI2’s Lambert, the risk depends. “Open source AI is a strategic lever, but there are different risks from open and closed models for security, cost, and speed.” Open-sourcing an AI model can help a company understand its users and the limitations of its pipeline. For instance, Lambert points to Command R+, a 104-billion-parameter model built for retrieval-augmented generation (RAG) that is on par with GPT-4 or Llama, which Cohere opened up to the praise of some developers. However, like other ‘open’ LLMs, Command R+ is only for non-commercial use cases.

Other potential risks are not necessarily unique to open source AI. For instance, privacy concerns and IP leakage are associated with LLM-based threats in general. Also, open source software supply chain issues are common, yet there’s a lack of oversight into what OSS packages developers include in their applications. In the context of open models, these issues could materialize as training data poisoning or corrupted project dependencies.

“While open source AI seeks to address key issues with security and compliance risks, engineering leadership must still be cautious for a number of reasons, including accuracy, supply chain protections, and licensing challenges,” says Mattermost’s Harrison.

Potential benefits of transparent AI

On the other hand, open source grants benefits beyond just being free to download or fork. First, working in the open tends to improve security, since more eyes can spot bugs or vulnerabilities in a project. 

Similarly, this transparency can help address AI’s ethical shortcomings, or reduce errors. “Using open source AI allows an organization to address biases and incorrect information in the training data in a way that proprietary models don’t allow,” says Kusari's Lieberman. 

Collaboration can also reduce computing duplication, helping to mitigate the environmental effects of machine learning.

“From a risk perspective, I would say open source AI is actually safer in the sense that public eyeballs and scrutiny on these models are a natural enforcement mechanism,” says Leonard Tang, cofounder of Haize Labs.

Tips for evaluating open source AI

Engineering leaders are responsible for carefully controlling the AI they consume.

First, check that the tool is actually open. “Leadership has to verify that their organization’s use of the AI meets the AI’s license terms and doesn’t impose any undesired requirements on the project,” says Lieberman. As part of this, carefully check the AI’s license. “Definitely look at the licenses around weights and parameters,” adds Maffulli. “Trust but verify.”
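That verification can start before a model is ever downloaded. Below is a minimal sketch in Python that queries a model’s metadata on the Hugging Face Hub via the huggingface_hub library and compares any declared license against an allow-list; the example repo ID and the allowed-license set are illustrative assumptions, and declared metadata should still be checked against the actual license text in the repository.

# A minimal sketch of a license pre-check before pulling a model from the
# Hugging Face Hub. The repo ID and the allowed-license set are illustrative
# assumptions, not recommendations.
from huggingface_hub import model_info

ALLOWED_LICENSES = {"apache-2.0", "mit", "bsd-3-clause"}  # whatever your legal team approves

def check_model_license(repo_id: str) -> bool:
    info = model_info(repo_id)
    # License metadata is usually exposed as a "license:<id>" tag on the repo.
    licenses = {t.split(":", 1)[1] for t in (info.tags or []) if t.startswith("license:")}
    if not licenses:
        print(f"{repo_id}: no license metadata found -- treat as unusable until reviewed")
        return False
    ok = licenses <= ALLOWED_LICENSES
    print(f"{repo_id}: license(s) {licenses} -> {'allowed' if ok else 'needs legal review'}")
    return ok

if __name__ == "__main__":
    check_model_license("mistralai/Mistral-7B-v0.1")  # example repo; substitute your own

Metadata can be missing or stale, so an empty or unexpected result should trigger a manual legal review rather than stand as a verdict.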

In addition to checking the license, engineers should check the data, especially if they need to know about personal information in the model, says Lambert. “Make sure you have all the components available and know where the data is coming from,” adds Maffulli. “There's nothing more dangerous than deploying systems you don't know the provenance of.”
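Where the training or fine-tuning data is actually available, even a crude automated pass can surface obvious problems with personal information. The sketch below assumes a hypothetical dataset repo with a “text” field and uses the Hugging Face datasets library to stream a small sample, flagging records with email- or phone-like strings; it is a spot check, not a substitute for a real privacy review.

# A rough spot-check for obvious personal data (emails, phone-like strings) in
# a dataset you plan to fine-tune on. The dataset name and the "text" field are
# assumptions; real PII review needs far more than regexes.
import re
from itertools import islice
from datasets import load_dataset

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b(?:\+?\d[\d\s().-]{7,}\d)\b")

stream = load_dataset("my-org/fine-tune-corpus", split="train", streaming=True)  # hypothetical repo
flagged = 0
for row in islice(stream, 1000):  # sample the first 1,000 records
    text = row.get("text", "")
    if EMAIL.search(text) or PHONE.search(text):
        flagged += 1
print(f"{flagged}/1000 sampled records contain email- or phone-like strings")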

Next, verify the model’s behavior. “Engineering teams must also closely monitor the accuracy of their AI results and educate the users of the AI to validate any results it provides,” says Harrison. When self-hosting generative AI deployments, he suggests using your own secure inbound data when possible.
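One lightweight way to keep that monitoring honest is a pinned “golden set” of prompts with expected answers, rerun on every model swap or fine-tune. The sketch below is deliberately generic: query_model is a placeholder to be wired to whatever inference endpoint you actually host, and the two test cases are illustrative.

# A minimal regression harness for model behavior: a handful of prompts with
# expected substrings, rerun on every model upgrade. query_model() is a stub --
# connect it to your own inference endpoint.
GOLDEN_SET = [
    {"prompt": "What year did the Apollo 11 moon landing occur?", "must_contain": "1969"},
    {"prompt": "What is the capital of France?", "must_contain": "Paris"},
]

def query_model(prompt: str) -> str:
    """Call your self-hosted model here (vLLM, Ollama, etc.)."""
    raise NotImplementedError("connect this to your inference endpoint")

def run_golden_set() -> None:
    failures = []
    for case in GOLDEN_SET:
        answer = query_model(case["prompt"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(case["prompt"])
    if failures:
        raise SystemExit(f"{len(failures)} golden-set prompts regressed: {failures}")
    print(f"All {len(GOLDEN_SET)} golden-set checks passed")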

Like any other open source software, vet for compliance with security practices to avoid software supply chain risks. “There are a lot of open AI models today that aren’t following these practices, and consumers should be cautious,” says Lieberman. Prefer AI models that disclose provenance and provide ways to reproduce the model training over more opaque alternatives.
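Basic artifact hygiene translates directly from traditional OSS: before loading downloaded weights, verify them against a digest published by the maintainers. A minimal sketch, with a placeholder file path and digest:

# A basic supply-chain check: verify a downloaded model artifact against a
# digest published by the maintainers before loading it. The file path and
# expected digest below are placeholders.
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"  # from the release notes

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

artifact = Path("models/model.safetensors")
actual = sha256_of(artifact)
if actual != EXPECTED_SHA256:
    raise SystemExit(f"Checksum mismatch for {artifact}: {actual}")
print(f"{artifact} matches the published digest")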

Lastly, evaluate how collaborative the project is. “You are always betting on the community more than the technology,” he says. Is it actually a community-driven project? What does the governance framework look like? These answers could affect the long-term stability of the technology and whether or not you want to include it in your project.

Use open source AI, but govern it carefully

A July 2024 report from the US National Telecommunications and Information Administration (NTIA) concludes that open source AI poses a myriad of risks. “Making the weights of certain foundation models widely available could also engender harms and risks to national security, equity, safety, privacy, or civil rights through affirmative misuse, failures of effective oversight, or lack of clear accountability mechanisms,” says the report. That said, the US government has not restricted the use of open source AI, taking a “cautious yet optimistic” approach. This is a good rule of thumb for developers to follow.

Yet, most technology leaders will agree there is a pressing need for better governance of generative AI, including oversight of open source AI projects. “I absolutely believe we need more governance for AI,” says Harrison. 

However, the answer shouldn't be to make unfunded demands on open source communities. “Instead, we need to give open source maintainers the knowledge and tools required to incorporate security practices into their projects,” says Lieberman. He points to tools like OpenSSF’s GUAC project to help developers understand their supply chains. LLMs themselves could even aid this type of inspection (see the Guac-AI-mole project, for example).

Seeking: clarity for open source AI

Concerns around the provenance of data, licensing stipulations, and supply chain issues aren’t going away for AI developers. For now, trust but verify, and stay abreast of industry efforts to define and govern open source models. “We need a lot more clarity,” says Shaposhnik.