Large Language Models (LLMs) don’t have a good reputation when it comes to data security and data privacy. You may have heard of private data being leaked in ChatGPT, e.g., data from customer A showing up in chat responses at customer B. This rightfully worries a lot of business users. How can we make sure that we use LLM technology in a safe way, compliant with European law?
The good news is that LLMs are not inherently insecure. A large model will not automatically start leaking private data by design. Such risks are caused by the way a model service provider handles and operates the model. The two most important data security and data privacy risks in an LLM context are:
The model “leaks” private or confidential data from one customer context to another. This happens, for example, when the shared base model or instruction-tuned model is automatically trained with input data coming from customers. A prompt by customer B could trigger a model completion that cites verbatim data of customer A in this case. Other leaks could be caused by bad technical architecture, such as the use of a generic cache that doesn’t respect customer context.
The provider stores customer data longer than necessary. This risk is usually related to the way a model provider operates its infrastructure and stores the data that customers send to the model. If the provider simply takes the customer input and stores everything indefinitely (e.g., to use for model training), there is a substantial risk that personal data is retained for a long time. This can lead to GDPR issues when personal data has to be removed from all systems, including those of subcontractors.
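Both risks above can be mitigated at the architecture level. As a minimal sketch (not production code; all names are illustrative), a response cache can make the customer context part of every cache key, so one customer's completion can never be served to another, and can attach a TTL to every entry so customer input is never kept indefinitely:

```python
import time
from hashlib import sha256


class CustomerScopedCache:
    """Toy cache addressing both risks above: cache keys include the
    customer ID (no cross-customer leaks), and entries expire after a
    TTL (no indefinite storage of customer input)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, completion)

    def _key(self, customer_id: str, prompt: str) -> str:
        # The customer ID is part of the key, so customer B can never
        # receive a completion that was cached for customer A.
        return sha256(f"{customer_id}\x00{prompt}".encode()).hexdigest()

    def put(self, customer_id, prompt, completion, now=None):
        ts = time.time() if now is None else now
        self._store[self._key(customer_id, prompt)] = (ts, completion)

    def get(self, customer_id, prompt, now=None):
        ts = time.time() if now is None else now
        entry = self._store.get(self._key(customer_id, prompt))
        if entry is None:
            return None
        stored_at, completion = entry
        if ts - stored_at > self.ttl:
            # Expired: purge instead of serving stale personal data.
            del self._store[self._key(customer_id, prompt)]
            return None
        return completion
```

A real system would enforce the same two properties in its storage layer and training pipelines, not only in the cache.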
Note that in this blog article, we only look at the data security and privacy risks. Other risks such as hallucinations, harmful content, and toxicity, which relate to the “three H's” (helpful, harmless, honest), are out of scope for this post.
Let’s start with the data leak risk. As we explained in our previous blog post Large Language Models and Machine Translation: the era of hyper-personalization, fine-tuning the LLM with your own data is a great idea: it will increase the quality of the LLM output significantly. You just have to make sure that you are the only one who can use these customizations, which means that a shared base or instruction-tuned model should never be customized with confidential or private customer data!
As the name indicates, Large Language Models are really big, and it would be prohibitively expensive to customise a complete LLM for each of our customers. Fortunately, there are more efficient techniques.
At LanguageWire, we use PEFT with LoRA and in-context learning. These techniques let us keep a customer-neutral base model and load the customizations at run-time or in the prompt. This way, we keep customer data separated at both the physical and the logical level. There is no chance that the data of customer A could be leaked to customer B: depending on who is making the request, the model only has access to the data relevant to that customer.
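The isolation logic can be sketched in a few lines. This is a simplified toy model, not our production code: the registry and function names are hypothetical, and the "adapter" is just a placeholder string where a real system would load LoRA weights. The point is that an adapter is resolved strictly from the authenticated customer ID, so there is no code path from one customer's request to another customer's customizations:

```python
class AdapterRegistry:
    """Toy illustration: one shared base model, plus per-customer LoRA
    adapters that are resolved only via the authenticated customer ID."""

    def __init__(self):
        self._adapters = {}  # customer_id -> adapter (opaque here)

    def register(self, customer_id: str, adapter):
        self._adapters[customer_id] = adapter

    def resolve(self, customer_id: str):
        # A request can only ever see the adapter registered for its
        # own customer ID; unknown customers get an error, never a
        # fallback to someone else's adapter.
        if customer_id not in self._adapters:
            raise KeyError(f"no adapter for {customer_id}")
        return self._adapters[customer_id]


def generate(base_model: str, registry: AdapterRegistry,
             customer_id: str, prompt: str) -> str:
    adapter = registry.resolve(customer_id)
    # In production this would activate the LoRA adapter on the shared
    # base model and run generation; here we just tag the output to
    # show which combination served the request.
    return f"[{base_model}+{adapter}] {prompt}"
```

With a library such as Hugging Face PEFT, the `resolve` step would map to loading or activating the customer's adapter on the shared base model at request time.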
This is illustrated in Figure 1.
The blue parts are the software components and data that are shared between various customer requests. The parts in other colours are customer-specific. This clearly shows the strong separation of customer data.
With this setup, we can guarantee our customers that we will never expose data from one customer to another. Problem solved!
To make sure we don’t run into compliance risks, the LanguageWire engineering team follows a clear set of guidelines related to LLMs and other AI models.
We enforce these guidelines by being very careful about our operations and our infrastructure. LanguageWire uses two approaches to run LLMs.
In the first approach, the LLM is fully operated by LW System Reliability Engineers and Product Engineers. This means that we provision and configure all of the necessary infrastructure resources (GPUs, networking, model storage, containers, etc.) ourselves. No third parties are involved, which gives us full control over all operational aspects. This makes it easy to comply with our guidelines, so we don’t run into legal compliance risks. This approach works well for cases where we use open-source base LLMs like Llama 2 or Falcon with 7 to 10 billion parameters. In the future, we may also support running bigger models.
That brings us to the second approach. For some complex use cases, we may need base models that are not open source or that are too big to run on self-provisioned infrastructure. In that case, we look for a managed model service that complies with our guidelines. LanguageWire is very strict in its model provider selection: we want to be sure the provider supports our stringent security and compliance requirements. A good example is Google’s managed LLM service for PaLM 2. PaLM 2 is a big, complex model that performs well in multilingual use cases, which makes it very interesting for our industry. Google guarantees EU infrastructure locality, zero data storage, and full encryption, and offers customisation options with PEFT/LoRA (all in early preview)! A good match, and safe to use for our customers.
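The EU locality requirement is largely a configuration concern on the client side. As a hedged sketch (the project ID is a placeholder, and the calls are from Google's Vertex AI Python SDK as it existed during the PaLM 2 preview), pinning the client to an EU region keeps prompt processing on EU infrastructure:

```python
import vertexai
from vertexai.language_models import TextGenerationModel

# Placeholder project ID; "europe-west4" is an EU (Netherlands) region,
# so requests to the model are processed on EU infrastructure.
vertexai.init(project="my-gcp-project", location="europe-west4")

# "text-bison" was the PaLM 2 text model identifier on Vertex AI.
model = TextGenerationModel.from_pretrained("text-bison")
response = model.predict("Translate 'hello' into Danish.")
print(response.text)
```

The contractual guarantees (zero data storage, encryption) come from the provider's terms, not from this snippet; the region pin only covers where requests are processed.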
We can conclude that it’s possible to operate LLMs in a secure way that also respects compliance regulations. It is definitely not trivial to do this, but the LW engineering team has implemented the infrastructure, policies, and software to enable it. Moreover, the LLM operations are fully integrated with the other parts of our secure tech ecosystem. This means LW can guarantee our customers the highest level of security and data privacy in the industry for the full end-to-end process, including when LLMs are used in the delivery chain.