Large Language Models (LLMs) don’t have a good reputation when it comes to data security and data privacy. You may have heard of private data being leaked in ChatGPT, e.g., data from customer A showing up in chat responses at customer B. This rightfully worries a lot of business users. How can we make sure that we use LLM technology in a safe way, compliant with European law?
The good news is that LLMs are not inherently insecure. A large model will not automatically start leaking private data by design. Such risks are caused by the way a model service provider handles and operates the model. The two most important data security and data privacy risks in an LLM context are:
The model “leaks” private or confidential data from one customer context to another. This happens, for example, when the shared base model or instruction-tuned model is automatically trained with input data coming from customers. A prompt by customer B could trigger a model completion that cites verbatim data of customer A in this case. Other leaks could be caused by bad technical architecture, such as the use of a generic cache that doesn’t respect customer context.
The provider stores customer data longer than necessary. This risk is usually related to the way a model provider operates its infrastructure and stores the data that customers send to the model. If the provider simply takes the customer input and stores everything indefinitely (e.g., to use for model training), there is a substantial risk that personal data is retained for a long time. This can lead to GDPR issues when personal data has to be removed from all systems, including those of subcontractors.
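Both risks above can be mitigated at the architecture level. As a minimal sketch (not production code; all names are illustrative), a response cache can make the customer context part of every cache key, so one customer's completion can never be served to another, and can attach a TTL to every entry so customer input is never kept indefinitely:

```python
import time
from hashlib import sha256


class CustomerScopedCache:
    """Toy cache addressing both risks above: cache keys include the
    customer ID (no cross-customer leaks), and entries expire after a
    TTL (no indefinite storage of customer input)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, completion)

    def _key(self, customer_id: str, prompt: str) -> str:
        # The customer ID is part of the key, so customer B can never
        # receive a completion that was cached for customer A.
        return sha256(f"{customer_id}\x00{prompt}".encode()).hexdigest()

    def put(self, customer_id, prompt, completion, now=None):
        ts = time.time() if now is None else now
        self._store[self._key(customer_id, prompt)] = (ts, completion)

    def get(self, customer_id, prompt, now=None):
        ts = time.time() if now is None else now
        entry = self._store.get(self._key(customer_id, prompt))
        if entry is None:
            return None
        stored_at, completion = entry
        if ts - stored_at > self.ttl:
            # Expired: purge instead of serving stale personal data.
            del self._store[self._key(customer_id, prompt)]
            return None
        return completion
```

A real system would enforce the same two properties in its storage layer and training pipelines, not only in the cache.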
Note that in this blog article, we only look at the data security and privacy risks. Other risks such as hallucinations, harmful content, and toxicity, which relate to the “three H's” (helpful, harmless, honest), are out of scope for this post.
Let’s start with the data leak risk. As we explained in our previous blog post Large Language Models and Machine Translation: the era of hyper-personalization, fine-tuning the LLM with your own data is a great idea: it will increase the quality of the LLM output significantly. You just have to make sure that you are the only one who can use these customizations, which means that a shared base or instruction-tuned model should never be customized with confidential or private customer data!
As the name indicates, Large Language Models are really big, and it would be prohibitively expensive to customise a complete LLM for each of our customers. Fortunately, there are more efficient techniques.
At LanguageWire, we use PEFT with LoRA and in-context learning. These techniques let us keep a customer-neutral base model and load the customizations at run-time or in the prompt. This way, we keep customer data separated at both the physical and the logical level. There is no chance that the data of customer A could be leaked to customer B: depending on who is making the request, the model only has access to the data relevant to that customer.
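The isolation logic can be sketched in a few lines. This is a simplified toy model, not our production code: the registry and function names are hypothetical, and the "adapter" is just a placeholder string where a real system would load LoRA weights. The point is that an adapter is resolved strictly from the authenticated customer ID, so there is no code path from one customer's request to another customer's customizations:

```python
class AdapterRegistry:
    """Toy illustration: one shared base model, plus per-customer LoRA
    adapters that are resolved only via the authenticated customer ID."""

    def __init__(self):
        self._adapters = {}  # customer_id -> adapter (opaque here)

    def register(self, customer_id: str, adapter):
        self._adapters[customer_id] = adapter

    def resolve(self, customer_id: str):
        # A request can only ever see the adapter registered for its
        # own customer ID; unknown customers get an error, never a
        # fallback to someone else's adapter.
        if customer_id not in self._adapters:
            raise KeyError(f"no adapter for {customer_id}")
        return self._adapters[customer_id]


def generate(base_model: str, registry: AdapterRegistry,
             customer_id: str, prompt: str) -> str:
    adapter = registry.resolve(customer_id)
    # In production this would activate the LoRA adapter on the shared
    # base model and run generation; here we just tag the output to
    # show which combination served the request.
    return f"[{base_model}+{adapter}] {prompt}"
```

With a library such as Hugging Face PEFT, the `resolve` step would map to loading or activating the customer's adapter on the shared base model at request time.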
This is illustrated in Figure 1.
The blue parts are the software components and data that are shared between various customer requests. The parts in other colours are customer-specific. This clearly shows the strong separation of customer data.
With this setup, we can guarantee our customers that we will never expose data from one customer to another. Problem solved!
To make sure we don’t run into compliance risks, the LanguageWire engineering team follows a clear set of guidelines related to LLMs and other AI models.
We enforce these guidelines by being very careful about our operations and our infrastructure. LanguageWire uses two approaches to run LLMs.
In the first approach, the LLM is fully operated by LW System Reliability Engineers and Product Engineers. This means that we provision and configure all of the necessary infrastructure resources (GPUs, networking, model storage, containers, etc.) ourselves. No third parties are involved, which gives us full control over all operational aspects. This makes it easy to comply with our guidelines, so we don’t run into legal compliance risks. This approach works well for cases where we use open-source base LLMs like Llama 2 or Falcon with 7 to 10 billion parameters. In the future, we may also support running bigger models.
That brings us to the second approach. For some complex use cases, we may need base models that are not open source or that are too big to run on self-provisioned infrastructure. In that case, we look for a managed model service that complies with our guidelines. LanguageWire is very strict in its model provider selection: we want to be sure the provider supports our stringent security and compliance requirements. A good example is Google’s managed LLM service for PaLM 2. PaLM 2 is a big, complex model that performs well in multilingual use cases, which makes it very interesting for our industry. Google guarantees EU infrastructure locality, zero data storage, and full encryption, and offers customisation options with PEFT/LoRA (all in early preview)! A good match, and safe to use for our customers.
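The EU locality requirement is largely a configuration concern on the client side. As a hedged sketch (the project ID is a placeholder, and the calls are from Google's Vertex AI Python SDK as it existed during the PaLM 2 preview), pinning the client to an EU region keeps prompt processing on EU infrastructure:

```python
import vertexai
from vertexai.language_models import TextGenerationModel

# Placeholder project ID; "europe-west4" is an EU (Netherlands) region,
# so requests to the model are processed on EU infrastructure.
vertexai.init(project="my-gcp-project", location="europe-west4")

# "text-bison" was the PaLM 2 text model identifier on Vertex AI.
model = TextGenerationModel.from_pretrained("text-bison")
response = model.predict("Translate 'hello' into Danish.")
print(response.text)
```

The contractual guarantees (zero data storage, encryption) come from the provider's terms, not from this snippet; the region pin only covers where requests are processed.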
We can conclude that it’s possible to operate LLMs in a secure way that also respects compliance regulations. It is definitely not trivial to do this, but the LW engineering team has implemented the infrastructure, policies, and software to enable it. Moreover, the LLM operations are fully integrated with the other parts of our secure tech ecosystem. This means LW can guarantee our customers the highest level of security and data privacy in the industry for the full end-to-end process, including when LLMs are used in the delivery chain.