


Review Strategies for Multilingual Data in Technology-Assisted Review
The growing digital landscape presents increasing complexity for legal teams managing multilingual data during investigations and litigation. The expanding diversity of languages can impact the accuracy and efficiency of document review, particularly when training models for technology-assisted review (TAR) workflows. To navigate multilingual review effectively, case teams need a clear understanding of the tools and strategies available.
Simple Active Learning (SAL) continues to be the foundational, and often required, approach for many document reviews. Unlike generative AI, which relies on large language models (LLMs) for subject matter expertise, a SAL approach relies on a Subject Matter Expert (SME) to train a binary classification model. Training TAR models becomes more complex when multilingual data is involved, as the SMEs must both understand the legal issues at hand and be fluent in the languages present in the dataset. If SMEs lack fluency in one or more languages, case teams must contemplate additional steps to ensure proper handling of the data.
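For readers who want a concrete picture of what training a binary classification model with SME input looks like, below is a minimal sketch of one SAL training round in Python using scikit-learn. It is purely illustrative and not the workflow of any particular TAR platform; names such as sal_round and labeled_idx are hypothetical, and real platforms handle vectorization, batching, and validation internally.

```python
# Minimal sketch of one Simple Active Learning (SAL) round: a binary
# classifier is trained on SME-coded documents, then selects the documents
# it is least certain about for the SME to review next.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def sal_round(documents, labeled_idx, sme_labels, batch_size=50):
    """documents: list of text; labeled_idx: indices already coded by the SME;
    sme_labels: 1 = responsive, 0 = not responsive."""
    X = TfidfVectorizer(max_features=50_000).fit_transform(documents)

    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled_idx], sme_labels)

    # SAL typically routes the lowest-confidence documents (scores near 0.5)
    # back to the SME, unlike CAL, which prioritizes likely-responsive ones.
    unlabeled = np.setdiff1d(np.arange(len(documents)), labeled_idx)
    scores = model.predict_proba(X[unlabeled])[:, 1]
    next_batch = unlabeled[np.argsort(np.abs(scores - 0.5))[:batch_size]]
    return model, next_batch
```

In practice, rounds like this repeat until the model stabilizes and the validation metrics agreed with the requesting party are satisfied.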
Early Detection of Multilingual Data Drives Defensible TAR Review
Identifying whether custodians communicate in multiple languages is a critical step to help mitigate downstream risks, avoid costly rework, and ensure that review workflows are both accurate and defensible. Data collections often hold surprises, making early language identification essential for informed and effective review design. Custodian interviews help identify data sources and communication methods, but they often leave gaps. By proactively assessing the prevalence and scope of multilingual data within the review universe, a case team may lead an informed discussion about model training and review planning to align expectations with the requesting party, establish and control costs, and reduce risk.
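One practical way to perform that early assessment is to profile the language mix of the collection before designing the review. The snippet below is a minimal sketch using the open-source langdetect library; production workflows would more likely rely on the language identification built into the processing or review platform, and accuracy depends on the quality of the extracted text.

```python
# Minimal sketch: estimate the language mix of a collection so the case team
# can plan SME staffing, model training, and translation needs early.
from collections import Counter
from langdetect import detect

def language_profile(documents):
    """documents: list of extracted text strings; returns % by detected language."""
    counts = Counter()
    for text in documents:
        try:
            counts[detect(text)] += 1   # ISO 639-1 codes, e.g. 'en', 'de', 'ja'
        except Exception:               # empty or non-linguistic text
            counts["unknown"] += 1
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.most_common()}

# Example output: {'en': 72.4, 'de': 18.1, 'ja': 8.9, 'unknown': 0.6}
```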
Options for Multilingual Review Within TAR Workflows
When using a standard TAR workflow, there are two primary options for handling non-English content: a multilingual model or a language-specific model. If the review involves few languages with low data volume in each, a multilingual model may be a practical solution, preserving native languages within a single TAR model or translation workflow. Alternatively, if only certain languages are of interest and their volume is significant, it may be practical to consider a language-specific model.
Multilingual TAR Models: Key Considerations for Smarter Reviews
To assess whether a multilingual model approach is appropriate, case teams should consider SME availability and language fluency, data composition, and document volume to inform a cost-benefit analysis.
- Fluent SME Review: This option involves fluent SMEs reviewing documents in their respective languages. While it is the least controversial single-model option, sourcing a fluent SME with knowledge of the issues can be challenging. In addition, reviewing a sufficient number of training documents in each language may increase the total workload for the SMEs.
- Translation for SME Review: This option involves translating documents into English for SME review only, leaving the text in the model in the native language(s) and relying on the tool's language agnosticism. It carries the lowest upfront translation cost; however, if the volume of non-English content is low, it may be difficult to surface enough features for model training, and some context could be lost in translation.
- Translation for Model Training: This option involves translating documents into English and adding the translated text to the model for training. Using translated text in a TAR model can significantly lower the training burden by standardizing all documents to a single language, which simplifies feature extraction (illustrated in the sketch below). It also supports a more consistent review by letting SMEs evaluate documents in a familiar language, improving the overall efficiency and accuracy of the review. However, it carries the highest upfront translation cost and the same concern about context lost in translation.
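To illustrate why standardizing on a single language simplifies feature extraction, the sketch below contrasts the vocabulary a term-based model sees when equivalent documents stay in their native languages versus after translation. It is purely illustrative; translate_to_english is a stand-in for whatever machine-translation step the case team actually uses.

```python
# Illustrative only: a term-based TAR model treats "contract", "contrat",
# and "Vertrag" as unrelated features when documents stay in their native
# languages. Translating first collapses them into shared features, so
# training signals reinforce rather than fragment across languages.
from sklearn.feature_extraction.text import CountVectorizer

def translate_to_english(text):
    # Placeholder for whatever machine-translation step the case team uses.
    lookup = {
        "le contrat a été signé": "the contract was signed",
        "der Vertrag wurde unterzeichnet": "the contract was signed",
    }
    return lookup.get(text, text)

docs_native = [
    "the contract was signed",
    "le contrat a été signé",
    "der Vertrag wurde unterzeichnet",
]
docs_translated = [translate_to_english(d) for d in docs_native]

vocab_native = CountVectorizer().fit(docs_native).get_feature_names_out()
vocab_translated = CountVectorizer().fit(docs_translated).get_feature_names_out()
print(len(vocab_native), len(vocab_translated))  # 12 fragmented terms vs. 4 shared terms
```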
Language-Specific Model Considerations and Options
When a multilingual model is insufficient, case teams may consider a language-specific solution. These models can offer improved accuracy by better capturing linguistic nuances, especially in languages with complex grammar or syntax. They also help reduce the risk of misinterpretation and loss of context during translation. However, there are drawbacks. Sourcing SMEs is often costly and time-intensive, and coding inconsistencies may arise if those SMEs lack familiarity with the issues. The training and validation burden also multiplies with each model, increasing SME workload and requiring the team to manage a distinct set of metrics per language. While a language-specific model can improve accuracy, the marginal gains may not justify the operational burden; case teams often find it more efficient and economical to use a single model that handles multiple languages with reasonable accuracy.
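As a rough illustration of that multiplied burden, the hypothetical sketch below routes training sets to one classifier per language, each carrying its own training data and validation metrics; it is not a description of any specific platform.

```python
# Hypothetical sketch: one classifier per language, each with its own
# training set and its own validation metrics. Every added language
# multiplies the SME coding and quality-control workload.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline

def train_per_language_models(training_sets):
    """training_sets: {'en': (docs, labels), 'ja': (docs, labels), ...}"""
    models, metrics = {}, {}
    for lang, (docs, labels) in training_sets.items():
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(docs, labels)
        preds = model.predict(docs)   # in practice, score a held-out validation sample
        metrics[lang] = {"recall": recall_score(labels, preds),
                         "precision": precision_score(labels, preds)}
        models[lang] = model
    return models, metrics
```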
Final Insights for Multilingual TAR in eDiscovery
Choosing the best approach for multilingual review is often an act of balancing the composition and complexity of the data (i.e., the scope and volume of the languages) with resources and time constraints. By identifying the language scope, planning model training and review, discussing the handling and production of non-English text, and choosing the appropriate approach early, organizations can navigate the complexities of multilingual data to mitigate downstream risks and avoid costly rework. There is no universal solution for multilingual data. Case teams should engage with a consultant to tailor an approach aligned with their specific data, priorities, and objectives.
While traditional TAR offers a structured approach to multilingual data, emerging tools are reshaping what’s possible. In future posts, we’ll explore how generative AI is transforming multilingual review.
Learn more about Epiq Document Review Services.
Desiree Marek, Analytics Consultant, Antitrust, Epiq
Desiree Marek is an Analytics Consultant for Advanced Technologies at Epiq, specializing in government investigations. She partners with outside counsel and review teams to consult on best practices and options for technology-assisted review (TAR).
Desiree has over two decades of eDiscovery experience. Prior to joining Epiq in 2020, she worked for two major eDiscovery vendors and in-house for a global law firm. Desiree holds bachelor's and master's degrees from the University of Montana and a wide range of eDiscovery certifications.
Desiree grew up in the mountains of Montana and currently resides in Northwest Washington.
The contents of this article are intended to convey general information only and not to provide legal advice or opinions.