zooman33/rcc-v16

RCC v16 - Reference Compliance Checker

Live demo → zooman33.github.io/rcc-v16: an interactive walkthrough with synthetic clinical trial data. No API key required.

A browser-based tool that compares new translations against client-provided reference files, flagging deviations from approved phrasing. Built for localization teams working on regulated content where reference reuse is mandatory.

About the demo: The live demo above simulates the RCC workflow using pre-computed matching results on fabricated data. It is not the real tool; it exists so recruiters and collaborators can see what RCC does without needing access to the proprietary codebase. See Note on code below.

The problem

In clinical trial localization, clients provide previously approved translations as reference files. Linguists are expected to reuse approved phrasing wherever the source text matches. Checking this manually means opening two files side by side and scanning segment by segment. On a 40-page protocol with 200+ reusable segments, that's hours of tedious cross-referencing per language pair, and things get missed.

What RCC does

RCC takes a new translation file (XLZ/XLIFF or DOCX) and one or more reference files, then uses an LLM to identify segments where the source text matches (or nearly matches) the reference source. For each match, it compares the new target against the approved target and flags any deviations.

How it works

Source file (XLZ/XLIFF/DOCX)
        |
        v
 +----------------+
 | Segment        |     Reference files (XLZ/XLIFF/DOCX)
 | Extraction     |            |
 +----------------+            v
        |              +----------------+
        |              | Segment        |
        |              | Extraction     |
        |              +----------------+
        v                      |
 +----------------+            |
 | Source-to-     |<-----------+
 | Source Match   |
 | (LLM layer)    |
 +----------------+
        |
        v
 +----------------+
 | Target-to-     |
 | Target Compare |
 | (LLM layer)    |
 +----------------+
        |
        v
 +----------------+
 | Deviation      |
 | Report         |
 +----------------+
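
The flow in the diagram can be sketched roughly as below. This is not the RCC source (which is not published); it is a minimal illustration that assumes a hypothetical segment shape of { id, source, target } and treats the two LLM layers as injected functions. The names checkCompliance, matchSource, compareTargets, stubMatch and stubCompare are made up for this example.

```javascript
// Hypothetical segment shape: { id, source, target }.
// matchSource and compareTargets stand in for the two LLM layers;
// they are injected so that any provider (or a test stub) can be plugged in.
async function checkCompliance(newSegments, referenceSegments, matchSource, compareTargets) {
  const report = [];
  for (const seg of newSegments) {
    // Layer 1: look for a reference segment whose source matches this one.
    const ref = await matchSource(seg, referenceSegments);
    if (!ref) continue; // no reusable reference, nothing to check

    // Layer 2: compare the new target against the approved target.
    const verdict = await compareTargets(seg.target, ref.target);
    if (verdict.deviates) {
      report.push({
        segmentId: seg.id,
        referenceId: ref.id,
        newTarget: seg.target,
        approvedTarget: ref.target,
        note: verdict.note,
      });
    }
  }
  return report;
}

// Stubs for local testing only: exact string comparison,
// NOT the semantic matching that the LLM layers perform.
const stubMatch = async (seg, refs) =>
  refs.find((r) => r.source === seg.source) ?? null;
const stubCompare = async (newTarget, approvedTarget) => ({
  deviates: newTarget !== approvedTarget,
  note: newTarget !== approvedTarget ? "target differs from approved text" : "",
});
```

Keeping the two layers as parameters means the loop itself never needs to know which provider answers the question, and the stubs let the report logic be exercised without an API key.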

Key design decisions

  • Pure LLM matching, no LCS heuristics. Earlier versions (v1-v12) used longest-common-subsequence and fuzzy string matching. These broke on paraphrased sources and produced too many false positives. v13+ switched to LLM-based semantic matching, which handles paraphrase, reordering, and partial matches much better.
  • Placeholder filtering. Clinical trial documents are full of placeholders ([Study Drug Name], <Protocol Number>) that differ between source and reference. RCC strips these before comparison so they don't trigger false deviations.
  • Multiple LLM provider support. Runs against Anthropic Claude by default, but supports swapping providers. The matching prompts are tuned for clinical/regulatory content.
  • Browser-based, runs locally. No server, no installation. Open the HTML file, paste your API key, drop your files. Deployed across the team via OneDrive/SharePoint.
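
To make the placeholder-filtering idea concrete, here is a minimal sketch. It assumes placeholders always appear in square or angle brackets, as in the two examples above; the regular expression and the stripPlaceholders name are illustrative, not RCC's actual rules.

```javascript
// Assumed placeholder syntax, based on the two examples in the text:
// square brackets ([Study Drug Name]) and angle brackets (<Protocol Number>).
const PLACEHOLDER = /\[[^\[\]]*\]|<[^<>]*>/g;

function stripPlaceholders(text) {
  return text
    .replace(PLACEHOLDER, " ") // drop the placeholder itself
    .replace(/\s+/g, " ")      // collapse the whitespace left behind
    .trim();
}

const newSource = stripPlaceholders("Take [Study Drug Name] once daily per <Protocol Number>.");
const refSource = stripPlaceholders("Take [Drug X] once daily per <Protocol 123>.");
console.log(newSource === refSource); // → true
```

A pattern this naive would also swallow legitimate text between comparison signs (for example a dose range written with "<" and ">"), so a production filter would likely need a stricter placeholder definition or an allow-list.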

Tech stack

  • HTML/CSS/JavaScript (single-file browser app)
  • Anthropic Claude API (for semantic matching)
  • XLZ/XLIFF parsing (JavaScript, client-side)
  • DOCX extraction (JSZip + XML parsing)
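
As an illustration of how a single-file browser app might call the Anthropic Messages API for the source-to-source match, the sketch below builds (but does not send) a request. The endpoint and headers are those of the public Messages API; the buildMatchRequest name, the prompt wording, and the JSON reply format are assumptions made for this example, and RCC's tuned prompts are not shown here.

```javascript
// Builds (but does not send) a fetch() request for the Anthropic Messages API.
// Function name, prompt wording, and reply format are illustrative assumptions.
function buildMatchRequest(apiKey, model, sourceSegment, referenceSources) {
  const numbered = referenceSources.map((s, i) => `${i + 1}. ${s}`).join("\n");
  const prompt =
    "Which numbered reference sentence, if any, has the same meaning as the new sentence?\n" +
    'Reply with JSON only: {"match": <number or null>}.\n\n' +
    `New sentence: ${sourceSegment}\n\nReference sentences:\n${numbered}`;

  return {
    url: "https://api.anthropic.com/v1/messages",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        // Opt-in header Anthropic requires for calls made directly from a browser page.
        "anthropic-dangerous-direct-browser-access": "true",
      },
      body: JSON.stringify({
        model, // the caller supplies a current model ID
        max_tokens: 256,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (inside an async function):
//   const { url, init } = buildMatchRequest(key, modelId, src, refs);
//   const reply = await fetch(url, init).then((r) => r.json());
```

Separating request construction from the fetch call keeps the provider-specific part in one small function, which is one plausible way to make providers swappable.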

What it replaced

Manual cross-referencing that took 2-4 hours per file per language pair, with an error rate that went up sharply after the first 100 segments (reviewer fatigue). RCC runs the same check in minutes and catches deviations that humans reliably miss on long documents.

Versions

Version   Change
v1-v12    LCS/fuzzy string matching, high false positive rate
v13       Switched to LLM-based matching
v14       Added placeholder filtering
v15       Multi-reference file support
v16       Multiple LLM provider support, improved prompt tuning

Note on code

This tool was built for internal use at Lionbridge on the Merck/MSD account. The source code contains client-specific logic and is not published here. This README documents the architecture, design decisions, and problem it solves.

If you're building something similar for your own localization team, I'm happy to talk through the approach. Reach out at [email protected].
