Live demo → zooman33.github.io/rcc-v16 — interactive walkthrough with synthetic clinical trial data. No API key required.
A browser-based tool that compares new translations against client-provided reference files, flagging deviations from approved phrasing. Built for localization teams working on regulated content where reference reuse is mandatory.
About the demo: The live demo above simulates the RCC workflow using pre-computed matching results on fabricated data. It is not the real tool — it exists so recruiters and collaborators can see what RCC does without needing access to the proprietary codebase. See Note on code below.
In clinical trial localization, clients provide previously approved translations as reference files. Linguists are expected to reuse approved phrasing wherever the source text matches. Checking this manually means opening two files side by side and scanning segment by segment. On a 40-page protocol with 200+ reusable segments, that's hours of tedious cross-referencing per language pair, and things get missed.
RCC takes a new translation file (XLZ/XLIFF or DOCX) and one or more reference files, then uses an LLM to identify segments where the source text matches (or nearly matches) the reference source. For each match, it compares the new target against the approved target and flags any deviations.
Source file (XLZ/XLIFF/DOCX)          Reference files (XLZ/XLIFF/DOCX)
         |                                        |
         v                                        v
 +--------------+                        +--------------+
 |   Segment    |                        |   Segment    |
 |  Extraction  |                        |  Extraction  |
 +--------------+                        +--------------+
         |                                        |
         v                                        |
 +----------------+                               |
 |   Source-to-   |<------------------------------+
 |  Source Match  |
 |   (LLM layer)  |
 +----------------+
         |
         v
 +----------------+
 |   Target-to-   |
 | Target Compare |
 |   (LLM layer)  |
 +----------------+
         |
         v
 +--------------+
 |  Deviation   |
 |    Report    |
 +--------------+
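The data flow above can be sketched as a small orchestration function. This is a minimal sketch, not the production code: the segment shape and the helper names (`matchSources`, `compareTargets`, `runCheck`) are illustrative assumptions, and the two LLM calls are stubbed with naive exact-match logic so the pipeline shape is visible.

```javascript
// Sketch of the RCC pipeline. In the real tool, matchSources and
// compareTargets call an LLM; here they are stubbed with naive logic
// purely to illustrate the data flow.
async function matchSources(newSegments, refSegments) {
  // Pair each new segment with a reference segment whose source text
  // matches. (The LLM layer also handles paraphrase and partial match;
  // this stub only does exact comparison.)
  return newSegments.flatMap(seg => {
    const ref = refSegments.find(r => r.source === seg.source);
    return ref ? [{ seg, ref }] : [];
  });
}

async function compareTargets(pairs) {
  // Flag pairs whose new target deviates from the approved target.
  return pairs
    .filter(({ seg, ref }) => seg.target !== ref.target)
    .map(({ seg, ref }) => ({
      source: seg.source,
      newTarget: seg.target,
      approvedTarget: ref.target,
    }));
}

async function runCheck(newSegments, refSegments) {
  const pairs = await matchSources(newSegments, refSegments);
  return compareTargets(pairs); // → deviation report
}
```
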
- Pure LLM matching, no LCS heuristics. Earlier versions (v1-v12) used longest-common-subsequence and fuzzy string matching, which broke on paraphrased sources and produced too many false positives. v13+ switched to LLM-based semantic matching, which handles paraphrase, reordering, and partial matches far better.
- Placeholder filtering. Clinical trial documents are full of placeholders ([Study Drug Name], <Protocol Number>) that differ between source and reference. RCC strips these before comparison so they don't trigger false deviations.
- Multiple LLM provider support. Runs against Anthropic Claude by default, but supports swapping providers. The matching prompts are tuned for clinical/regulatory content.
- Browser-based, runs locally. No server, no installation. Open the HTML file, paste your API key, drop your files. Deployed across the team via OneDrive/SharePoint.
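The placeholder-filtering step can be approximated with a regex pass. A minimal sketch, assuming the two bracket styles shown above ([...] and <...>); the production patterns may cover more formats:

```javascript
// Strip bracketed placeholders such as [Study Drug Name] or
// <Protocol Number> before comparison, so placeholder differences
// between new and reference text don't register as deviations.
function stripPlaceholders(text) {
  return text
    .replace(/\[[^\]]*\]/g, " ") // [square-bracket placeholders]
    .replace(/<[^>]*>/g, " ")    // <angle-bracket placeholders>
    .replace(/\s+/g, " ")        // collapse leftover whitespace
    .trim();
}
```
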
- HTML/CSS/JavaScript (single-file browser app)
- Anthropic Claude API (for semantic matching)
- XLZ/XLIFF parsing (JavaScript, client-side)
- DOCX extraction (JSZip + XML parsing)
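To illustrate the client-side parsing layer, here is a minimal XLIFF trans-unit extractor. It is regex-based for brevity and assumes XLIFF 1.2 element names (`trans-unit`, `source`, `target`); in the browser, a real implementation would more likely use `DOMParser`:

```javascript
// Minimal XLIFF 1.2 segment extractor (illustrative sketch only).
// Pulls { source, target } pairs out of each <trans-unit>.
function extractSegments(xliff) {
  const segments = [];
  const unitRe = /<trans-unit\b[^>]*>([\s\S]*?)<\/trans-unit>/g;
  for (const [, body] of xliff.matchAll(unitRe)) {
    const source = (body.match(/<source[^>]*>([\s\S]*?)<\/source>/) || [])[1];
    const target = (body.match(/<target[^>]*>([\s\S]*?)<\/target>/) || [])[1];
    if (source !== undefined) segments.push({ source, target: target ?? "" });
  }
  return segments;
}
```
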
Manual cross-referencing took 2-4 hours per file per language pair, with an error rate that rose sharply after the first 100 segments (reviewer fatigue). RCC runs the same check in minutes and catches deviations that humans reliably miss on long documents.
| Version | Change |
|---|---|
| v1-v12 | LCS/fuzzy string matching, high false positive rate |
| v13 | Switched to LLM-based matching |
| v14 | Added placeholder filtering |
| v15 | Multi-reference file support |
| v16 | Multiple LLM provider support, improved prompt tuning |
This tool was built for internal use at Lionbridge on the Merck/MSD account. The source code contains client-specific logic and is not published here. This README documents the architecture, design decisions, and problem it solves.
If you're building something similar for your own localization team, I'm happy to talk through the approach. Reach out at [email protected].