
zooman33/tc-qa-checker

TC QA Checker - Tracked Changes Verification Tool

A browser-based tool that extracts tracked changes from a source DOCX, aligns them against a target DOCX via XLZ bridge files, and injects threaded Word comments wherever the linguist failed to implement a source change. Built for amendment-heavy clinical trial protocols.

The problem

When a clinical trial protocol gets amended, the client sends the source document with tracked changes (insertions, deletions, formatting changes). The linguist must implement every tracked change (TC) in the target translation. Missing even one TC on a patient-facing document is a compliance issue.

The manual QA process: open the source, scroll to a tracked change, figure out what changed, open the target, find the corresponding segment, and check whether the change was implemented. Repeat for every TC. On amendment-heavy protocols (like MK-3475 Pembrolizumab studies), there can be hundreds of TCs in a single document. A manual pass takes hours, and human reviewers start missing changes after the first 50-60.

What TC QA Checker does

Drop in the source DOCX (with tracked changes), the target DOCX, and optionally an XLZ bridge file for segment alignment. The tool:

  1. Extracts every tracked change from the source DOCX by parsing the underlying XML (w:ins, w:del, and w:rPrChange markers)
  2. Aligns source segments to target segments using the XLZ bridge file's segment IDs, or falls back to positional matching
  3. Checks each TC against the target to verify the corresponding change was made
  4. Injects Word comments directly into the target DOCX at the exact location where a TC was missed or partially implemented
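
Step 1 can be illustrated with a minimal sketch. It assumes the text of word/document.xml is already available as a string (for example after unzipping), and it uses regular expressions only to stay dependency-free; a browser implementation would more likely walk the tree with an XML DOM parser. The function name extractTrackedChanges and the returned object shape are hypothetical, not the tool's actual code.

```javascript
// Hypothetical helper: collect insertions and deletions from document.xml text.
// Inserted text lives in <w:t> inside <w:ins>; deleted text lives in
// <w:delText> inside <w:del>. Simplifications: nested change blocks,
// formatting changes, and XML entity decoding are not handled.
function extractTrackedChanges(documentXml) {
  const changes = [];
  // Match <w:ins ...>...</w:ins> or <w:del ...>...</w:del>.
  // The attribute group refuses a trailing "/" so self-closing markers are skipped.
  const blockRe = /<w:(ins|del)\b([^>]*[^\/>])?>([\s\S]*?)<\/w:\1>/g;
  let block;
  while ((block = blockRe.exec(documentXml)) !== null) {
    const kind = block[1];
    const attrs = block[2] || "";
    const inner = block[3];
    const textTag = kind === "ins" ? "w:t" : "w:delText";
    const textRe = new RegExp(`<${textTag}\\b[^>]*>([^<]*)</${textTag}>`, "g");
    let text = "";
    let run;
    while ((run = textRe.exec(inner)) !== null) text += run[1];
    const author = (attrs.match(/w:author="([^"]*)"/) || [])[1] || null;
    changes.push({
      type: kind === "ins" ? "insertion" : "deletion",
      text,
      author,
    });
  }
  return changes;
}
```

Each returned entry carries the change type, the affected text, and the author attribute when present, which is enough input for a later matching step.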

The output is a copy of the target DOCX with threaded comments like:

"TC MISSED: Source deleted 'patients' and inserted 'participants' in this segment. Target still reads 'patients'."
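
A comment like the one above combines a status with a description of what the source changed and what the target still reads. The sketch below shows one way such a message could be produced for a delete-and-insert pair. It assumes the target-language strings that the change should remove and introduce (`expected.removed`, `expected.added`) are already known and non-empty; the function names, the "partial" status, and the "TC PARTIAL" label are hypothetical.

```javascript
// Hypothetical: decide whether a replacement is reflected in a target segment.
// Uses naive substring checks, so it can misjudge cases where one string
// contains the other (e.g. "patient" vs "patients").
function classifyReplacement(targetText, expected) {
  const oldRemains = targetText.includes(expected.removed);
  const newPresent = targetText.includes(expected.added);
  if (!oldRemains && newPresent) return "implemented";
  if (oldRemains && !newPresent) return "missed";
  return "partial";
}

// Hypothetical: build the comment text, or null when nothing needs flagging.
function buildFinding(sourceChange, expected, targetText) {
  const status = classifyReplacement(targetText, expected);
  if (status === "implemented") return null;
  const what =
    `Source deleted '${sourceChange.deleted}' and inserted ` +
    `'${sourceChange.inserted}' in this segment.`;
  if (status === "missed") {
    return `TC MISSED: ${what} Target still reads '${expected.removed}'.`;
  }
  const detail = targetText.includes(expected.removed)
    ? `Target contains '${expected.added}' but still reads '${expected.removed}'.`
    : `Target no longer reads '${expected.removed}' but does not contain '${expected.added}'.`;
  return `TC PARTIAL: ${what} ${detail}`;
}
```

Returning null for implemented changes keeps the annotated file free of noise: only segments that need the linguist's attention receive a comment.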

Architecture

Source DOCX (with TCs)      XLZ Bridge File (optional)
        |                           |
        v                           v
+---------------+           +---------------+
| TC Extraction |           | Segment ID    |
| (XML parsing) |           | Mapping       |
+---------------+           +---------------+
        |                           |
        +------------+  +-----------+
                     |  |
                     v  v
              +---------------+
              | Alignment &   |
              | TC Matching   |
              +---------------+
                      |
                      v               Target DOCX
              +---------------+            |
              | Comment       |<-----------+
              | Injection     |
              +---------------+
                      |
                      v
              Annotated Target DOCX
              (with threaded comments)

Key design decisions

  • XML-level TC extraction. Libraries such as python-docx don't expose tracked changes through their API, so the tool unzips the DOCX with JSZip and parses word/document.xml directly. This catches insertions, deletions, formatting changes, and moved content.
  • XLZ bridge alignment. In localization workflows, the XLZ file contains paired source/target segments with IDs. Using these IDs for alignment is far more reliable than trying to positionally match paragraphs between source and target.
  • Comment injection, not a separate report. QA findings go directly into the target DOCX as Word comments. The linguist opens one file, sees the comments in context, and fixes them. No need to cross-reference a separate spreadsheet.
  • Threaded comments. When a TC is partially implemented (e.g., deletion was done but insertion was missed), the comment explains exactly what's missing, not just "check this segment."
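
The alignment decision (segment IDs when a bridge file is available, positional matching otherwise) can be sketched as follows. The segment shape `{ id, text }` and the function name alignSegments are assumptions made for illustration; the actual XLZ parsing is not shown.

```javascript
// Hypothetical: pair source and target segments.
// If every segment carries an id (as when a bridge file supplied them),
// join on id; otherwise fall back to pairing by position.
function alignSegments(sourceSegs, targetSegs) {
  const haveIds =
    sourceSegs.every((s) => s.id != null) &&
    targetSegs.every((t) => t.id != null);

  if (haveIds) {
    const byId = new Map(targetSegs.map((t) => [t.id, t]));
    return sourceSegs.map((s) => ({
      source: s,
      target: byId.get(s.id) ?? null, // null: no counterpart in the target
      method: "id",
    }));
  }

  return sourceSegs.map((s, i) => ({
    source: s,
    target: targetSegs[i] ?? null,
    method: "position",
  }));
}
```

Recording the method on each pair lets a later step treat positionally matched findings with more caution, since a single added or removed paragraph shifts every pair after it.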

Tech stack

  • HTML/CSS/JavaScript (browser-based)
  • JSZip (DOCX/XLZ unpacking)
  • XML DOM parsing (TC extraction and comment injection)
  • OOXML manipulation (injecting w:comment elements into the DOCX package)
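
As a rough illustration of the OOXML side, the sketch below builds the two pieces a Word comment needs: a w:comment entry for word/comments.xml and the range markers that anchor it in word/document.xml. The function names and parameters are hypothetical. A complete package also needs the comments part registered in [Content_Types].xml and in word/_rels/document.xml.rels, and reply threading involves word/commentsExtended.xml; none of that is shown here.

```javascript
// Escape text for use in XML element content and attribute values.
function escapeXml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical: one <w:comment> entry for word/comments.xml.
// `date` is an ISO 8601 string supplied by the caller.
function buildCommentXml(id, author, text, date) {
  return (
    `<w:comment w:id="${id}" w:author="${escapeXml(author)}" w:date="${date}">` +
    `<w:p><w:r><w:t xml:space="preserve">${escapeXml(text)}</w:t></w:r></w:p>` +
    `</w:comment>`
  );
}

// Hypothetical: markers for word/document.xml. `start` goes before the first
// annotated run, `end` after the last one; the ids must match the comment's id.
function buildAnchors(id) {
  return {
    start: `<w:commentRangeStart w:id="${id}"/>`,
    end:
      `<w:commentRangeEnd w:id="${id}"/>` +
      `<w:r><w:commentReference w:id="${id}"/></w:r>`,
  };
}
```

Escaping matters in practice because finding text often quotes segment content, which can contain ampersands or angle brackets that would otherwise corrupt the package.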

What it replaced

Multi-hour manual cross-checking on amendment-heavy protocols. A sub-minute automated pass now catches TC omissions that were routinely missed on documents with 100+ tracked changes. The tool was specifically built after a Merck ES-XL TC omission incident that required a formal root-cause analysis.

Note on code

Built for internal use at Lionbridge. Source code is not published due to client-specific logic. This README documents the architecture and approach.

About

Tracked Changes QA — extracts TCs from source DOCX, verifies linguist implementation, injects Word comments. Live demo: zooman33.github.io/tc-qa-checker