Releases · PlateerLab/document-adapter

Highlights

Building on v0.2.0's merge-aware HWPX editing, this release brings the same merge handling to DOCX and PPTX and adds two label-friendly APIs, get_cell and append_to_cell, aimed at LLM form filling.

New APIs

get_cell(table_index, row, col): returns a CellContent with the full (untruncated) text, a per-paragraph list, is_anchor, anchor, span, and nested_table_indices. Complements the 40-character preview from inspect_document.
append_to_cell(table_index, row, col, value, separator=" "): appends value to the existing cell text, so a Korean form label such as "성 명" ("Name") can be filled in as "성 명 홍길동" without retyping the label. This is especially useful for government forms (관공서 양식).
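The label-preserving behavior of append_to_cell can be illustrated with a minimal stand-in. The helper name append_text and its handling of an empty cell are assumptions made for illustration, not the library's implementation; only the separator default comes from the signature above.

```python
def append_text(existing: str, value: str, separator: str = " ") -> str:
    """Return the existing cell text with value appended (hypothetical helper)."""
    # Assumption: an empty cell receives the bare value, with no leading separator.
    if not existing:
        return value
    return f"{existing}{separator}{value}"


label = "성 명"  # form label already present in the cell
filled = append_text(label, "홍길동")
print(filled)  # 성 명 홍길동
```

The point of the pattern is that the caller never has to reproduce the label text, so spacing inside labels such as "성 명" cannot be altered by accident.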

Format parity

DOCX: merge detection via _tc identity dedup + position bounding box. preview[r][c] is None for non-anchor slots, set_cell rejects non-anchor writes, nested tables traversed with flat DFS and parent_path.
PPTX: uses cell.is_merge_origin / cell.is_spanned / cell.span_height / cell.span_width. Merge origin is reverse-scanned when writing to a spanned cell.
HWPX: append_row is now implemented by deep-copying the last row and updating cellAddr.rowAddr / rowCnt. Tables with cross-row merges are rejected to prevent corruption.
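The "identity dedup + position bounding box" idea from the DOCX item can be sketched on a plain grid. This is a simplified model, not the adapter's code: the Cell class stands in for the shared underlying element, and the (rows, cols) span layout and the function names are assumptions.

```python
class Cell:
    """Stand-in for an underlying cell element; merged slots share one instance."""

    def __init__(self, text):
        self.text = text


def detect_merges(grid):
    """Group grid slots by object identity and derive each group's bounding box."""
    boxes = {}  # id(cell) -> [min_row, min_col, max_row, max_col]
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            box = boxes.setdefault(id(cell), [r, c, r, c])
            box[0], box[1] = min(box[0], r), min(box[1], c)
            box[2], box[3] = max(box[2], r), max(box[3], c)
    merges = []
    for min_r, min_c, max_r, max_c in boxes.values():
        span = (max_r - min_r + 1, max_c - min_c + 1)  # assumed (rows, cols)
        if span != (1, 1):
            merges.append({"anchor": (min_r, min_c), "span": span})
    return merges


def build_preview(grid):
    """Show text at the first (top-left) slot of each cell and None elsewhere."""
    seen, preview = set(), []
    for row in grid:
        out = []
        for cell in row:
            out.append(None if id(cell) in seen else cell.text)
            seen.add(id(cell))
        preview.append(out)
    return preview


name, value, note = Cell("Name"), Cell("Hong"), Cell("Note")
grid = [
    [name, name, value],
    [name, name, value],
    [note, note, note],
]
merges = detect_merges(grid)
preview = build_preview(grid)
```

Because merged regions are rectangular, the first slot encountered in row-major order is always the top-left anchor, which is why a single pass is enough for the preview.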

Custom exceptions

All subclass their stdlib counterparts, so existing except ValueError / except IndexError handlers keep working:

MergedCellWriteError(ValueError)
CellOutOfBoundsError(IndexError)
TableIndexError(IndexError)
NotImplementedForFormat(NotImplementedError)
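A short sketch of why this hierarchy is backward compatible. The classes are redefined locally so the example is self-contained (the package's import path is not stated here), and legacy_caller is a hypothetical function standing in for pre-0.3.0 calling code.

```python
class MergedCellWriteError(ValueError):
    pass


class CellOutOfBoundsError(IndexError):
    pass


class TableIndexError(IndexError):
    pass


class NotImplementedForFormat(NotImplementedError):
    pass


def legacy_caller(action):
    """Calling code written before the custom exceptions existed."""
    try:
        action()
    except ValueError as exc:
        return f"ValueError handler caught {type(exc).__name__}"
    except IndexError as exc:
        return f"IndexError handler caught {type(exc).__name__}"
    return "ok"


def write_to_merged_slot():
    raise MergedCellWriteError("slot (0, 1) is not a merge anchor")


def read_past_last_row():
    raise CellOutOfBoundsError("row 99 is out of range")


print(legacy_caller(write_to_merged_slot))
print(legacy_caller(read_past_last_row))
```

New code can catch the narrower classes directly, while old handlers continue to see the stdlib base types.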

MCP / Claude API tools

Two new tools added to TOOL_DEFINITIONS:

get_cell — for when the preview is truncated or you need structural detail
append_to_cell — for label-preserving fills

Verification

Tested against real Korean government forms, including a payment suspension request form (지급정지요청서; 28×16 grid, 57 merges). Filling 12 labeled fields via append_to_cell keeps every label intact and preserves all 57 merge anchors.

Tests

35 passing (12 new regression tests covering get_cell, append_to_cell, DOCX merge detection, PPTX merge detection, HWPX append_row).

Install

```bash
pip install -U document-adapter==0.3.0
```

PyPI: https://pypi.org/project/document-adapter/0.3.0/

Migration notes (from 0.2.0)

DocumentAdapter now has two additional abstract methods (get_cell, append_to_cell). Custom subclasses must implement them.
No other breaking changes.
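What the new abstract methods mean for a custom subclass can be shown with a stand-in base class. AdapterBase below is hypothetical and much smaller than DocumentAdapter, and get_cell returns a plain string here instead of a CellContent; only the two method signatures are taken from these notes.

```python
from abc import ABC, abstractmethod


class AdapterBase(ABC):  # hypothetical stand-in for DocumentAdapter
    @abstractmethod
    def get_cell(self, table_index, row, col):
        ...

    @abstractmethod
    def append_to_cell(self, table_index, row, col, value, separator=" "):
        ...


class OldSubclass(AdapterBase):
    """Written against 0.2.0: implements neither new method."""


class UpdatedSubclass(AdapterBase):
    def __init__(self):
        self._cells = {}  # (table_index, row, col) -> text

    def get_cell(self, table_index, row, col):
        return self._cells.get((table_index, row, col), "")

    def append_to_cell(self, table_index, row, col, value, separator=" "):
        existing = self.get_cell(table_index, row, col)
        combined = f"{existing}{separator}{value}" if existing else value
        self._cells[(table_index, row, col)] = combined


error = None
try:
    OldSubclass()  # abstract methods missing, so instantiation fails
except TypeError as exc:
    error = str(exc)

adapter = UpdatedSubclass()
adapter.append_to_cell(0, 0, 0, "성 명")
adapter.append_to_cell(0, 0, 0, "홍길동")
```

The failure happens at instantiation time, not at import time, so an outdated subclass surfaces the problem the first time it is constructed.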

Highlights

Korean HWPX documents often rely heavily on merged cells (government forms, report tables, reward payment application forms, and so on). Before this release, the adapter silently flattened merged cells in get_tables(), misleading LLMs into writing to the wrong positions and overwriting merged headers.

Changes

Merge structure exposed: TableSchema.preview reports None for non-anchor slots of a merge; TableSchema.merges lists every merged region as {anchor, span}.
set_cell guard: writing to a non-anchor merge coordinate raises ValueError with a clear message pointing to the anchor. Opt in to legacy auto-redirect via allow_merge_redirect=True (emits a warning).
Nested table support: cells containing tables are traversed with flat DFS indexing; each table exposes parent_path (e.g. .tables[127].cell(0,0)).
Cell text isolation: outer cell previews no longer include nested table text (fixed a descendant-walk leak in python-hwpx's paragraph.text).
render_template optimization: each merged anchor is visited once instead of being re-touched per logical slot.
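The flat DFS indexing of nested tables from the list above can be sketched with a toy data model. The Cell and Table classes, the pre-order numbering, and the choice of None as the parent_path of top-level tables are assumptions; only the path format is modeled on the example given.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    text: str = ""
    tables: list = field(default_factory=list)  # tables nested inside this cell


@dataclass
class Table:
    rows: list  # list of rows, each a list of Cell


def flatten_tables(top_level):
    """Return (table, parent_path) pairs in depth-first, pre-order sequence."""
    flat = []

    def visit(table, parent_path):
        index = len(flat)  # flat index assigned before descending
        flat.append((table, parent_path))
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row):
                for nested in cell.tables:
                    visit(nested, f".tables[{index}].cell({r},{c})")

    for table in top_level:
        visit(table, None)
    return flat


inner = Table([[Cell("x")]])
outer = Table([[Cell("a", [inner]), Cell("b")]])
second = Table([[Cell("c")]])

flat = flatten_tables([outer, second])
paths = [path for _, path in flat]
```

With this numbering a nested table always receives an index directly after its parent, so a single integer is enough to address any table regardless of depth.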

Verification

Tested against a real 865 KB government HWPX file with:

188 tables
44 tables with merges
Most complex: a 22×43 grid with 70 merges
Full 27×17 reward payment application form (포상금 지급신청서) round-tripped through set_cell across 24 fields, with all 58 merge regions preserved.

23 automated tests pass, including 9 new HWPX-specific regression tests.

Migration notes

TableSchema.preview type changed from list[list[str]] to list[list[str | None]]. If you dereference preview slots, handle None.
Set allow_merge_redirect=True on set_cell to restore pre-0.2 behavior of writing merged-slot coordinates to the anchor.
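The guard and the opt-in redirect can be modeled as a small coordinate check. This is a sketch under stated assumptions: resolve_write_target is a hypothetical helper, and the merges layout (anchor as (row, col), span as (rows, cols)) is assumed; the library's actual message text and warning class are not known here.

```python
import warnings


def resolve_write_target(merges, row, col, allow_merge_redirect=False):
    """Return the coordinate a write should land on, or raise for merged slots."""
    for merge in merges:
        (a_row, a_col), (n_rows, n_cols) = merge["anchor"], merge["span"]
        inside = a_row <= row < a_row + n_rows and a_col <= col < a_col + n_cols
        if inside and (row, col) != (a_row, a_col):
            if not allow_merge_redirect:
                raise ValueError(
                    f"({row}, {col}) is inside a merged region; "
                    f"write to its anchor ({a_row}, {a_col}) instead"
                )
            warnings.warn(f"redirecting ({row}, {col}) to anchor ({a_row}, {a_col})")
            return (a_row, a_col)
    return (row, col)  # plain cell or the anchor itself


merges = [{"anchor": (0, 0), "span": (2, 2)}]

print(resolve_write_target(merges, 0, 0))  # (0, 0)
print(resolve_write_target(merges, 2, 2))  # (2, 2)
```

Callers that read preview first can avoid the guard entirely by skipping any slot whose preview value is None.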

Install

```bash
pip install -U document-adapter==0.2.0
```

PyPI: https://pypi.org/project/document-adapter/0.2.0/

Fix

Empty cells still lost formatting in v0.1.1.

The previous fix only covered cells that already had runs. Empty cells still fell back to cell.text = value, which silently dropped font information stored in <a:endParaRPr> (PPTX) and <w:pPr><w:rPr> (DOCX). Real-world templates stash font/size/bold there so unfilled cells still render with the template style — v0.1.1 was writing into them but losing that style in the process.

PPTX empty-cell path

When the text frame has no runs, manually build an <a:r> inside the existing <a:p>, clone <a:endParaRPr> into the new run's <a:rPr>, and write the value into <a:t>. Preserves lang, sz, b, latin typeface, etc.

```xml
<!-- before set_cell -->
<a:p>
  <a:endParaRPr lang="en-US" sz="1800" b="1">
    <a:latin typeface="Microsoft Sans Serif"/>
  </a:endParaRPr>
</a:p>

<!-- after v0.1.2 set_cell(..., "V-2024-001") -->
<a:p>
  <a:endParaRPr lang="en-US" sz="1800" b="1">
    <a:latin typeface="Microsoft Sans Serif"/>
  </a:endParaRPr>
  <a:r>
    <a:rPr lang="en-US" sz="1800" b="1">
      <a:latin typeface="Microsoft Sans Serif"/>
    </a:rPr>
    <a:t>V-2024-001</a:t>
  </a:r>
</a:p>
```
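The clone step can be sketched with the standard library's ElementTree in place of the lxml objects python-pptx works with. The function name fill_empty_paragraph is hypothetical, and this is an illustration of the technique, not the adapter's code.

```python
import copy
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
ET.register_namespace("a", A_NS)


def q(tag):
    """Qualify a DrawingML tag name with its namespace."""
    return f"{{{A_NS}}}{tag}"


def fill_empty_paragraph(p, value):
    """Add a run to a run-less <a:p>, cloning <a:endParaRPr> into its <a:rPr>."""
    end = p.find(q("endParaRPr"))
    run = ET.Element(q("r"))
    if end is not None:
        rpr = copy.deepcopy(end)  # copies attributes and children such as <a:latin>
        rpr.tag = q("rPr")
        rpr.tail = None
        run.append(rpr)
    text = ET.SubElement(run, q("t"))
    text.text = value
    if end is not None:
        p.insert(list(p).index(end), run)  # keep endParaRPr as the last child
    else:
        p.append(run)
    return run


paragraph = ET.fromstring(
    f'<a:p xmlns:a="{A_NS}">'
    '<a:endParaRPr lang="en-US" sz="1800" b="1">'
    '<a:latin typeface="Microsoft Sans Serif"/>'
    "</a:endParaRPr></a:p>"
)
run = fill_empty_paragraph(paragraph, "V-2024-001")
print(ET.tostring(paragraph, encoding="unicode"))
```

The sketch places the run ahead of <a:endParaRPr>, the child order the DrawingML schema defines for a paragraph; that placement is a choice made here and not something these notes specify.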

DOCX empty-cell path

Use paragraphs[0].add_run(value) instead of cell.text, then clone <w:pPr><w:rPr> into the new run's <w:rPr>. Preserves rFonts, sz, bold that templates stash on the paragraph marker.
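The DOCX variant of the same idea, again modeled with the standard library's ElementTree and not with python-docx itself. The function name add_run_with_marker_format is hypothetical; the sample values mirror the ones used in the tests listed below.

```python
import copy
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a WordprocessingML tag or attribute name with its namespace."""
    return f"{{{W_NS}}}{tag}"


def add_run_with_marker_format(p, value):
    """Append a <w:r> whose <w:rPr> is a copy of the paragraph-marker <w:rPr>."""
    marker_rpr = p.find(f"{w('pPr')}/{w('rPr')}")
    run = ET.SubElement(p, w("r"))  # runs come after <w:pPr> in a paragraph
    if marker_rpr is not None:
        run.append(copy.deepcopy(marker_rpr))
    text = ET.SubElement(run, w("t"))
    text.text = value
    return run


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:rPr>'
    '<w:rFonts w:eastAsia="Malgun Gothic"/><w:b/><w:sz w:val="36"/>'
    "</w:rPr></w:pPr></w:p>"
)
run = add_run_with_marker_format(paragraph, "V-2024-001")
run_rpr = run.find(w("rPr"))
print(ET.tostring(paragraph, encoding="unicode"))
```

The paragraph-marker properties are copied, not moved, so a later empty paragraph created from the same template still carries its own formatting.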

Tests

test_pptx_empty_cell_preserves_endpararpr — crafts a cell with only <a:endParaRPr> (sz=1800, b=1, Microsoft Sans Serif), calls set_cell, asserts the new run carries the cloned rPr
test_docx_empty_cell_preserves_ppr_rpr — crafts a cell with only <w:pPr><w:rPr> (Malgun Gothic, sz=36, bold), calls set_cell, asserts the new run carries the cloned rPr

All 13 smoke tests pass. Reverting either fix to the old cell.text fallback makes the new tests fail with "new run lost its rPr", which confirms that they catch the regression.

Upgrade

```bash
pip install --upgrade document-adapter  # >= 0.1.2
```

If you were keeping a downstream monkey-patch for the empty-cell case, you can now remove it.

Fix

#1 — set_cell and append_row dropped run-level formatting in DOCX and PPTX.

`python-docx` and `python-pptx`'s `cell.text = value` setter deletes every run and creates a fresh one with default font/size, so any template formatting (font name, size, bold, color) was lost whenever the adapter modified a cell. This release mirrors `HwpxAdapter`'s run-preserving strategy: reuse the first existing run and blank the rest, falling back to the default setter only when the cell is truly empty.

Also fixes a subtle bug where `python-pptx` returns a fresh Python wrapper on every `paragraphs` access, causing `para is first_para` comparisons to fail and the "clear other paragraphs" loop to erase the run we just populated. The fix compares paragraphs by index instead.
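Both points, reusing the first run and comparing paragraphs by index, can be reproduced with a toy model. Every class here is a stand-in written for this sketch; the only behavior borrowed from the description above is that each `paragraphs` access returns fresh wrapper objects.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    font: str


@dataclass
class ParaData:
    runs: list


class ParaWrapper:
    """Proxy object created anew on every access."""

    def __init__(self, data):
        self._data = data

    @property
    def runs(self):
        return self._data.runs


class TextFrame:
    def __init__(self, paras):
        self._paras = paras

    @property
    def paragraphs(self):
        return [ParaWrapper(p) for p in self._paras]  # fresh wrappers each time


def set_text_preserving(frame, value):
    """Write value into the first run and blank the rest, keeping run formatting."""
    paragraphs = frame.paragraphs
    if not paragraphs or not paragraphs[0].runs:
        return False  # no run to reuse; the caller would take the empty-cell path
    for p_index, para in enumerate(paragraphs):
        for r_index, run in enumerate(para.runs):
            is_target = p_index == 0 and r_index == 0  # index test, not `is`
            run.text = value if is_target else ""
    return True


frame = TextFrame([
    ParaData([Run("old", "Malgun Gothic"), Run(" tail", "Malgun Gothic")]),
    ParaData([Run("second", "Arial")]),
])
same_object = frame.paragraphs[0] is frame.paragraphs[0]
updated = set_text_preserving(frame, "new")
texts = [[run.text for run in para.runs] for para in frame.paragraphs]
```

Here same_object is False even though both expressions refer to the first paragraph, which is exactly why an identity check would treat the first paragraph as "other" and blank the run that was just filled.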

Tests

`test_docx_set_cell_preserves_font` — verifies Malgun Gothic / 18pt / bold survive set_cell
`test_pptx_set_cell_preserves_font` — same for PPTX
`test_docx_append_row_preserves_formatting` — verifies new rows inherit the template row formatting

Total: 11 smoke tests passing.

Install

```bash
pip install --upgrade document-adapter
```

Upgrade guidance

If you were running a `cell.text`/monkey-patch workaround downstream (e.g. xgen-workflow), you can remove it after upgrading.

Closes #1.

v0.3.0 — get_cell, append_to_cell, DOCX/PPTX merge parity
v0.2.0 — Merge-aware HWPX editing
v0.1.2 — preserve empty-cell formatting
v0.1.1 — preserve run formatting in set_cell/append_row