Releases: PlateerLab/document-adapter
v0.3.0 — get_cell, append_to_cell, DOCX/PPTX merge parity
Highlights
Building on v0.2.0's merge-aware HWPX editing, this release extends merge parity to DOCX/PPTX and adds two label-friendly APIs, `get_cell` and `append_to_cell`, so an LLM can read a cell's full content and fill a labeled cell without overwriting the label.
New APIs
- `get_cell(table_index, row, col)`: returns `CellContent` with the full (untruncated) text, per-paragraph list, `is_anchor`, `anchor`, `span`, and `nested_table_indices`. Complements `inspect_document`'s 40-char preview.
- `append_to_cell(table_index, row, col, value, separator=" ")`: appends `value` to the existing cell text so Korean form labels like `"성 명"` ("Name") can be filled to `"성 명 홍길동"` without manually re-typing the label. A major usability gain for Korean government office forms (관공서 양식).
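The label-preserving append can be pictured with a few lines of plain Python. This is only a sketch of the string handling, not the adapter's code: `append_text` is a hypothetical helper, and the empty-cell behavior (no leading separator) is an assumption.

```python
def append_text(existing: str, value: str, separator: str = " ") -> str:
    """Hypothetical stand-in for the string step behind append_to_cell."""
    if not existing:
        # Assumption: an empty cell receives the value alone, without a separator.
        return value
    # The existing label is kept verbatim; the value is added after the separator.
    return f"{existing}{separator}{value}"


label = "성 명"
filled = append_text(label, "홍길동")
print(filled)
```

Passing a different `separator` (for example `"\n"`) would place the value on its own line under the label.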
Format parity
- DOCX: merge detection via `_tc` identity dedup + position bounding box. `preview[r][c]` is `None` for non-anchor slots, `set_cell` rejects non-anchor writes, nested tables traversed with flat DFS and `parent_path`.
- PPTX: uses `cell.is_merge_origin` / `cell.is_spanned` / `cell.span_height` / `cell.span_width`. Merge origin is reverse-scanned when writing to a spanned cell.
- HWPX: `append_row` now implemented via last-row `deepcopy` with `cellAddr.rowAddr` / `rowCnt` updated. Tables with cross-row merges are refused to prevent corruption.
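The DOCX approach (identity dedup plus a bounding box per cell) can be illustrated without any document library. The sketch below assumes that every slot of a merged region resolves to the same underlying object, which is what deduplicating on `_tc` identity implies. The `Cell` class, `detect_merges`, and the `(rows, cols)` span layout are hypothetical names and shapes chosen for the example.

```python
class Cell:
    """Minimal stand-in for a table cell; merged slots share one instance."""

    def __init__(self, text):
        self.text = text


def detect_merges(grid):
    # Pass 1: bounding box [top, left, bottom, right] per distinct cell object.
    boxes = {}
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            box = boxes.setdefault(id(cell), [r, c, r, c])
            box[0], box[1] = min(box[0], r), min(box[1], c)
            box[2], box[3] = max(box[2], r), max(box[3], c)

    # Pass 2: only the top-left slot (the anchor) gets text; other slots stay None.
    preview = [[None] * len(row) for row in grid]
    merges = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            top, left, bottom, right = boxes[id(cell)]
            if (r, c) != (top, left):
                continue
            preview[r][c] = cell.text
            span = (bottom - top + 1, right - left + 1)
            if span != (1, 1):
                merges.append({"anchor": (r, c), "span": span})
    return preview, merges


title = Cell("Title")
grid = [
    [title, title, Cell("x")],
    [title, title, Cell("y")],
]
preview, merges = detect_merges(grid)
```

Here the 2×2 `title` region yields one merge entry anchored at `(0, 0)`, and its three non-anchor slots appear as `None` in `preview`.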
Custom exceptions
Each one subclasses its standard-library counterpart, so existing `except ValueError` / `except IndexError` handlers keep working:
- `MergedCellWriteError(ValueError)`
- `CellOutOfBoundsError(IndexError)`
- `TableIndexError(IndexError)`
- `NotImplementedForFormat(NotImplementedError)`
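To show why older handlers are unaffected, the sketch below redeclares the four classes locally with the bases listed above and routes them through a handler that only knows the standard-library types. The class bodies, the `legacy_caller` and `raiser` helpers, and the error messages are illustrative assumptions, not the package's code.

```python
# Local re-declarations mirroring the hierarchy in the release notes.
class MergedCellWriteError(ValueError):
    pass


class CellOutOfBoundsError(IndexError):
    pass


class TableIndexError(IndexError):
    pass


class NotImplementedForFormat(NotImplementedError):
    pass


def legacy_caller(action):
    """Handler written before 0.3.0: it only catches stdlib exception types."""
    try:
        action()
    except ValueError as exc:
        return f"ValueError: {exc}"
    except IndexError as exc:
        return f"IndexError: {exc}"
    except NotImplementedError as exc:
        return f"NotImplementedError: {exc}"
    return "ok"


def raiser(exc):
    """Build a zero-argument callable that raises the given exception."""
    def _raise():
        raise exc
    return _raise


merged = legacy_caller(raiser(MergedCellWriteError("write to the anchor instead")))
bounds = legacy_caller(raiser(CellOutOfBoundsError("row 99 out of range")))
```

New code can catch the narrower classes directly when it needs to tell a merged-cell write apart from other `ValueError`s.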
MCP / Claude API tools
Two new tools added to TOOL_DEFINITIONS:
- `get_cell`: for when the preview is truncated or you need structural detail
- `append_to_cell`: for label-preserving fills
Verification
Tested against real Korean government forms, including the 지급정지요청서 (payment suspension request form; 28×16 grid, 57 merges). Filling 12 labeled fields via `append_to_cell` keeps every label intact and preserves all 57 merge anchors.
Tests
35 passing (12 new regression tests covering get_cell, append_to_cell, DOCX merge detection, PPTX merge detection, HWPX append_row).
Install
pip install -U document-adapter==0.3.0
PyPI: https://pypi.org/project/document-adapter/0.3.0/
Migration notes (from 0.2.0)
- `DocumentAdapter` now has two additional abstract methods (`get_cell`, `append_to_cell`). Custom subclasses must implement them.
- No other breaking changes.
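The practical effect on custom subclasses can be sketched with Python's `abc` module. The `DocumentAdapter` below is a local stand-in that declares only the two new methods; the real base class has more members, and the real `get_cell` returns a `CellContent` rather than a plain string. Both subclass names are hypothetical.

```python
from abc import ABC, abstractmethod


class DocumentAdapter(ABC):
    """Local stand-in for the real base class (assumption: only the new methods shown)."""

    @abstractmethod
    def get_cell(self, table_index, row, col):
        ...

    @abstractmethod
    def append_to_cell(self, table_index, row, col, value, separator=" "):
        ...


class OldCustomAdapter(DocumentAdapter):
    """Written against 0.2.0: does not define the two new methods."""


class UpdatedCustomAdapter(DocumentAdapter):
    def __init__(self):
        self._cells = {}  # (table_index, row, col) -> text

    def get_cell(self, table_index, row, col):
        # Simplification: returns plain text instead of a CellContent object.
        return self._cells.get((table_index, row, col), "")

    def append_to_cell(self, table_index, row, col, value, separator=" "):
        existing = self.get_cell(table_index, row, col)
        new_text = f"{existing}{separator}{value}" if existing else value
        self._cells[(table_index, row, col)] = new_text


failure = None
try:
    OldCustomAdapter()  # abstract methods missing, so instantiation fails
except TypeError as exc:
    failure = str(exc)

adapter = UpdatedCustomAdapter()
adapter.append_to_cell(0, 0, 0, "label")
adapter.append_to_cell(0, 0, 0, "value")
```

A subclass that has not been updated fails at instantiation time with `TypeError`, so the gap surfaces immediately rather than on the first call.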
v0.2.0 — Merge-aware HWPX editing
Highlights
Korean HWPX documents often rely heavily on merged cells (government office forms, report tables, reward payment application forms, and so on). Before this release, the adapter silently flattened merged cells in `get_tables()`, misleading LLMs into writing to the wrong positions and overwriting merged headers.
Changes
- Merge structure exposed: `TableSchema.preview` reports `None` for non-anchor slots of a merge; `TableSchema.merges` lists every merged region as `{anchor, span}`.
- set_cell guard: writing to a non-anchor merge coordinate raises `ValueError` with a clear message pointing to the anchor. Opt in to legacy auto-redirect via `allow_merge_redirect=True` (emits a warning).
- Nested table support: cells containing tables are traversed with flat DFS indexing; each table exposes `parent_path` (e.g. `.tables[127].cell(0,0)`).
- Cell text isolation: outer cell previews no longer include nested table text (fixed a descendant-walk leak in python-hwpx's `paragraph.text`).
- render_template optimization: each merged anchor is visited once instead of being re-touched per logical slot.
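The write guard and its opt-in redirect can be sketched as a small coordinate check. This is a sketch under stated assumptions, not the adapter's implementation: `find_anchor` and `resolve_write_target` are hypothetical names, and each merge is assumed to be a dict with `anchor` as `(row, col)` and `span` as `(rows, cols)`.

```python
import warnings


def find_anchor(merges, row, col):
    """Return the anchor of the merge covering (row, col), or (row, col) itself."""
    for merge in merges:
        (a_row, a_col), (rows, cols) = merge["anchor"], merge["span"]
        if a_row <= row < a_row + rows and a_col <= col < a_col + cols:
            return (a_row, a_col)
    return (row, col)


def resolve_write_target(merges, row, col, allow_merge_redirect=False):
    anchor = find_anchor(merges, row, col)
    if anchor == (row, col):
        return anchor  # unmerged cell or the anchor itself: write directly
    if not allow_merge_redirect:
        raise ValueError(
            f"cell ({row}, {col}) is a non-anchor slot of a merge; "
            f"write to anchor {anchor} instead"
        )
    # Legacy behavior: redirect to the anchor, but make the redirect visible.
    warnings.warn(f"redirecting write from ({row}, {col}) to anchor {anchor}")
    return anchor


merges = [{"anchor": (0, 0), "span": (2, 3)}]
direct = resolve_write_target(merges, 0, 0)
redirected = None
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    redirected = resolve_write_target(merges, 1, 2, allow_merge_redirect=True)
```

Raising by default keeps a mistaken coordinate from silently overwriting a merged header, which is the failure mode this release set out to fix.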
Verification
Tested against an 865 KB real government HWPX file with:
- 188 tables
- 44 tables with merges
- Most complex: a 22×43 grid with 70 merges
- Full 27×17 포상금 지급신청서 (reward payment application) form round-tripped through `set_cell` across 24 fields with all 58 merge regions preserved.
23 automated tests pass, including 9 new HWPX-specific regression tests.
Migration notes
- `TableSchema.preview` type changed from `list[list[str]]` to `list[list[str | None]]`. If you dereference preview slots, handle `None`.
- Set `allow_merge_redirect=True` on `set_cell` to restore pre-0.2 behavior of writing merged-slot coordinates to the anchor.
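Code that walked every preview slot as a string needs a `None` check after this change. A minimal sketch, assuming the preview is a plain nested list as described above; `anchor_slots` is a hypothetical helper, not part of the package.

```python
from typing import Iterator, List, Optional, Tuple

Preview = List[List[Optional[str]]]


def anchor_slots(preview: Preview) -> Iterator[Tuple[int, int, str]]:
    """Yield (row, col, text) for writable slots, skipping non-anchor merge slots."""
    for r, row in enumerate(preview):
        for c, text in enumerate(row):
            if text is None:
                # Non-anchor slot of a merge: nothing to read or write here.
                continue
            yield r, c, text


preview = [
    ["Applicant", None, "Date"],
    ["", None, ""],
]
slots = list(anchor_slots(preview))
```

Note that an empty string is still a real, writable cell; only `None` marks a slot covered by a merge.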
Install
pip install -U document-adapter==0.2.0
v0.1.2 — preserve empty-cell formatting
Fix
Empty cells still lost formatting in v0.1.1.
The previous fix only covered cells that already had runs. Empty cells still fell back to cell.text = value, which silently dropped font information stored in <a:endParaRPr> (PPTX) and <w:pPr><w:rPr> (DOCX). Real-world templates stash font/size/bold there so unfilled cells still render with the template style — v0.1.1 was writing into them but losing that style in the process.
PPTX empty-cell path
When the text frame has no runs, manually build an <a:r> inside the existing <a:p>, clone <a:endParaRPr> into the new run's <a:rPr>, and write the value into <a:t>. Preserves lang, sz, b, latin typeface, etc.
<!-- before set_cell -->
<a:p>
<a:endParaRPr lang="en-US" sz="1800" b="1">
<a:latin typeface="Microsoft Sans Serif"/>
</a:endParaRPr>
</a:p>
<!-- after v0.1.2 set_cell(..., "V-2024-001") -->
<a:p>
<a:endParaRPr lang="en-US" sz="1800" b="1">
<a:latin typeface="Microsoft Sans Serif"/>
</a:endParaRPr>
<a:r>
<a:rPr lang="en-US" sz="1800" b="1">
<a:latin typeface="Microsoft Sans Serif"/>
</a:rPr>
<a:t>V-2024-001</a:t>
</a:r>
</a:p>
DOCX empty-cell path
Use paragraphs[0].add_run(value) instead of cell.text, then clone <w:pPr><w:rPr> into the new run's <w:rPr>. Preserves rFonts, sz, bold that templates stash on the paragraph marker.
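The cloning step can be sketched at the XML level with the standard library alone. This is an illustration under stated assumptions rather than the adapter's code: it uses `xml.etree.ElementTree` directly instead of python-docx, and `fill_empty_paragraph` is a hypothetical name.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def fill_empty_paragraph(paragraph, value):
    """Add a run to a run-less <w:p>, cloning the paragraph-marker <w:rPr> into it."""
    ppr = paragraph.find(f"{{{W}}}pPr")
    marker_rpr = ppr.find(f"{{{W}}}rPr") if ppr is not None else None

    run = ET.SubElement(paragraph, f"{{{W}}}r")
    if marker_rpr is not None:
        # Deep copy so the template's paragraph-marker formatting stays untouched.
        run.append(copy.deepcopy(marker_rpr))
    text = ET.SubElement(run, f"{{{W}}}t")
    text.text = value
    return run


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    '<w:pPr><w:rPr><w:rFonts w:ascii="Malgun Gothic"/><w:b/><w:sz w:val="36"/></w:rPr></w:pPr>'
    "</w:p>"
)
run = fill_empty_paragraph(paragraph, "V-2024-001")
```

The copy is appended before `<w:t>` so that `<w:rPr>` remains the first child of the run, as WordprocessingML expects.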
Tests
- `test_pptx_empty_cell_preserves_endpararpr`: crafts a cell with only `<a:endParaRPr>` (sz=1800, b=1, Microsoft Sans Serif), calls `set_cell`, asserts the new run carries the cloned `rPr`
- `test_docx_empty_cell_preserves_ppr_rpr`: crafts a cell with only `<w:pPr><w:rPr>` (Malgun Gothic, sz=36, bold), calls `set_cell`, asserts the new run carries the cloned `rPr`
All 13 smoke tests pass. Reverting either fix to the old `cell.text` fallback makes the new tests fail with "new run lost its rPr", which confirms that they catch the regression.
Upgrade
pip install --upgrade document-adapter # >= 0.1.2
If you were keeping a downstream monkey-patch for the empty-cell case, you can now remove it.
v0.1.1 — preserve run formatting in set_cell/append_row
Fix
#1 — set_cell and append_row dropped run-level formatting in DOCX and PPTX.
`python-docx` and `python-pptx`'s `cell.text = value` setter deletes every run and creates a fresh one with default font/size, so any template formatting (font name, size, bold, color) was lost whenever the adapter modified a cell. This release mirrors `HwpxAdapter`'s run-preserving strategy: reuse the first existing run and blank the rest, falling back to the default setter only when the cell is truly empty.
Also fixes a subtle bug where `python-pptx` returns a fresh Python wrapper on every `paragraphs` access, causing `para is first_para` comparisons to fail and the "clear other paragraphs" loop to erase the run we just populated. The fix compares paragraphs by index instead.
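Both points can be reproduced with plain Python objects. The sketch below is not the adapter's code: `ParagraphView`, `TextFrame`, and `set_text_keep_format` are hypothetical stand-ins, runs are modeled as dicts, and the `paragraphs` property deliberately returns fresh wrappers on every access to mimic the behavior described above.

```python
class ParagraphView:
    """Throwaway wrapper around shared run data (stand-in for a library proxy)."""

    def __init__(self, runs):
        self.runs = runs  # list of dicts shared with the owning frame


class TextFrame:
    def __init__(self, paragraphs_data):
        self._data = paragraphs_data

    @property
    def paragraphs(self):
        # A new wrapper per access, so identity checks between accesses fail.
        return [ParagraphView(runs) for runs in self._data]


def set_text_keep_format(frame, value):
    paragraphs = frame.paragraphs
    if not paragraphs or not paragraphs[0].runs:
        return False  # truly empty: the caller needs a different path
    first = paragraphs[0]
    first.runs[0]["text"] = value  # reuse the first run, keeping its formatting
    for run in first.runs[1:]:
        run["text"] = ""
    for index, para in enumerate(frame.paragraphs):
        if index == 0:  # `para is first` would be False here, so compare by index
            continue
        for run in para.runs:
            run["text"] = ""
    return True


frame = TextFrame([
    [{"text": "old", "font": "Malgun Gothic", "bold": True}, {"text": " tail"}],
    [{"text": "second paragraph"}],
])
same_wrapper = frame.paragraphs[0] is frame.paragraphs[0]
written = set_text_keep_format(frame, "new")
```

With an identity comparison in place of `index == 0`, the second loop would also blank the run that was just populated, which is the failure this release fixed.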
Tests
- `test_docx_set_cell_preserves_font` — verifies Malgun Gothic / 18pt / bold survive set_cell
- `test_pptx_set_cell_preserves_font` — same for PPTX
- `test_docx_append_row_preserves_formatting` — verifies new rows inherit the template row formatting
Total: 11 smoke tests passing.
Install
```bash
pip install --upgrade document-adapter
```
Upgrade guidance
If you were running a `cell.text`/monkey-patch workaround downstream (e.g. xgen-workflow), you can remove it after upgrading.
Closes #1.