Round-tripping DOCX without losing your sanity (or your tables)
A field guide to the import-export pipeline, including the parts we wish someone had told us first.
DOCX isn't a single file so much as a ZIP archive of XML nightmares held together by twenty years of Microsoft compatibility promises. We learned that the hard way so you don't have to.
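To make the "zip archive of XML" point concrete, here is a minimal sketch using only Python's standard library. It builds a tiny stand-in package in memory and lists its parts. The part names follow the usual DOCX layout, but the XML bodies are placeholders, so this is an illustration of the container and not a document Word would open.

```python
import io
import zipfile


def build_minimal_package() -> bytes:
    """Build a stand-in DOCX-shaped ZIP in memory (placeholder XML, not a valid document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/styles.xml", "<styles/>")
    return buf.getvalue()


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the sorted names of every part stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return sorted(zf.namelist())


parts = list_parts(build_minimal_package())
for name in parts:
    print(name)
```

Pointing `zipfile.ZipFile` at a real `.docx` path works the same way; the main body text normally lives in `word/document.xml`.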
Import: PDF and DOCX → structured editor state
Documatch ingests through a Python pipeline that maps Office Open XML and PDF layout into Lexical nodes: headings, lists, tables, footnotes, and inline styles become first-class structure — not a flattened text blob.
- We surface a retention score on import so you know what fidelity you're working with.
- Nested lists and multi-column tables are the usual trouble spots; we flag them in the ingest report.
- Scanned PDFs route through OCR, with low-confidence markers on uncertain blocks.
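The formula behind the retention score isn't spelled out here, so the following is only a sketch of one plausible shape for it: the share of structural elements counted in the source that still exist after import. The function names, the element kinds, and the counting rule are all assumptions made for illustration.

```python
def retention_score(source_counts: dict[str, int], imported_counts: dict[str, int]) -> float:
    """Hypothetical score: fraction of source elements that survived import (0.0 to 1.0)."""
    total = sum(source_counts.values())
    if total == 0:
        return 1.0  # nothing to lose, so nothing was lost
    kept = sum(
        min(count, imported_counts.get(kind, 0))
        for kind, count in source_counts.items()
    )
    return kept / total


def trouble_spots(source_counts: dict[str, int], imported_counts: dict[str, int]) -> list[str]:
    """Hypothetical ingest-report helper: element kinds where import kept fewer than the source had."""
    return sorted(
        kind for kind, count in source_counts.items()
        if imported_counts.get(kind, 0) < count
    )


# Made-up counts for illustration only.
source = {"heading": 4, "list_item": 10, "table": 2, "footnote": 4}
imported = {"heading": 4, "list_item": 9, "table": 1, "footnote": 4}

score = retention_score(source, imported)
flags = trouble_spots(source, imported)
print(f"retention: {score:.0%}, flagged: {flags}")
```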
Edit in the middle
The editor stores canonical state as Lexical JSON. AI edits propose hunks against that structure, so a table cell change stays a table cell change — not a new paragraph that used to be row 4, column 2.
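A rough sketch of what a structural hunk could look like, under stated assumptions: the tree below is a simplified Lexical-style JSON state (real serialized cells carry more fields and usually wrap text in paragraphs), and the `path`/`old`/`new` hunk shape and `apply_hunk` function are hypothetical names chosen for this example.

```python
import copy

# Simplified Lexical-style state: one paragraph, then a 2x2 table.
state = {
    "root": {"type": "root", "children": [
        {"type": "paragraph", "children": [{"type": "text", "text": "Fee schedule"}]},
        {"type": "table", "children": [
            {"type": "tablerow", "children": [
                {"type": "tablecell", "children": [{"type": "text", "text": "Item"}]},
                {"type": "tablecell", "children": [{"type": "text", "text": "Rate"}]},
            ]},
            {"type": "tablerow", "children": [
                {"type": "tablecell", "children": [{"type": "text", "text": "Partner review"}]},
                {"type": "tablecell", "children": [{"type": "text", "text": "$400"}]},
            ]},
        ]},
    ]}
}


def apply_hunk(current: dict, hunk: dict) -> dict:
    """Apply a text change addressed by child indexes; the input state is left untouched."""
    new_state = copy.deepcopy(current)
    node = new_state["root"]
    for index in hunk["path"]:
        node = node["children"][index]
    # Refuse to apply if the target is not the text the hunk was written against.
    if node.get("type") != "text" or node.get("text") != hunk["old"]:
        raise ValueError("hunk does not match current state")
    node["text"] = hunk["new"]
    return new_state


# Path: table (1) -> second row (1) -> second cell (1) -> its text node (0).
hunk = {"path": [1, 1, 1, 0], "old": "$400", "new": "$450"}
updated = apply_hunk(state, hunk)
```

Because the hunk addresses a node inside the table, the row and cell structure around it is untouched: only the text of that one cell changes.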
Export: back to DOCX
Export walks the Lexical tree and emits OOXML via our export package. Styles map back to Word-native equivalents where possible; custom firm styles are preserved when they were present in the source file.
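As an illustration of the tree walk, here is a minimal sketch that turns headings and paragraphs from a simplified Lexical-style state into WordprocessingML elements with the standard library. It is not the export package itself: `STYLE_MAP`, `emit_paragraph`, and `export_body` are hypothetical names, and tables, lists, and inline formatting are deliberately skipped.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


# Assumed mapping from heading tags to Word's built-in style IDs.
STYLE_MAP = {"h1": "Heading1", "h2": "Heading2"}


def emit_paragraph(body: ET.Element, node: dict) -> None:
    """Emit one w:p, with a w:pStyle when the node maps to a Word-native style."""
    p = ET.SubElement(body, w("p"))
    style_id = STYLE_MAP.get(node.get("tag", ""))
    if style_id:
        ppr = ET.SubElement(p, w("pPr"))
        ET.SubElement(ppr, w("pStyle"), {w("val"): style_id})
    for child in node.get("children", []):
        if child.get("type") == "text":
            run = ET.SubElement(p, w("r"))
            ET.SubElement(run, w("t")).text = child["text"]


def export_body(state: dict) -> ET.Element:
    """Walk top-level nodes and build a w:body; unsupported node types are skipped here."""
    body = ET.Element(w("body"))
    for node in state["root"]["children"]:
        if node["type"] in ("heading", "paragraph"):
            emit_paragraph(body, node)
    return body


state = {"root": {"type": "root", "children": [
    {"type": "heading", "tag": "h1", "children": [{"type": "text", "text": "Summary"}]},
    {"type": "paragraph", "children": [{"type": "text", "text": "Terms are attached."}]},
]}}

body = export_body(state)
print(ET.tostring(body, encoding="unicode"))
```

A full exporter would also write the surrounding package parts (content types, relationships, styles); this sketch covers only the body fragment.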
What still surprises people
- Floating text boxes from ancient templates may flatten — we document this in release notes.
- Complex field codes (SEQ, REF) round-trip on simple engagements; heavily cross-referenced briefs may need their fields updated once in Word (select all, then press F9).
- Fonts fall back to Calibri if the original wasn't embedded and isn't installed on our server.
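The font rule in the last bullet can be sketched in a few lines. This is an assumed shape, not the production logic: `resolve_font` and `substitutions` are hypothetical helpers, and the font sets are made up for the example.

```python
FALLBACK_FONT = "Calibri"


def resolve_font(requested: str, embedded: set[str], installed: set[str]) -> str:
    """Keep the requested font if it was embedded in the source or is installed; else fall back."""
    if requested in embedded or requested in installed:
        return requested
    return FALLBACK_FONT


def substitutions(requested_fonts: list[str], embedded: set[str], installed: set[str]) -> dict[str, str]:
    """List only the fonts that would be swapped, so the swap can be reported."""
    return {
        font: FALLBACK_FONT
        for font in requested_fonts
        if resolve_font(font, embedded, installed) != font
    }


# Made-up inputs for illustration only.
embedded_fonts = {"Firm Serif"}
server_fonts = {"Calibri", "Times New Roman"}
swapped = substitutions(["Firm Serif", "Times New Roman", "Garamond Premier"], embedded_fonts, server_fonts)
print(swapped)
```

Surfacing the substitution list, instead of swapping silently, makes it easier to spot a layout shift before a document goes out.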
The goal isn't pixel-perfect Word recreation — it's partner-ready deliverables without a reformatting weekend. If export breaks your table of authorities, tell us; those edge cases are how the pipeline gets better.