This document describes how Docxify works and how it turns an HTML document into a docx document. The conversion runs in the following steps:
- Parse HTML using htmlparser2
- Transform to docx hierarchy
- "Serialize" tags with their own "serializer".
- Pack up the docx from the serialized data
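
The "one serializer per tag" step could look roughly like the sketch below. The `DocxNode` shape, the `serializers` map, and the tag names `paragraph` and `run` are assumptions made for illustration and are not taken from Docxify's code. Only the `w:p`, `w:r`, and `w:t` element names come from the docx (WordprocessingML) format, and namespace declarations are left out for brevity.

```typescript
// Hypothetical node shape; Docxify's real types may differ.
interface DocxNode {
  tag: string; // e.g. "paragraph" or "run" (illustrative names)
  text?: string;
  children: DocxNode[];
}

// A serializer receives the node and a callback that serializes its children.
type Serializer = (node: DocxNode, serializeChildren: () => string) => string;

// Escape characters that are special in XML text content.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Registry: each tag gets its own serializer.
const serializers = new Map<string, Serializer>([
  ["paragraph", (_node, kids) => `<w:p>${kids()}</w:p>`],
  ["run", (node) => `<w:r><w:t>${escapeXml(node.text ?? "")}</w:t></w:r>`],
]);

// Look up the serializer for a node and apply it recursively.
function serialize(node: DocxNode): string {
  const serializer = serializers.get(node.tag);
  if (serializer === undefined) {
    throw new Error(`No serializer registered for "${node.tag}"`);
  }
  return serializer(node, () => node.children.map(serialize).join(""));
}

const xml = serialize({
  tag: "paragraph",
  children: [{ tag: "run", text: "Tom & Jerry", children: [] }],
});
// xml is "<w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>"
```

Keeping the serializers in a lookup table means supporting a new tag only requires registering one more function, and an unknown tag fails loudly instead of being dropped silently.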
Even though docx uses XML, we cannot directly translate an HTML tag to a corresponding XML tag for docx. Docx also needs a strict hierarchy, which is outlined in the following list:
- section
  - block (paragraph/table)
    - inline
This means that if there is a <p> inside of a <p>, the inner <p> must be moved "up" while still maintaining its state (font size, text color, etc.).
This needs to be done such that all paragraphs are on the same level.
Inline elements might also need to be at the same level, with some exceptions. For example, an anchor (<a>) can still have inline children.
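
As a minimal sketch of moving nested paragraphs "up", the function below flattens a <p> that contains other <p> elements into a list of sibling paragraphs, while each paragraph keeps the style it would have inherited. All type and function names here (`Style`, `ParagraphNode`, `flattenParagraph`, and so on) are hypothetical and only illustrate the idea; the actual Docxify transform may work differently and also has to handle tables and inline elements.

```typescript
// Hypothetical types for illustration; not Docxify's actual data model.
interface Style {
  fontSize?: number;
  color?: string;
}

interface TextNode {
  kind: "text";
  text: string;
}

interface ParagraphNode {
  kind: "p";
  style: Style;
  children: Array<TextNode | ParagraphNode>;
}

interface FlatParagraph {
  style: Style; // fully resolved style, including inherited values
  text: string;
}

// Flatten nested <p> elements so that all paragraphs end up on one level.
function flattenParagraph(
  node: ParagraphNode,
  inherited: Style = {},
): FlatParagraph[] {
  // The node's own style overrides whatever it inherits from its parent.
  const style: Style = { ...inherited, ...node.style };
  const result: FlatParagraph[] = [];
  let buffer = "";

  // Emit the text collected so far as its own paragraph.
  const flush = (): void => {
    if (buffer !== "") {
      result.push({ style, text: buffer });
      buffer = "";
    }
  };

  for (const child of node.children) {
    if (child.kind === "text") {
      buffer += child.text;
    } else {
      // A nested <p> splits the outer paragraph and becomes a sibling.
      flush();
      result.push(...flattenParagraph(child, style));
    }
  }
  flush();
  return result;
}

const outer: ParagraphNode = {
  kind: "p",
  style: { fontSize: 12, color: "black" },
  children: [
    { kind: "text", text: "before " },
    {
      kind: "p",
      style: { color: "red" },
      children: [{ kind: "text", text: "inner" }],
    },
    { kind: "text", text: " after" },
  ],
};

const flat = flattenParagraph(outer);
// flat holds three sibling paragraphs: "before ", "inner", " after".
// The inner one keeps the inherited fontSize 12 but uses its own color "red".
```

Passing the resolved style down as a parameter is what preserves the state: the inner paragraph no longer has a parent in the output, so everything it would have inherited has to be copied onto it.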
