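Automating Word Document Style Copying and Content Generation with Python

In office automation, copying the styles and content of Word documents efficiently is a common requirement. This article walks through a complete code example that uses Python's python-docx library to deep-copy Word document styles and generate content dynamically, and it draws on best practices from the knowledge base to streamline the document-processing workflow.

I. Why Automate Word Document Processing?

Handling Word documents by hand (copying styles, inserting tables or images) is slow and error-prone. Python offers several libraries for automating these tasks, such as python-docx, pywin32, and Spire.Doc. python-docx, for example, works directly on the paragraphs, tables, and styles of a .docx file and does not require Microsoft Office to be installed.

II. Core Functionality: Deep-Copying Styles and Tables

1. Copying a table (style and content)

The clone_table function below copies a table's structure, style, and content: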

def clone_table(old_table, new_doc):
    """Create a table in new_doc that mirrors old_table."""
    # Create a table with the same number of rows and columns as the original
    new_table = new_doc.add_table(rows=len(old_table.rows), cols=len(old_table.columns))

    # Carry over the table style (borders, shading, and so on)
    if old_table.style:
        new_table.style = old_table.style

    # Walk every cell and copy its content and formatting
    for i, old_row in enumerate(old_table.rows):
        for j, old_cell in enumerate(old_row.cells):
            new_cell = new_table.cell(i, j)
            # Remove the empty paragraph that every new cell starts with
            for paragraph in new_cell.paragraphs:
                new_cell._element.remove(paragraph._element)
            # Copy paragraphs, runs, and run formatting
            for old_paragraph in old_cell.paragraphs:
                new_paragraph = new_cell.add_paragraph()
                for old_run in old_paragraph.runs:
                    new_run = new_paragraph.add_run(old_run.text)
                    copy_paragraph_style(old_run, new_run)  # helper that copies run formatting
                new_paragraph.alignment = old_paragraph.alignment
            copy_cell_borders(old_cell, new_cell)  # copy the cell borders

    # Copy column widths
    for i, col in enumerate(old_table.columns):
        if col.width is not None:
            new_table.columns[i].width = col.width

    return new_table

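Key points:

Keeping the table style: new_table.style = old_table.style reuses the original table's style. python-docx stores this as a reference to the style, so the new document must define a style with the same ID for the formatting to appear.
Handling cell content and formatting separately: the default paragraph of each new cell is removed first, then text and formatting are copied paragraph by paragraph and run by run.
Borders and column widths: copy_cell_borders and the column-width assignments keep the result visually consistent with the original.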

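2. Copying document-wide styles and generating content

The clone_document function below applies the styles extracted from a template document and fills in the content dynamically: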

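def clone_document(old_s, old_p, old_ws, new_doc_path):
    new_doc = Document()  # start from a blank document

    # Fill in the content entry by entry
    for para in old_p:
        if para == "br":
            # Page-break marker written by the extraction step
            new_doc.add_page_break()
            continue
        k, v = para["sn"], para["ct"]  # each entry holds a style name (sn) and content (ct)

        # For image entries, ct is the image part name and also a key in the owning style dict
        is_image = isinstance(v, str) and "image" in v
        image_owner = next((s for s in old_s if v in s), None) if is_image else None
        if image_owner is not None:
            # Insert the image (see copy_inline_shapes in the full listing)
            copy_inline_shapes(new_doc, k, image_owner[v])
        elif k == "table":
            # Insert the table, which is stored as HTML (see html_table_to_docx)
            html_table_to_docx(new_doc, v)
        else:
            # Ordinary paragraph: look up its style in the source and in the template
            style = [i for i in old_s if i.get("style", {}).get("sn") == k]
            style_ws = [i for i in old_ws if i.get("style", {}).get("sn") == k]
            if not style:
                new_doc.add_paragraph(v)  # unknown style name: fall back to a plain paragraph
                continue
            clone_paragraph(style[0], v, new_doc, (style_ws or style)[0])

    new_doc.save(new_doc_path)  # save the new document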

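Data structures:

old_s: style entries extracted from the source document (1.docx in the example below), covering each paragraph's style, run formatting, and alignment.
old_p: the content entries, each pairing a style name (sn) with the actual content (ct).
old_ws: style entries extracted the same way from the template document (demo_template.docx); they supply the paragraph styles applied in the output.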
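Because old_p often comes from an external JSON file, it can help to check its shape before building the document. The sketch below is one way to do that. The function name validate_entries is hypothetical. The sample style names follow the "<style name>_<alignment>" pattern built by the extraction code in the full listing, and "br" is the page-break marker that code appends.

```python
def validate_entries(entries):
    """Return a list of problems found in a content list shaped like old_p."""
    problems = []
    for index, entry in enumerate(entries):
        if entry == "br":  # page-break marker, stored as a bare string
            continue
        if not isinstance(entry, dict):
            problems.append(
                f"entry {index}: expected a dict or 'br', got {type(entry).__name__}"
            )
            continue
        for key in ("sn", "ct"):
            if not isinstance(entry.get(key), str):
                problems.append(f"entry {index}: '{key}' must be a string")
    return problems


sample = [
    {"sn": "Heading 1_None", "ct": "Quarterly report"},
    {"sn": "Normal_None", "ct": "Revenue grew in every region."},
    "br",
    {"sn": "table", "ct": "<table><tr><td>Q1</td></tr></table>"},
]
problems = validate_entries(sample)
```

Running the check right after loading the JSON turns a later KeyError or TypeError inside clone_document into a readable message.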

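III. Complete Walkthrough

1. Installing dependencies

Install python-docx first, along with beautifulsoup4, which the table helpers in the full listing import as bs4:

pip install python-docx beautifulsoup4

2. Helper functions

The functions above rely on the following helpers, whose full source is listed at the end of this article:

copy_paragraph_style: copies run-level formatting (font, size, color, bold, and so on).
copy_cell_borders: copies a cell's border settings.
get_para_style: extracts style entries and content entries from a document (an alias for clone_document in the wan_neng_copy_word module).
html_table_to_docx: converts an HTML table into a Word table.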
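The trickiest part of html_table_to_docx is deciding which grid position each HTML cell lands in once colspan and rowspan are involved. The sketch below isolates that bookkeeping in plain Python, without python-docx or BeautifulSoup. The function name place_cells and the (text, colspan, rowspan) tuple layout are assumptions made for illustration. As in the full listing, the column count is estimated as the widest row.

```python
def place_cells(rows):
    """Assign a grid position to every cell of a table with merged cells.

    rows is a list of rows; each cell is a (text, colspan, rowspan) tuple.
    Returns (n_cols, placements), where each placement is
    (row, col, colspan, rowspan, text).
    """
    n_rows = len(rows)
    n_cols = max((sum(cell[1] for cell in row) for row in rows), default=0)
    used = [[False] * n_cols for _ in range(n_rows)]
    placements = []
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already covered by a rowspan from an earlier row
            while c < n_cols and used[r][c]:
                c += 1
            if c >= n_cols:
                break
            # Mark every position the cell covers
            for rr in range(r, min(r + rowspan, n_rows)):
                for cc in range(c, min(c + colspan, n_cols)):
                    used[rr][cc] = True
            placements.append((r, c, colspan, rowspan, text))
            c += colspan
    return n_cols, placements


n_cols, placements = place_cells([
    [("A", 1, 2), ("B", 2, 1)],  # A spans two rows, B spans two columns
    [("C", 1, 1), ("D", 1, 1)],  # C and D fill the space left beside A
])
```

In this example "C" lands in column 1 rather than column 0, because column 0 of the second row is still occupied by "A".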

3. Main program

import json

if __name__ == "__main__":
    # Extract style entries from the template and from the source document
    body_ws, _ = get_para_style('demo_template.docx')
    body_s, body_p = get_para_style("1.docx")

    # Replace the extracted content with the entries stored in a JSON file
    with open("1.json", "r", encoding="utf-8") as f:
        body_p = json.load(f)

    # Generate the new document
    clone_document(body_s, body_p, body_ws, 'cloned_example.docx')
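The main program reads its content from 1.json, so something has to write that file first. The following sketch shows one way to save and reload a content list with the standard library. The names save_entries and load_entries are hypothetical, and the sample entries only illustrate the sn/ct layout.

```python
import json
import tempfile
from pathlib import Path


def save_entries(entries, path):
    """Write content entries as UTF-8 JSON, keeping non-ASCII text readable."""
    text = json.dumps(entries, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def load_entries(path):
    """Read content entries back and check that the top level is a list."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of content entries")
    return entries


entries = [
    {"sn": "Heading 1_None", "ct": "季度报告"},
    {"sn": "Normal_None", "ct": "Revenue grew in every region."},
    "br",
]
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "1.json"
    save_entries(entries, json_path)
    raw_text = json_path.read_text(encoding="utf-8")
    loaded = load_entries(json_path)
```

Passing ensure_ascii=False keeps Chinese text legible in the file, which makes it easier to edit the entries by hand or hand them to another tool.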

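IV. Practical Use Cases

Automatic report generation
Combine template styles with data pulled from a database to produce standardized reports.
Batch document processing
Convert multiple Excel sheets into Word documents in bulk (see the knowledge-base material on using pywin32 together with python-docx).
Blog content migration
Save the Word document as HTML, then import it into ZBlog or WordPress by following the steps in the knowledge base (see knowledge-base entries [2] and [5]).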
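For the report-generation case, the database rows have to be turned into content entries before clone_document can use them. Here is a minimal sketch, assuming each row arrives as a dict. The function name records_to_entries and the default style names are illustrative, and the style names only work if matching entries exist in the extracted styles.

```python
def records_to_entries(title, records,
                       heading_style="Heading 1_None",
                       body_style="Normal_None"):
    """Turn a title and a list of row dicts into sn/ct content entries."""
    entries = [{"sn": heading_style, "ct": title}]
    for record in records:
        # One paragraph per row, formatted as "column: value" pairs
        line = ", ".join(f"{key}: {value}" for key, value in record.items())
        entries.append({"sn": body_style, "ct": line})
    return entries


report_entries = records_to_entries(
    "Sales",
    [{"region": "North", "total": 120}, {"region": "South", "total": 95}],
)
```

The resulting list can be passed to clone_document as old_p, or saved as JSON for a later run.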

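V. Common Problems and Optimization Tips

1. Lost styles

Cause: Word styles often rely on implicit inheritance, so a paragraph or run may carry no explicit formatting of its own.
Solution: set the style explicitly through python-docx's style attribute, or use Spire.Doc for more complex style handling (see knowledge-base entry [7]).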

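2. Images or tables fail to embed

Cause: a wrong path, or a resource that was not loaded correctly.
Solution: make image paths absolute, or specify the size explicitly with docx.shared.Inches.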
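A small path check before inserting an image catches most of these failures early. The sketch below resolves a possibly relative path against a base directory and fails with a clear message if the file is missing. The function name resolve_image_path is hypothetical, and the demo uses a temporary directory with a placeholder file.

```python
import tempfile
from pathlib import Path


def resolve_image_path(image_path, base_dir):
    """Return an absolute path to an existing image file, or raise FileNotFoundError."""
    path = Path(image_path)
    if not path.is_absolute():
        path = Path(base_dir) / path
    path = path.resolve()
    if not path.is_file():
        raise FileNotFoundError(f"image not found: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "logo.png").write_bytes(b"placeholder")
    found = resolve_image_path("logo.png", tmp)
    found_is_absolute = found.is_absolute()
    found_name = found.name
    try:
        resolve_image_path("missing.png", tmp)
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = str(exc)
```

The returned path can then be passed as a string to add_picture together with an explicit width such as Inches(2).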

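3. Performance

Large documents: keep per-paragraph work to a minimum. python-docx has no bulk-insert call, so the practical saving is to read doc.paragraphs, doc.tables, or row.cells once and reuse the result instead of re-evaluating them inside a loop.
Memory management: release Document objects as soon as they are no longer needed (for example, doc = None).

VI. Summary

With the code examples and explanations in this article, you have seen how to deep-copy Word document styles and generate content dynamically with Python. Combined with other techniques from the knowledge base (such as ZBlog import and Office automation), this can grow into a fully automated document workflow.

I hope this post helps you automate your documents efficiently! If you need further optimization or extra features, feel free to leave a comment.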



# ---- wan_neng_copy_word.py: extract styles and content from a .docx file ----
from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_BREAK
from docx.oxml import OxmlElement
from docx.oxml.ns import qn


def docx_table_to_html(word_table):
    """Convert a python-docx table to an HTML <table> string, keeping merges and basic cell styling."""
    soup = BeautifulSoup(features='html.parser')
    html_table = soup.new_tag('table', style="border-collapse: collapse;")
    css_align = {
        WD_ALIGN_PARAGRAPH.LEFT: 'left',
        WD_ALIGN_PARAGRAPH.CENTER: 'center',
        WD_ALIGN_PARAGRAPH.RIGHT: 'right',
    }
    css_border = {'single': 'solid', 'double': 'double', 'dashed': 'dashed', 'dotted': 'dotted'}

    # row.cells repeats a merged cell at every grid position it covers, so merged
    # regions are detected by comparing the underlying <w:tc> elements.
    grid = [row.cells for row in word_table.rows]

    for row_idx, row_cells in enumerate(grid):
        html_tr = soup.new_tag('tr')

        col_idx = 0
        while col_idx < len(row_cells):
            cell = row_cells[col_idx]
            tc = cell._tc

            # Skip positions covered by a vertical merge that started in a row above
            if (row_idx > 0 and col_idx < len(grid[row_idx - 1])
                    and grid[row_idx - 1][col_idx]._tc is tc):
                col_idx += 1
                continue

            td = soup.new_tag('td')

            # Text content
            td.string = cell.text.strip()

            # Inline CSS collected for this cell
            td_style = ''
            colspan = 1

            # Horizontal alignment is a paragraph property, so read it from the first paragraph
            alignment = cell.paragraphs[0].alignment if cell.paragraphs else None
            if alignment in css_align:
                td_style += f'text-align:{css_align[alignment]};'

            tc_pr = tc.tcPr
            if tc_pr is not None:
                # Background color
                shd = tc_pr.find(qn('w:shd'))
                if shd is not None:
                    bg_color = shd.get(qn('w:fill'))
                    if bg_color and bg_color != 'auto':
                        td_style += f'background-color:#{bg_color};'

                # Borders
                borders = tc_pr.find(qn('w:tcBorders'))
                if borders is not None:
                    for border_type in ['top', 'left', 'bottom', 'right']:
                        border = borders.find(qn(f'w:{border_type}'))
                        if border is None:
                            continue
                        val = border.get(qn('w:val'), 'single')
                        if val in ('nil', 'none'):
                            td_style += f'border-{border_type}:none;'
                            continue
                        color = border.get(qn('w:color'), '000000')
                        if color == 'auto':
                            color = '000000'
                        # w:sz on a border is measured in eighths of a point
                        width_pt = max(int(border.get(qn('w:sz'), '4')) / 8, 0.5)
                        line = css_border.get(val, 'solid')
                        td_style += f'border-{border_type}:{width_pt:g}pt {line} #{color};'

                # Horizontal merge (colspan)
                grid_span = tc_pr.find(qn('w:gridSpan'))
                if grid_span is not None:
                    colspan = int(grid_span.get(qn('w:val'), '1'))

            if colspan > 1:
                td['colspan'] = str(colspan)

            # Vertical merge (rowspan): count the rows below that map to the same <w:tc>
            rowspan = 1
            for next_cells in grid[row_idx + 1:]:
                if col_idx < len(next_cells) and next_cells[col_idx]._tc is tc:
                    rowspan += 1
                else:
                    break
            if rowspan > 1:
                td['rowspan'] = str(rowspan)

            # Apply the collected style plus default padding
            td['style'] = td_style + "padding: 5px;"
            html_tr.append(td)

            # Advance past every column this cell covers
            col_idx += colspan

        html_table.append(html_tr)

    soup.append(html_table)
    return str(soup)


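def set_cell_background(cell, color_hex):
    """Set a cell's background color from a six-digit hex string such as '#FFCC00'."""
    color_hex = color_hex.strip().lstrip('#')
    if len(color_hex) != 6 or any(c not in '0123456789abcdefABCDEF' for c in color_hex):
        raise ValueError(f"unsupported color format: {color_hex!r}")
    shading_elm = OxmlElement('w:shd')
    shading_elm.set(qn('w:val'), 'clear')  # solid fill with no pattern
    shading_elm.set(qn('w:color'), 'auto')
    shading_elm.set(qn('w:fill'), color_hex)
    cell._tc.get_or_add_tcPr().append(shading_elm)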


def html_table_to_docx(doc, html_content):
    """
    Convert every <table> in an HTML string into a table in a Word document.
    :param doc: python-docx Document instance
    :param html_content: HTML string
    :return: the same Document instance
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    align_map = {
        'left': WD_ALIGN_PARAGRAPH.LEFT,
        'center': WD_ALIGN_PARAGRAPH.CENTER,
        'right': WD_ALIGN_PARAGRAPH.RIGHT
    }

    for html_table in soup.find_all('table'):
        # Number of rows
        trs = html_table.find_all('tr')
        rows = len(trs)

        # Estimate the column count as the widest row, counting colspan
        cols = 0
        for tr in trs:
            col_count = sum(int(cell.get('colspan', 1)) for cell in tr.find_all(['td', 'th']))
            cols = max(cols, col_count)
        if rows == 0 or cols == 0:
            continue  # nothing to build for an empty table

        # Create the Word table
        table = doc.add_table(rows=rows, cols=cols)
        table.style = 'Table Grid'

        # Grid positions already occupied (needed to handle merged cells)
        used_cells = [[False for _ in range(cols)] for _ in range(rows)]

        for row_idx, tr in enumerate(trs):
            col_idx = 0

            for cell in tr.find_all(['td', 'th']):
                # Skip positions covered by a rowspan from an earlier row
                while col_idx < cols and used_cells[row_idx][col_idx]:
                    col_idx += 1

                if col_idx >= cols:
                    break  # stay inside the grid

                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))
                text = cell.get_text(strip=True)

                # Parse the inline style attribute into a {property: value} dict
                css = {}
                for decl in cell.get('style', '').split(';'):
                    prop, _, value = decl.partition(':')
                    if value.strip():
                        css[prop.strip().lower()] = value.strip()

                # Alignment comes from the align attribute or from text-align
                align = cell.get('align') or css.get('text-align')
                bg_color = css.get('background-color') or css.get('background')

                word_cell = table.cell(row_idx, col_idx)

                # Merge cells when the HTML cell spans several rows or columns
                if colspan > 1 or rowspan > 1:
                    end_row = min(row_idx + rowspan - 1, rows - 1)
                    end_col = min(col_idx + colspan - 1, cols - 1)
                    word_cell = word_cell.merge(table.cell(end_row, end_col))

                # Text content
                para = word_cell.paragraphs[0]
                para.text = text

                # Alignment
                if align in align_map:
                    para.alignment = align_map[align]

                # Background color
                if bg_color:
                    try:
                        set_cell_background(word_cell, bg_color)
                    except ValueError:
                        pass  # ignore color formats that cannot be converted

                # Mark every grid position this cell covers
                for r in range(row_idx, min(row_idx + rowspan, rows)):
                    for c in range(col_idx, min(col_idx + colspan, cols)):
                        used_cells[r][c] = True

                # Move to the next free column
                col_idx += colspan

        # Add an empty paragraph as a separator after the table
        doc.add_paragraph()

    return doc


def copy_inline_shapes(old_paragraph):
    """Collect the inline images of a paragraph as [bytes, name, width, height] lists."""
    images = []
    for shape in old_paragraph._element.xpath('.//w:drawing'):
        blip = shape.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
        if blip is None:
            continue
        rId = blip.get(qn('r:embed'))
        if rId is None:
            continue  # linked rather than embedded image: nothing to copy
        image_part = old_paragraph.part.related_parts[rId]
        image_bytes = image_part.image.blob
        # Name format: "<file name>;<part name>", split again by clone_paragraph
        image_name = image_part.filename + ";" + image_part.partname
        images.append([image_bytes, image_name, image_part.image.width, image_part.image.height])
    return images


def is_page_break(element):
    """Return True if a body-level paragraph element contains a page break."""
    if element.tag != qn('w:p'):
        return False
    # <w:br> sits inside a run (<w:r>), so search descendants rather than direct children
    return any(br.get(qn('w:type')) == 'page' for br in element.iter(qn('w:br')))


def clone_paragraph(old_para):
    """Extract a paragraph's style information and its content entries."""
    # Style key: "<style name>_<alignment>", for example "Normal_None"
    style_key = old_para.style.name + "_" + str(old_para.alignment).split()[0]
    style = {
        "run_style": list(old_para.runs),
        # The paragraph style is what identifies heading levels later on
        "style": {"sn": style_key, "ct": old_para.style},
        "alignment": {"sn": style_key, "ct": old_para.alignment},
    }
    paras = [{"ct": old_run.text, "sn": style_key} for old_run in old_para.runs]

    for image in copy_inline_shapes(old_para):
        name_parts = image[1].split(";")
        file_name, part_name = name_parts[0], name_parts[-1]
        # Store each image under its own part name so it is inserted only once
        style[part_name] = [image]
        paras.append({"sn": file_name, "ct": part_name})
    return style, paras


def clone_document(old_doc_path):
    """Extract style entries and content entries from a .docx file.

    Returns (body_style, body_paras). body_paras holds one dict per run, image,
    or table, plus the string "br" wherever a page break was found.
    """
    try:
        old_doc = Document(old_doc_path)
        paragraphs = old_doc.paragraphs  # body-level paragraphs, in document order
        tables = old_doc.tables          # body-level tables, in document order
        para_index = 0
        table_index = 0

        body_style = []
        body_paras = []

        # Walk the body in document order so paragraphs and tables stay interleaved
        for element in old_doc.element.body:
            if element.tag == qn('w:p'):
                style, paras = clone_paragraph(paragraphs[para_index])
                body_style.append(style)
                body_paras += paras
                para_index += 1
                if is_page_break(element):
                    body_paras.append("br")
            elif element.tag == qn('w:tbl'):
                body_paras.append({"sn": "table", "ct": docx_table_to_html(tables[table_index])})
                table_index += 1
            # Other body children, such as w:sectPr, carry no content to extract

        return body_style, body_paras
    except Exception as e:
        print(f"Error while extracting the document: {e}")
        raise

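# Usage example
if __name__ == "__main__":
    # Extract style entries and content entries from a sample document
    body_s, body_p = clone_document('1.docx')
    print(f"Extracted {len(body_s)} paragraph styles and {len(body_p)} content entries")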

# ---- Main script: rebuild a document from the extracted styles and JSON content ----
import io
import json

from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

from wan_neng_copy_word import clone_document as get_para_style, html_table_to_docx

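def copy_inline_shapes(new_doc, image_name, img):
    """Insert the images recorded for one paragraph into a new paragraph of new_doc."""
    new_para = new_doc.add_paragraph()
    for image_bytes_src, _, w, h in img:
        try:
            # Prefer a replacement file named image_name if one exists on disk
            with open(image_name, 'rb') as f:
                image_bytes = f.read()
        except OSError:
            # Otherwise fall back to the bytes extracted from the source document
            image_bytes = image_bytes_src
        # Add the picture at the width and height recorded from the source
        new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)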


def copy_paragraph_style(run_from, run_to):
    """Copy character formatting from one run to another."""
    run_to.bold = run_from.bold
    run_to.italic = run_from.italic
    run_to.underline = run_from.underline
    run_to.font.size = run_from.font.size
    run_to.font.color.rgb = run_from.font.color.rgb
    run_to.font.name = run_from.font.name
    run_to.font.all_caps = run_from.font.all_caps
    run_to.font.strike = run_from.font.strike
    run_to.font.shadow = run_from.font.shadow


def is_page_break(element):
    """Return True if a body-level paragraph element contains a page break."""
    if element.tag != qn('w:p'):
        return False
    # <w:br> sits inside a run (<w:r>), so search descendants rather than direct children
    return any(br.get(qn('w:type')) == 'page' for br in element.iter(qn('w:br')))


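def clone_paragraph(para_style, text, new_doc, para_style_ws):
    """Add a paragraph to new_doc using the template's style and the source's formatting."""
    new_para = new_doc.add_paragraph()
    template_style = para_style_ws["style"]["ct"]
    source_style = para_style["style"]["ct"]

    # The style objects belong to other documents, so look the style up by name in new_doc
    try:
        new_para.style = new_doc.styles[template_style.name]
    except KeyError:
        pass  # style not defined in new_doc: keep the default paragraph style

    new_run = new_para.add_run(text)
    if para_style["run_style"]:
        copy_paragraph_style(para_style["run_style"][0], new_run)
    # Fall back to the source style's font size when the run has none of its own
    if new_run.font.size is None and source_style.font.size is not None:
        new_run.font.size = source_style.font.size
    new_para.alignment = para_style["alignment"]["ct"]

    return new_para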


def copy_cell_borders(old_cell, new_cell):
    """Copy a cell's border settings to another cell."""
    old_tc = old_cell._tc
    new_tc = new_cell._tc

    old_borders = old_tc.xpath('.//w:tcBorders')
    if old_borders:
        old_border = old_borders[0]
        new_border = OxmlElement('w:tcBorders')

        border_types = ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']
        for border_type in border_types:
            old_element = old_border.find(f'.//w:{border_type}', namespaces={
                'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
            })
            if old_element is not None:
                new_element = OxmlElement(f'w:{border_type}')
                for attr, value in old_element.attrib.items():
                    new_element.set(attr, value)
                new_border.append(new_element)

        tc_pr = new_tc.get_or_add_tcPr()
        tc_pr.append(new_border)


def clone_table(old_table, new_doc):
    """Create a table in new_doc that mirrors old_table."""
    new_table = new_doc.add_table(rows=len(old_table.rows), cols=len(old_table.columns))
    if old_table.style:
        new_table.style = old_table.style

    for i, old_row in enumerate(old_table.rows):
        for j, old_cell in enumerate(old_row.cells):
            new_cell = new_table.cell(i, j)
            for paragraph in new_cell.paragraphs:
                new_cell._element.remove(paragraph._element)
            for old_paragraph in old_cell.paragraphs:
                new_paragraph = new_cell.add_paragraph()
                for old_run in old_paragraph.runs:
                    new_run = new_paragraph.add_run(old_run.text)
                    copy_paragraph_style(old_run, new_run)
                new_paragraph.alignment = old_paragraph.alignment
            copy_cell_borders(old_cell, new_cell)

    for i, col in enumerate(old_table.columns):
        if col.width is not None:
            new_table.columns[i].width = col.width

    return new_table


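def clone_document(old_s, old_p, old_ws, new_doc_path):
    """Build a new document from content entries and the extracted style entries."""
    new_doc = Document()

    # Rebuild the body entry by entry
    for para in old_p:
        if para == "br":
            # Page-break marker written by the extraction step
            new_doc.add_page_break()
            continue
        k, v = para["sn"], para["ct"]

        # For image entries, ct is the image part name and also a key in the owning style dict
        is_image = isinstance(v, str) and "image" in v
        image_owner = next((s for s in old_s if v in s), None) if is_image else None
        if image_owner is not None:
            copy_inline_shapes(new_doc, k, image_owner[v])
        elif k == "table":
            html_table_to_docx(new_doc, v)
        else:
            style = [i for i in old_s if i.get("style", {}).get("sn") == k]
            style_ws = [i for i in old_ws if i.get("style", {}).get("sn") == k]
            if not style:
                new_doc.add_paragraph(v)  # unknown style name: fall back to a plain paragraph
                continue
            clone_paragraph(style[0], v, new_doc, (style_ws or style)[0])

    new_doc.save(new_doc_path)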


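# Usage example
if __name__ == "__main__":
    body_ws, _ = get_para_style('demo_template.docx')
    body_s, body_p = get_para_style("1.docx")
    # Pass body_p, or a compressed version of it, to an LLM. Compression is fine if the
    # LLM only needs the template's styles as a reference; do not compress if it needs
    # the content itself or has to revise it.
    # The LLM's output is saved as 1.json and used to generate the Word document.
    with open("1.json", "r", encoding="utf-8") as f:
        body_p = json.load(f)
    print("Loaded content entries:", body_p)
    clone_document(body_s, body_p, body_ws, 'cloned_example.docx')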

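# ---- Template generator: writes demo_template.docx with every paragraph style ----
from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH

# Create a new Word document
doc = Document()

# Collect the paragraph styles once (character, table, and list styles are skipped)
paragraph_styles = [
    style for style in doc.styles if style.type == WD_STYLE_TYPE.PARAGRAPH
]

# Report what was found
print(f"Found {len(paragraph_styles)} paragraph styles:")
for style in paragraph_styles:
    print(f"- {style.name}")

# Add a sample paragraph per style for every alignment and bold setting, so that the
# extractor finds each "<style name>_<alignment>" combination in the template
for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
    for bold_flag in [True, False]:
        for style in paragraph_styles:
            heading = doc.add_paragraph()
            run = heading.add_run(f"Style name: {style.name}")
            run.bold = bold_flag
            para = doc.add_paragraph(f"This is a sample paragraph using the '{style.name}' style.", style=style)
            para.alignment = align
            # Separator line (optional)
            doc.add_paragraph("-" * 40)

# Save as demo_template.docx
doc.save("demo_template.docx")
print("\n✅ Generated the template file containing every paragraph style: demo_template.docx")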