Mathematical formula representation learning:
Time spent: about 2 hours
After doing some work of my own, I reread the paper: MathBERT: A Pre-Trained Model for Mathematical Formula Understanding
 The paper's most important idea: "Current pre-trained models neglect the structural features and the semantic correspondence between formula and its context." (The fancy part is the attention paid to each formula's structural features, i.e., the structure of the math formula itself.)
 
 
 Validated on three downstream tasks, with strong results on all of them:

 - mathematical information retrieval
 - formula topic classification
 - formula headline generation

 Three pre-training tasks:

 1. Masked Language Modeling (MLM): learns text representations. Modeled on BERT's MLM; since the three input fields (the formula's LaTeX, its context, and its OPT, i.e. operator tree) carry complementary information, a token masked in one field can be recovered with help from the other two (a toy sketch of the example construction follows this list).
 2. Context Correspondence Prediction (CCP): captures the latent relationship between a formula and its context. Modeled on BERT's NSP; a binary classification task.
 3. Masked Substructure Prediction (MSP): captures the semantic-level structure of the formula.
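
 A toy sketch of how MLM and CCP training examples might be built (my assumptions, not the paper's exact procedure: BERT's 15% masking rate, and negative CCP pairs drawn by swapping in the context of a random other formula):

```python
import random

MASK, MLM_RATE = "[MASK]", 0.15  # masking rate assumed from BERT

def make_mlm_example(tokens):
    """Randomly mask tokens; labels hold the originals to be recovered."""
    labels = [None] * len(tokens)
    masked = list(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < MLM_RATE:
            labels[i] = tok
            masked[i] = MASK
    return masked, labels

def make_ccp_example(formula, context, all_contexts):
    """Positive pair = true context (label 1); negative = random context (0)."""
    if random.random() < 0.5:
        return formula, context, 1
    return formula, random.choice(all_contexts), 0
```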
    
 
     
    
      Pre-training dataset:
    
 
     
    
       We build a large dataset containing more than 8.7 million formula-context pairs which are extracted from scientific articles published on arXiv.org, and train MathBERT on it.
     
 
      
     
        arXiv bulk data available from Amazon S3 is the complete set of arXiv documents, which contains source TeX files and processed PDF files. "\begin{equation} ... \end{equation}" is used as the matching pattern to extract single-line display formulas from the LaTeX source in these TeX files.
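
        Since the extraction pattern is stated explicitly, it is easy to reproduce; a minimal sketch (handling only the plain `equation` environment, as described):

```python
import re

# Match single-line display formulas between \begin{equation} ... \end{equation}.
EQ_PATTERN = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)

def extract_formulas(tex_source: str):
    """Return the formula bodies found in a TeX source string."""
    return [m.strip() for m in EQ_PATTERN.findall(tex_source)]

tex = r"""
Intro text.
\begin{equation}
E = mc^2
\end{equation}
More text.
"""
print(extract_formulas(tex))  # ['E = mc^2']
```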
      
 
       The LaTeX tokenizer from the im2markup toolkit is used to tokenize the formulas separately, and the OPT translator in TangentS is used to convert the LaTeX codes into OPTs.
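
       I have not dug into im2markup's actual tokenizer, but a minimal regex-based LaTeX tokenizer sketch (purely illustrative, not im2markup's implementation) looks like this:

```python
import re

# Split a formula into LaTeX commands, brace/sub/superscript markers,
# and single characters; whitespace is dropped.
TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|[{}_^]|[^\s{}_^\\]")

def tokenize_latex(formula: str):
    return TOKEN_RE.findall(formula)

print(tokenize_latex(r"\frac{a}{b} + x^{2}"))
# ['\\frac', '{', 'a', '}', '{', 'b', '}', '+', 'x', '^', '{', '2', '}']
```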
       
 
      
       The model's backbone:
     
 
      
     
       An enhanced multi-layer bidirectional Transformer [Vaswani et al., 2017] is built as the backbone of MathBERT, modified from vanilla BERT.
     
 
      
     
       MathBERT's input:
       The formula's LaTeX tokens, the context, and the operators are concatenated together as the input of MathBERT.
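
       A sketch of what that concatenation might look like (the special-token layout is my assumption following BERT's conventions; the paper's exact segment layout may differ):

```python
def build_input(latex_tokens, context_tokens, opt_operators):
    """Concatenate the three fields into a single token sequence."""
    return (["[CLS]"] + latex_tokens + ["[SEP]"]
            + context_tokens + ["[SEP]"]
            + opt_operators + ["[SEP]"])

print(build_input(
    latex_tokens=["x", "^", "{", "2", "}"],
    context_tokens=["the", "square", "of", "x"],
    opt_operators=["^", "x", "2"],
))
```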
     
 
      
     
       Details of the attention mechanism:
       The attention mechanism in the Transformer is modified based on the structure of the OPT to enhance its ability to capture structural information.
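
       One plausible realization (my assumption, not the paper's exact formulation) is an additive attention mask that only lets OPT nodes attend along parent-child edges:

```python
import numpy as np

def opt_attention_mask(num_tokens, opt_edges):
    """opt_edges: (parent_idx, child_idx) pairs over the operator tokens."""
    mask = np.full((num_tokens, num_tokens), -np.inf)
    np.fill_diagonal(mask, 0.0)       # every token attends to itself
    for parent, child in opt_edges:
        mask[parent, child] = 0.0     # parent -> child
        mask[child, parent] = 0.0     # child -> parent
    return mask                       # added to attention logits before softmax

# OPT of x^2: '^' (index 0) is the parent of 'x' (1) and '2' (2).
print(opt_attention_mask(3, [(0, 1), (0, 2)]))
```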
     
 
      
     
       See the original paper for the full details; the architecture figure belongs here:

       [figure: MathBERT architecture]
     
 
      
     
       Ideological-political education knowledge graph:

       Time spent: about 3~5 hours
     
 
      
     
       Let me sort out the current tasks:

       1. Crawl cases that showcase the political spirit: which sites to crawl? Learn web scraping?
       2. Classify the cases: still need to study how
       3. Mount the cases onto the knowledge graph: still need to study how
     
 
      
     
       Learning web scraping:
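
       To get started, a minimal scraping sketch (the URL is a placeholder; which sites to crawl is still undecided, as noted above):

```python
import requests
from bs4 import BeautifulSoup

def fetch_link_texts(url: str):
    """Download a page and return the text of every link on it."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a")]

print(fetch_link_texts("https://example.com"))
```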
     
 
     
       Ways to put NER labels on a span of text: manual annotation; Baidu's labeling service; (jieba and HanLP accuracy is not great)
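
       For reference, the jieba route amounts to reading entity-like POS flags off jieba.posseg, which is exactly the part whose accuracy disappointed:

```python
import jieba.posseg as pseg

# Crude NER from jieba POS flags: nr = person, ns = place, nt = organization.
NER_FLAGS = {"nr": "PERSON", "ns": "LOCATION", "nt": "ORGANIZATION"}

def crude_ner(text: str):
    return [(w.word, NER_FLAGS[w.flag])
            for w in pseg.cut(text) if w.flag in NER_FLAGS]

print(crude_ner("张三在北京参加会议"))
```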
     
 
      
     
       MRE:

       Had a sharing session today, so no time for this; I can only mull over ideas on the commute.
     
 
      
     
       Self-study:

       Going home to read some of the "flower book" (Goodfellow et al.'s Deep Learning), woo-hoo!
     
 
     
















