Weekly Study Report 36
Abstract

This week's work focused on reproducing the paper "ThinkDiff". I studied its proposed alignment paradigm, which transfers the reasoning capability of a vision-language model (VLM) to a diffusion model: during training the VLM is aligned with an LLM decoder, and during inference the LLM decoder is replaced with a diffusion decoder, so that the generated images follow the reasoning logic. The server environment was configured and the necessary datasets were downloaded, but the full pipeline has not yet run successfully. In addition, I set up Claude Code and began using it to assist with reading and analyzing the code, which should make the remaining reproduction work easier.

1. Paper Reproduction

1.1 Paper Idea

ThinkDiff proposes a new alignment paradigm that transfers the reasoning capability of a vision-language model (VLM) to a diffusion model without requiring complex reasoning datasets or expensive training.

During training, the VLM processes the image and text and outputs token features. A lightweight aligner network then maps these features into the input space of an LLM decoder. The LLM decoder generates text from the aligned features, and a cross-entropy loss is computed against the ground-truth text.

During inference, the LLM decoder is replaced with a diffusion decoder, which generates images that follow the reasoning logic.

1.2 Reproduction

This week I configured the environment on the server, but it is not yet fully set up and running; most of the time went to downloading the datasets and related assets. The goal for next week is to get the pipeline running and look at the results. I also configured Claude Code in VS Code this week, and it has been quite helpful for studying the code.

Summary

This week brought stage-wise progress on the paper reproduction.
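
As a supplement to Section 1.1, the training-time alignment can be sketched as a toy example. This is only a minimal illustration under stated assumptions, not the ThinkDiff implementation: the VLM output is replaced by random feature vectors, the LLM decoder by a single frozen linear layer, and the aligner by a single trainable linear layer. All names and sizes (`D_VLM`, `D_LLM`, `VOCAB`, `train_step`, and so on) are hypothetical. The point it shows is that only the aligner receives gradient updates, driven by the cross-entropy loss between the decoder's predictions and the ground-truth tokens.

```python
import math
import random

# Hypothetical toy sizes; the real models are far larger.
D_VLM, D_LLM, VOCAB = 8, 6, 10


def linear(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Stand-in for the VLM output: one feature vector per token (assumption).
vlm_tokens = [[rng.gauss(0, 1) for _ in range(D_VLM)] for _ in range(4)]
# Stand-in for the ground-truth text, as token ids (assumption).
target_ids = [3, 1, 7, 2]

aligner_w = rand_matrix(D_VLM, D_LLM, rng)  # trainable aligner
decoder_w = rand_matrix(D_LLM, VOCAB, rng)  # frozen stand-in for the LLM decoder


def forward_loss(aligner):
    """Mean cross-entropy of the frozen decoder's predictions on aligned features."""
    total = 0.0
    for tok, target in zip(vlm_tokens, target_ids):
        aligned = linear(tok, aligner)  # VLM space -> decoder input space
        probs = softmax(linear(aligned, decoder_w))
        total += -math.log(probs[target])
    return total / len(vlm_tokens)


def train_step(aligner, lr=0.1):
    """One gradient-descent step that updates the aligner only."""
    n = len(vlm_tokens)
    grad = [[0.0] * D_LLM for _ in range(D_VLM)]
    for tok, target in zip(vlm_tokens, target_ids):
        aligned = linear(tok, aligner)
        probs = softmax(linear(aligned, decoder_w))
        # Gradient of mean cross-entropy w.r.t. the logits: (softmax - one_hot) / n
        dlogits = [(p - (1.0 if k == target else 0.0)) / n for k, p in enumerate(probs)]
        # Backpropagate through the frozen decoder to the aligned features.
        daligned = [
            sum(decoder_w[j][k] * dlogits[k] for k in range(VOCAB))
            for j in range(D_LLM)
        ]
        for i in range(D_VLM):
            for j in range(D_LLM):
                grad[i][j] += tok[i] * daligned[j]
    return [
        [aligner[i][j] - lr * grad[i][j] for j in range(D_LLM)]
        for i in range(D_VLM)
    ]


if __name__ == "__main__":
    w = aligner_w
    print("loss before training:", forward_loss(w))
    for _ in range(20):
        w = train_step(w)
    print("loss after 20 steps: ", forward_loss(w))
```

At inference time, the paper's idea is that the aligned features are passed to a diffusion decoder instead of the LLM decoder; that swap is not modeled in this sketch.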