NLF loss study notes


Contents

Datasets:

Projecting 3D to 2D for an additional loss

reconstruct_absolute

1. Overview

2. Parameters

3. Comparison of the two reconstruction modes


Datasets:

agora3 | 5264/5264 [00:00<00:00, 143146.78it/s]

behave 37736/37736 [00:00<00:00, 76669.67it/s]

mads 32649/32649 [00:00<00:00, 32823.97it/s]

coco 38592/38592 [00:00<00:00, 150037.89it/s]

densepose_coco.pkl 29586/29586 [00:00<00:00, 143833.04it/s]

Projecting 3D to 2D for an additional loss

compute_loss_with_3d_gt

    def compute_loss_with_3d_gt(self, inps, preds):
        losses = EasyDict()

        if inps.point_validity is None:
            inps.point_validity = tf.ones_like(preds.coords3d_abs[..., 0], dtype=tf.bool)

        diff = inps.coords3d_true - preds.coords3d_abs

        # CENTER-RELATIVE 3D LOSS
        # We now compute a "center-relative" error, which is either root-relative
        # (if there is a root joint present), or mean-relative (i.e. the mean is subtracted).
        meanrel_diff = tfu3d.center_relative_pose(
            diff, joint_validity_mask=inps.point_validity, center_is_mean=True)
        # root_index is a [batch_size] int tensor that holds which one is the root
        # diff is [batch_size, joint_count, 3]
        # we now need to select the root joint from each batch element
        if inps.root_index.shape.ndims == 0:
            inps.root_index = tf.fill(tf.shape(diff)[:1], inps.root_index)

        # diff has shape N,P,3 for batch, point, coord
        # and root_index has shape N
        # and we want to select the root joint from each batch element
        sanitized_root_index = tf.where(
            inps.root_index == -1, tf.zeros_like(inps.root_index), inps.root_index)
        root_diff = tf.expand_dims(tf.gather_nd(diff, tf.stack(
            [tf.range(tf.shape(diff)[0]), sanitized_root_index], axis=1)), axis=1)

        rootrel_diff = diff - root_diff
        # Some batch elements have no root joint; this is marked by root_index == -1.
        center_relative_diff = tf.where(
            inps.root_index[:, tf.newaxis, tf.newaxis] == -1, meanrel_diff, rootrel_diff)

        losses.loss3d = tfu.reduce_mean_masked(
            self.my_norm(center_relative_diff, preds.uncert), inps.point_validity)

        # ABSOLUTE 3D LOSS (camera-space)
        absdiff = tf.abs(diff)

        # Since the depth error will naturally scale linearly with distance, we scale the z-error
        # down to the level that we would get if the person was 5 m away.
        scale_factor_for_far = tf.minimum(
            np.float32(1), 5 / tf.abs(inps.coords3d_true[..., 2:]))
        absdiff_scaled = tf.concat(
            [absdiff[..., :2], absdiff[..., 2:] * scale_factor_for_far], axis=-1)

        # There are numerical difficulties for points too close to the camera, so we only
        # apply the absolute loss for points at least 30 cm away from the camera.
        is_far_enough = inps.coords3d_true[..., 2] > 0.3
        is_valid_and_far_enough = tf.logical_and(inps.point_validity, is_far_enough)

        # To make things simpler, we estimate one uncertainty and automatically
        # apply a factor of 4 to get the uncertainty for the absolute prediction.
        # This is just an approximation, but it works well enough.
        # The uncertainty does not need to be perfect, it merely serves as a
        # self-gating mechanism, and the actual value of it is less important
        # compared to the relative values between different points.
        losses.loss3d_abs = tfu.reduce_mean_masked(
            self.my_norm(absdiff_scaled, preds.uncert * 4.),
            is_valid_and_far_enough)

        # 2D PROJECTION LOSS (pixel-space)
        # We also compute a loss in pixel space to encourage good image-alignment in the model.
        coords2d_pred = tfu3d.project_pose(preds.coords3d_abs, inps.intrinsics)
        coords2d_true = tfu3d.project_pose(inps.coords3d_true, inps.intrinsics)

        # Balance factor which considers the 2D image size equivalent to the 3D box size of the
        # volumetric heatmap. This is just a factor to get a rough ballpark.
        # It could be tuned further.
        scale_2d = 1 / FLAGS.proc_side * FLAGS.box_size_m

        # We only use the 2D loss for points that are in front of the camera and aren't
        # very far out of the field of view. It's not a problem that the point is outside
        # to a certain extent, because this will provide training signal to move points which
        # are outside the image, toward the image border. Therefore those point predictions
        # will gather up near the border and we can mask them out when doing the absolute
        # reconstruction.
        is_in_fov_pred = tf.logical_and(
            tfu3d.is_within_fov(coords2d_pred, border_factor=-20 * (FLAGS.proc_side / 256)),
            preds.coords3d_abs[..., 2] > 0.001)
        is_near_fov_true = tf.logical_and(
            tfu3d.is_within_fov(coords2d_true, border_factor=-20 * (FLAGS.proc_side / 256)),
            inps.coords3d_true[..., 2] > 0.001)
        losses.loss2d = tfu.reduce_mean_masked(
            self.my_norm((coords2d_true - coords2d_pred) * scale_2d, preds.uncert),
            tf.logical_and(
                is_valid_and_far_enough,
                tf.logical_and(is_in_fov_pred, is_near_fov_true)))

        return losses, tf.add_n([
            losses.loss3d,
            losses.loss2d,
            FLAGS.absloss_factor * self.stop_grad_before_step(
                losses.loss3d_abs, FLAGS.absloss_start_step)])
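The helpers my_norm and tfu.reduce_mean_masked are defined elsewhere in the codebase and are not shown above. As a rough mental model, here is a minimal sketch of what they plausibly compute, assuming my_norm is an uncertainty-weighted residual norm in the spirit of a Laplace negative log-likelihood (an assumption for illustration, not the actual implementation):

    import tensorflow as tf

    def my_norm_sketch(diff, uncert):
        # Hypothetical stand-in for self.my_norm (assumption, not the real code):
        # an uncertainty-weighted residual norm in the spirit of a Laplace
        # negative log-likelihood. A large predicted uncertainty downweights
        # the residual but is penalized by the log term, which is what lets
        # the uncertainty act as a self-gating mechanism.
        dist = tf.norm(diff, axis=-1)                 # [batch, points]
        return dist / uncert + tf.math.log(uncert)

    def reduce_mean_masked_sketch(x, mask):
        # Hypothetical stand-in for tfu.reduce_mean_masked: the mean of x over
        # the entries where mask is True, ignoring the rest.
        mask_f = tf.cast(mask, x.dtype)
        return tf.reduce_sum(x * mask_f) / tf.maximum(tf.reduce_sum(mask_f), 1.0)

Under this reading, the factor of 4 applied to preds.uncert in the absolute loss simply widens the assumed error distribution for the (harder) absolute prediction, without estimating a second uncertainty head.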

reconstruct_absolute

    def adjusted_train_counter(self):
        return self.train_counter // FLAGS.grad_accum_steps

    def reconstruct_absolute(
            self, head2d, head3d, intrinsics, mix_3d_inside_fov, point_validity_mask=None):
        return tf.cond(
            self.adjusted_train_counter() < 500,
            lambda: tfu3d.reconstruct_absolute(
                head2d, head3d, intrinsics, mix_3d_inside_fov=mix_3d_inside_fov,
                weak_perspective=True, point_validity_mask=point_validity_mask,
                border_factor1=1, border_factor2=0.55, mix_based_on_3d=False),
            lambda: tfu3d.reconstruct_absolute(
                head2d, head3d, intrinsics, mix_3d_inside_fov=mix_3d_inside_fov,
                weak_perspective=False, point_validity_mask=point_validity_mask,
                border_factor1=1, border_factor2=0.55, mix_based_on_3d=False))

1. Overview

This function chooses between two 3D reconstruction strategies based on the current training step (adjusted_train_counter divides the raw train_counter by FLAGS.grad_accum_steps, so the 500-step threshold counts optimizer updates rather than gradient-accumulation micro-steps):

  • Early training (first 500 steps): use weak-perspective projection, which simplifies the computation to stabilize training (see the sketch after this list).

  • Later training (after step 500): switch to full-perspective projection (weak_perspective=False) for more accurate reconstruction.
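As referenced above, here is an illustrative sketch of the weak-perspective idea (a hypothetical helper, not the actual tfu3d.reconstruct_absolute): normalize the 2D predictions with the intrinsics, then solve a small linear least-squares problem for the single translation that best aligns the root-relative 3D pose with the 2D observations.

    import numpy as np

    def reconstruct_absolute_weak_persp(coords2d, coords3d_rel, intrinsics):
        # Illustrative sketch (hypothetical helper, not tfu3d.reconstruct_absolute).
        # coords2d: [P, 2] pixel coordinates; coords3d_rel: [P, 3] root-relative
        # metric coordinates; intrinsics: [3, 3] camera matrix.
        # Normalize pixels to camera coordinates: (x - c) / f
        uv = (coords2d - intrinsics[:2, 2]) / np.diag(intrinsics)[:2]
        # Weak perspective: u * tz ≈ X + tx and v * tz ≈ Y + ty, with one shared
        # depth tz for all points. This is linear in t = (tx, ty, tz).
        n = len(uv)
        A = np.zeros((2 * n, 3))
        A[0::2, 0] = 1.0
        A[1::2, 1] = 1.0
        A[0::2, 2] = -uv[:, 0]
        A[1::2, 2] = -uv[:, 1]
        b = np.empty(2 * n)
        b[0::2] = -coords3d_rel[:, 0]
        b[1::2] = -coords3d_rel[:, 1]
        t, *_ = np.linalg.lstsq(A, b, rcond=None)
        return coords3d_rel + t  # absolute coordinates in camera space

Because every point shares one depth, this system stays well conditioned even when the early 3D predictions are noisy; the full-perspective variant effectively replaces the shared tz with per-point depths, which is more accurate but depends on better initial estimates.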


2. Parameters

| Parameter | Type / Range | Description |
| --- | --- | --- |
| head2d | Tensor | 2D coordinates predicted by the network (pixel space) |
| head3d | Tensor | 3D coordinates predicted by the network (offsets relative to the root joint; not yet aligned to the absolute camera frame) |
| intrinsics | Tensor | camera intrinsic matrix (used for the 3D-to-2D projection) |
| mix_3d_inside_fov | float in [0, 1] | weight for mixing the 3D prediction with the 2D back-projection for points inside the field of view (FOV) |
| point_validity_mask | Tensor (bool) | marks which points are valid (e.g., filters out occluded points or outliers) |
| weak_perspective | bool | whether to use weak-perspective projection (True: ignore depth variation; False: full perspective projection) |
| border_factor1/2 | float | control the margin around the FOV border (used to decide whether a point counts as inside the image) |
| mix_based_on_3d | bool | whether the mixing decision is based on the 3D coordinates (if False, possibly based on 2D confidence) |
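For reference, a hypothetical call site could look like this (all tensor names, shapes, and the mixing value are assumptions for illustration):

    # Hypothetical usage; names, shapes and values are illustrative assumptions.
    # head2d:     [batch, points, 2]  pixel coordinates from the 2D head
    # head3d:     [batch, points, 3]  root-relative metric coordinates from the 3D head
    # intrinsics: [batch, 3, 3]       camera matrices
    # validity:   [batch, points]     bool mask of usable points
    coords3d_abs = model.reconstruct_absolute(
        head2d, head3d, intrinsics,
        mix_3d_inside_fov=0.5,
        point_validity_mask=validity)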

3. Comparison of the two reconstruction modes

| Aspect | Early training (weak_perspective=True) | Later training (weak_perspective=False) |
| --- | --- | --- |
| Projection model | weak perspective (assumes depth variation within the subject is negligible) | full perspective (accounts for depth variation) |
| Computational cost | low (suits fast convergence early in training) | higher (suits fine-grained optimization) |
| Use case | rough pose alignment in the initial phase | high-accuracy reconstruction (e.g., refining joint details) |
| Stability | more robust to noise and poor initialization | relies on accurate initial predictions |
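The difference between the two projection models can be written down in a few lines (an illustrative sketch under the pinhole model; the actual math lives in tfu3d):

    import numpy as np

    def project_full_perspective(coords3d, f, c):
        # Each point is divided by its own depth Z, so perspective effects
        # within the subject are modeled.
        return f * coords3d[..., :2] / coords3d[..., 2:3] + c

    def project_weak_perspective(coords3d, f, c):
        # All points are divided by one shared reference depth (here the mean),
        # so depth variation within the subject is ignored.
        mean_z = coords3d[..., 2].mean(axis=-1, keepdims=True)[..., np.newaxis]
        return f * coords3d[..., :2] / mean_z + c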
