[Triton Notes 1] Core Concepts

2026/5/4 20:49:43

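Triton currently targets Linux and requires an NVIDIA GPU, typically with Compute Capability 7.0 or higher (Volta and later, e.g. V100 and the RTX 20/30/40 series). You can install it quickly with pip: `pip install triton`

Let's look at the first example from the official Triton tutorials, vector addition:

```python
import torch

import triton
import triton.language as tl


@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    output_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x: tl.tensor = tl.load(x_ptr + offsets, mask=mask)
    y: tl.tensor = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    # Write x + y back to DRAM.
    tl.store(output_ptr + offsets, output, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor):
    # We need to preallocate the output.
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    # The SPMD launch grid denotes the number of kernel instances that run in parallel.
    # It is analogous to CUDA launch grids. It can be either Tuple[int], or
    # Callable(metaparameters) -> Tuple[int].
    # In this case, we use a 1D grid where the size is the number of blocks:
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    # NOTE:
    #  - Each torch.tensor object is implicitly converted into a pointer to its first element.
    #  - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel.
    #  - Don't forget to pass meta-parameters as keyword arguments.
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    # We return a handle to `output` but, since `torch.cuda.synchronize()` hasn't been called,
    # the kernel is still running asynchronously at this point.
    return output


torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
output_torch = x + y
output_triton = add(x, y)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')
```

From a CUDA C++ point of view, Triton's core idea is to hide explicit thread-level control and expose SIMD-style (single instruction, multiple data) vectorized operations instead. The table below maps each piece of this code to its CUDA C++ counterpart:

| Triton code / concept | CUDA C++ counterpart | Notes |
| --- | --- | --- |
| `@triton.jit` | `__global__ void ...` | Defines a kernel that is launched on the GPU. |
| `x_ptr` | `float* x_ptr` | The pointer passed in; inside the kernel it is the base address of the memory region. |
| `BLOCK_SIZE: tl.constexpr` | `template <int BLOCK_SIZE>` | Compile-time constant, similar to a template parameter or macro in CUDA. |
| `tl.program_id(0)` | `blockIdx.x` | Each Triton program instance corresponds to one CUDA block. |
| `tl.arange(0, BLOCK_SIZE)` | (no direct counterpart) | Creates a local index sequence of length `BLOCK_SIZE`. |
| `tl.load` / `tl.store` | memory load/store | The Triton compiler tries to turn these into aligned, vectorized, coalesced accesses. |

triton.jit

`@triton.jit` is ordinary Python decorator syntax. We won't dig into the syntax here; what matters is what it does.

For developers used to the C++ build flow (preprocess -> compile -> assemble -> link), `triton.jit` is the core engine of Triton: it turns Python into a real front end for a hardware compiler. In short, it intercepts your Python function, converts it into an intermediate representation (IR), and then just-in-time compiles it to binary code for the GPU.

When you run a function decorated with `@triton.jit`, what happens is not plain Python interpretation but a full compilation chain:

1. AST parsing: Triton walks the Python abstract syntax tree (AST) of the function.
2. TTIR (Triton Intermediate Representation) generation: the Python logic is converted into Triton's own IR. This step is hardware-independent.
3. LLVM IR generation: TTIR is lowered (in current Triton versions through a GPU-specific TTGIR stage) to LLVM IR, where general compiler optimizations such as dead-code elimination and loop unrolling are applied.
4. PTX / CUBIN generation: low-level assembly (PTX) and a binary image (CUBIN) are produced for your NVIDIA GPU architecture (e.g. `sm_86` for an RTX 3090).

Why a decorator? The power of dynamic compilation

Compared with the AOT (ahead-of-time) compilation of CUDA C++, this JIT approach is far more flexible:

- Specialization: the kernel is compiled into different binaries depending on the constant parameters you pass (such as `BLOCK_SIZE`) and the data types (such as fp32 vs. fp16). Calling `add_kernel[grid](..., BLOCK_SIZE=256)` compiles one version; calling `add_kernel[grid](..., BLOCK_SIZE=1024)` compiles another, separately optimized version.
- No parameter-passing overhead: because `BLOCK_SIZE` is a `tl.constexpr`, the compiler hard-codes it into the instructions at compile time, so it does not have to be read from a register at run time.

The "superpowers" held by the decorator

Once a function is decorated, the original Python function object is wrapped in a `JITFunction` object. Together with the compiled-kernel handle that a launch returns, it gives developers several things they tend to like. (The original notes listed `.get_ir()` and `.get_ptx()`; in the Triton releases I know of, the generated code is exposed through the `asm` dictionary of the compiled-kernel handle instead, and the exact names have changed between versions.)

| Attribute / mechanism | Purpose |
| --- | --- |
| `asm['ttir']` | Inspect the generated intermediate representation (TTIR), comparable to looking at a C++ compiler's intermediate output. |
| `asm['ptx']` | The most direct one: print the PTX and check whether the generated instructions are what you wanted (for example, whether `ld.global.nc` is used). |
| Cache | Triton caches the compiled `.cubin` on disk (usually under `~/.triton/cache`). The next run with the same parameters starts immediately, with no recompilation. |

Why can't you write ordinary Python inside?

Because `triton.jit` translates the code into GPU instructions, you cannot call non-Triton Python libraries inside a decorated function:

- ❌ Not allowed: `import numpy; numpy.sin(x)`. CPU libraries cannot run inside a GPU kernel.
- ❌ Not allowed: using `list` or `dict`. These high-level data structures do not exist in GPU memory.
- ✅ Allowed: functions provided by `triton.language` (i.e. `tl`). Each of them has a hardware mapping.

Summary: triton.jit from a CUDA perspective

You can think of it as "a C++ template with automatic JIT triggering".

```cpp
// Its effect is equivalent to:
template <typename T, int BLOCK_SIZE>
__global__ void kernel(T* ptr) { ... }

// When you call it from Python:
//   kernel[grid](tensor, BLOCK_SIZE=512)
// this roughly means:
//   1. Check whether an instance kernel<float, 512> already exists.
//   2. If not, compile one on the spot (with Triton's own compiler pipeline, not nvcc).
//   3. Run it.
```

This design gives you both the fast iteration of Python (change a parameter and rerun) and C++-level execution efficiency.

triton.language.tensor (abbreviated tl.tensor below)

In Triton, `tl.tensor` is a logical abstraction. It represents a multi-dimensional block of data held in registers.

For developers used to C++ and CUDA, the key to understanding `tl.tensor` is that it is not a pointer into device memory but a set of data values that have already been loaded into the compute units. Its core properties are:

It is block-based

In CUDA C++ your basic operand is usually a `float` scalar. In Triton the basic operand is a `tl.tensor`.

- Property: each dimension must have a power-of-two size (128, 256, 512, ...). This matches the SIMD/SIMT architecture of the GPU.
- C++ analogy: think of it as a `std::array<float, 1024>`, except that at the hardware level it is spread across the registers of the threads in the block.

It is implicitly parallel

When you add two `tl.tensor` values:

```python
output = x + y  # x and y are both tl.tensor
```

this line is parallelized automatically on the hardware. The Triton compiler distributes the operation over all threads of the program (i.e. the CTA).

- Property: you write neither a `for` loop nor `threadIdx`. All arithmetic (`+`, `-`, `*`, `/`, `tl.exp`, `tl.dot`) applies to the whole tensor block at once.

Type conversion and precision control

As an AI infra developer you certainly care about precision (FP32, FP16, BF16). `tl.tensor` supports convenient type conversion, which is essential for writing high-performance kernels such as mixed-precision operators.

```python
x = tl.load(ptr)           # assume FP32 is read
x_half = x.to(tl.float16)  # similar to static_cast<half>(x)
```

Broadcasting

`tl.tensor` follows NumPy-like broadcasting rules. This is very handy for bias terms and normalization.

- Example: if you add a [128, 128] matrix tensor and a [128, 1] column-vector tensor, Triton handles the broadcast at the register level automatically.

It is addressless

This is the point C++ developers most often get confused about:

- `x_ptr` is an address.
- `x = tl.load(x_ptr)`: after this line, `x` is a `tl.tensor`. It has left device memory (DRAM) and now lives in registers.
- Property: you cannot take the address of `x`, because it is no longer something in memory.

Static shape constraint

Although Python is dynamic, the shape of a `tl.tensor` must be known at compile time (during JIT), i.e. it must be a `tl.constexpr`.

- If `BLOCK_SIZE` is 1024, the tensor produced by `tl.arange(0, BLOCK_SIZE)` has a fixed shape.
- This lets the compiler generate very lean assembly with no run-time shape checks.

triton.language.constexpr

In C++/CUDA terms, `tl.constexpr` corresponds not to a variable but to a template parameter or a macro. `tl.constexpr` tells the Triton compiler: "this value is already a known constant at compile time."

- Compile-time evaluation: when the JIT compiler processes a `@triton.jit` function, it generates a different hardware binary (CUBIN) for each distinct `constexpr` value.
- Dead-code elimination: if your code contains a branch such as `if BLOCK_SIZE > 512:` and the `constexpr` passed in is 256, the compiler removes that branch when generating PTX, so there is no run-time `if` cost at all.
- Basis for register allocation: the compiler must know the concrete value (e.g. 128 or 256) before it can decide how many registers a thread block will use and how to allocate shared memory.

```cpp
// Triton:   BLOCK_SIZE: tl.constexpr
// CUDA C++ counterpart:
template <int BLOCK_SIZE>
__global__ void kernel(...) { ... }
```

What size does BLOCK_SIZE specify?

In typical Triton usage, `BLOCK_SIZE` is the number of elements that one program (i.e. one CUDA CTA/block) processes at a time. This "size" can be looked at on three levels:

A. Logical level: tile size

It defines the length of your "super vector". With `BLOCK_SIZE = 1024`, `tl.arange(0, BLOCK_SIZE)` produces an index vector of length 1024, so the following `tl.load`, `tl.store` and arithmetic all operate on groups of 1024 elements.

B. Memory level: memory coalescing

It determines the granularity of memory access. GPU memory bandwidth is used most efficiently with contiguous accesses. `BLOCK_SIZE` is usually set to 128 or more so that the Triton compiler can issue wide vector loads (e.g. `LDG.E.128`) that line up well with the memory bus.

C. Hardware mapping level (the most important one)

Although you specify a number of elements, it indirectly determines the per-thread workload.

- Deriving the thread count: Triton has another launch parameter, `num_warps` (default 4; 4 or 8 are typical).
- If `BLOCK_SIZE = 1024` and `num_warps = 8`:
  - Total threads: $8 \times 32 = 256$ threads.
  - Load per thread: $1024 / 256 = 4$ elements, i.e. each thread holds 4 elements in its registers.

triton.language.program_id

In terms of low-level semantics, `tl.program_id(axis)` is almost exactly CUDA's `blockIdx`.

```python
@builtin
def program_id(axis, _semantic=None):
    """
    Returns the id of the current program instance along the given :code:`axis`.

    :param axis: The axis of the 3D launch grid. Must be 0, 1 or 2.
    :type axis: int
    """
    # if axis == -1:
    #     pid0 = _semantic.program_id(0)
    #     pid1 = _semantic.program_id(1)
    #     pid2 = _semantic.program_id(2)
    #     npg0 = _semantic.num_programs(0)
    #     npg1 = _semantic.num_programs(1)
    #     return pid0 + pid1*npg0 + pid2*npg0*npg1
    axis = _unwrap_if_constexpr(axis)
    return _semantic.program_id(axis)
```

| Triton | CUDA C++ | Notes |
| --- | --- | --- |
| `tl.program_id(0)` | `blockIdx.x` | ID of the current program instance along the X axis |
| `tl.program_id(1)` | `blockIdx.y` | ID of the current program instance along the Y axis |
| `tl.program_id(2)` | `blockIdx.z` | ID of the current program instance along the Z axis |
| `tl.num_programs(0)` | `gridDim.x` | Total number of program instances along that axis |

The commented-out `axis == -1` branch shows how a global, linearized block index for the CTA would be computed (`pid0 + pid1*npg0 + pid2*npg0*npg1`). Since that branch is commented out in the source quoted here, `axis` must be 0, 1 or 2; if you need the linear index, compute it yourself with the same formula.

triton.language.arange

```python
@builtin
def arange(start, end, _semantic=None):
    start = _unwrap_if_constexpr(start)
    end = _unwrap_if_constexpr(end)
    return _semantic.arange(start, end)


arange.__doc__ = f"""
    Returns contiguous values within the half-open interval :code:`[start, end)`.
    :code:`end - start` must be less than or equal to
    :code:`TRITON_MAX_TENSOR_NUMEL = {TRITON_MAX_TENSOR_NUMEL}`

    :param start: Start of the interval. Must be a power of two.
    :type start: int32
    :param end: End of the interval. Must be a power of two greater than :code:`start`.
    :type end: int32
"""
```

| Aspect | Explanation |
| --- | --- |
| Purpose | Generates, in registers, a 1D tensor of consecutive integers. This tensor is normally used as a "base index"; an offset is added to it later to address memory. |
| Parameter `start` | Start of the interval (inclusive). In practice it is usually 0. Note that the docstring asks for a power of two; since 0 is the usual start, the constraint that matters in practice is that the length `end - start` is a power of two. |
| Parameter `end` | End of the interval (exclusive). It determines the length of the generated tensor, i.e. `BLOCK_SIZE`. The docstring requires a power of two greater than `start`. |
| Return value | A `tl.tensor` of shape `(end - start,)`, usually of type `int32`. |

As discussed above, a `tl.tensor` is a vector assembled from the registers of the individual threads. So in the vector returned by `arange`, each thread's registers hold the indices, within the `(end - start)` range, of the elements that this thread will process. By the same reasoning, one thread may hold several element indices.

triton.language.load

```python
@builtin
def load(pointer, mask=None, other=None, boundary_check=(), padding_option="", cache_modifier="", eviction_policy="",
         volatile=False, _semantic=None):
    """
    Return a tensor of data whose values are loaded from memory at location defined by `pointer`:

        (1) If `pointer` is a single element pointer, a scalar is loaded.  In this case:
            - `mask` and `other` must also be scalars,
            - `other` is implicitly typecast to `pointer.dtype.element_ty`, and
            - `boundary_check` and `padding_option` must be empty.

        (2) If `pointer` is an N-dimensional tensor of pointers, an N-dimensional tensor is loaded.  In this case:
            - `mask` and `other` are implicitly broadcast to `pointer.shape`,
            - `other` is implicitly typecast to `pointer.dtype.element_ty`, and
            - `boundary_check` and `padding_option` must be empty.

        (3) If `pointer` is a block pointer defined by `make_block_ptr`, a tensor is loaded.  In this case:
            - `mask` and `other` must be `None`, and
            - `boundary_check` and `padding_option` can be specified to control the behavior of out-of-bound access.

    :param pointer: Pointer to the data to be loaded
    :type pointer: `triton.PointerType`, or block of `dtype=triton.PointerType`
    :param mask: if `mask[idx]` is false, do not load the data at address `pointer[idx]`
        (must be `None` with block pointers)
    :type mask: Block of `triton.int1`, optional
    :param other: if `mask[idx]` is false, return `other[idx]`
    :type other: Block, optional
    :param boundary_check: tuple of integers, indicating the dimensions which should do the boundary check
    :type boundary_check: tuple of ints, optional
    :param padding_option: should be one of {"", "zero", "nan"}, the padding value to use while out of bounds.
        "" means an undefined value.
    :param cache_modifier: changes cache option in NVIDIA PTX
    :type cache_modifier: str, optional, should be one of {"", ".ca", ".cg", ".cv"}, where ".ca" stands for
        cache at all levels, ".cg" stands for cache at global level (cache in L2 and below, not L1),
        and ".cv" means don’t cache and fetch again. see
        `cache operator <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators>`_
        for more details.
    :param eviction_policy: changes eviction policy in NVIDIA PTX
    :type eviction_policy: str, optional
    :param volatile: changes volatile option in NVIDIA PTX
    :type volatile: bool, optional
    """
    # `mask` and `other` can be constexpr
    mask = _unwrap_if_constexpr(mask)
    other = _unwrap_if_constexpr(other)
    if mask is not None:
        mask = _semantic.to_tensor(mask)
    if other is not None:
        other = _semantic.to_tensor(other)
    padding_option = _unwrap_if_constexpr(padding_option)
    cache_modifier = _unwrap_if_constexpr(cache_modifier)
    eviction_policy = _unwrap_if_constexpr(eviction_policy)
    volatile = _unwrap_if_constexpr(volatile)
    return _semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
                          volatile)
```

1. The three load modes

The cases (1), (2) and (3) in the docstring correspond to three usages, from low level to high level:

| Mode | Description | CUDA counterpart | Typical use |
| --- | --- | --- | --- |
| (1) Scalar load | `pointer` is a single address | `*ptr` (direct dereference) | Loading a global setting or a single threshold. |
| (2) Pointer-tensor load | `pointer` is a set of addresses (an N-dimensional tensor) | Vectorized load (`LDG`) | The most common mode. Combined with `arange` it implements contiguous or non-contiguous vector loads. |
| (3) Block-pointer load | A block defined with `make_block_ptr` | Can map to TMA (Tensor Memory Accelerator) on GPUs that have it | Advanced mode aimed at matrix operations; out-of-bounds handling is controlled with `boundary_check`. |

`add_kernel` uses mode (2): `x_ptr + offsets` produces a "pointer tensor", and `tl.load` then fetches the values at all of those addresses in one go.

2. Core control parameters (safety and logic)

- `mask`:
  - Purpose: decides which positions are loaded. If `mask[idx]` is `False`, the corresponding memory address is not accessed.
  - Why it matters: this is a safety issue, not just a logic issue. In CUDA, reading `ptr[N+1]` past the end of a buffer is undefined behavior and can trigger an illegal memory access error; in Triton we rely on `mask` to block such accesses.
- `other` (fill value):
  - Purpose: what the register holds at a position where `mask` is `False`.
  - Typical use: fill out-of-range positions with `0` for `tl.sum`, and with `-inf` for `tl.max`.

3. Expert-level performance switches (mapping to NVIDIA PTX features)

These parameters expose Triton's nature as a GPU compiler; they directly change the generated PTX instructions:

- `cache_modifier` (cache policy):
  - `.ca`: cache at all levels (L1 and L2).
  - `.cg`: cache at global level only, skipping L1 and going straight to L2. When you know the data will not be reused, this helps avoid polluting L1.
  - `.cv`: do not cache (volatile); fetch from memory again every time.
- `eviction_policy`: controls the priority with which cache lines are evicted. For example, `evict_first` or `evict_last` can be used to tune complex streaming workloads.
- `volatile`: similar to the C++ `volatile` keyword. It makes sure the access is actually performed rather than served from a stale value kept in a register or cache. It is not a synchronization primitive by itself.

triton.language.store

```python
@_tensor_member_fn
@builtin
def store(pointer, value, mask=None, boundary_check=(), cache_modifier="", eviction_policy="", _semantic=None):
    """
    Store a tensor of data into memory locations defined by `pointer`.

        (1) If `pointer` is a single element pointer, a scalar is stored.  In this case:
            - `mask` must also be scalar, and
            - `boundary_check` and `padding_option` must be empty.

        (2) If `pointer` is an N-dimensional tensor of pointers, an N-dimensional block is stored.  In this case:
            - `mask` is implicitly broadcast to `pointer.shape`, and
            - `boundary_check` must be empty.

        (3) If `pointer` is a block pointer defined by `make_block_ptr`, a block of data is stored.  In this case:
            - `mask` must be None, and
            - `boundary_check` can be specified to control the behavior of out-of-bound access.

    `value` is implicitly broadcast to `pointer.shape` and typecast to `pointer.dtype.element_ty`.

    :param pointer: The memory location where the elements of `value` are stored
    :type pointer: `triton.PointerType`, or block of `dtype=triton.PointerType`
    :param value: The tensor of elements to be stored
    :type value: Block
    :param mask: If `mask[idx]` is false, do not store `value[idx]` at `pointer[idx]`
    :type mask: Block of triton.int1, optional
    :param boundary_check: tuple of integers, indicating the dimensions which should do the boundary check
    :type boundary_check: tuple of ints, optional
    :param cache_modifier: changes cache option in NVIDIA PTX
    :type cache_modifier: str, optional, should be one of {"", ".wb", ".cg", ".cs", ".wt"}, where ".wb" stands for
        cache write-back all coherent levels, ".cg" stands for cache global, ".cs" stands for cache streaming,
        ".wt" stands for cache write-through, see
        `cache operator <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators>`_
        for more details.
    :param eviction_policy: changes eviction policy in NVIDIA PTX
    :type eviction_policy: str, optional, should be one of {"", "evict_first", "evict_last"}
    """
    # `value` can be constexpr
    value = _semantic.to_tensor(value)
    mask = _unwrap_if_constexpr(mask)
    if mask is not None:
        mask = _semantic.to_tensor(mask)
    cache_modifier = _unwrap_if_constexpr(cache_modifier)
    eviction_policy = _unwrap_if_constexpr(eviction_policy)
    return _semantic.store(pointer, value, mask, boundary_check, cache_modifier, eviction_policy)
```

1. The three store modes (one-to-one with load)

| Mode | Description | Hardware behavior |
| --- | --- | --- |
| (1) Scalar store | `pointer` is a single address | Writes one value to a specific address. |
| (2) Pointer-tensor store | `pointer` is a tensor of addresses | Vectorized store. This is the most common mode; together with `offsets` it writes a whole block of register data back. |
| (3) Block-pointer store | Used with `make_block_ptr` | The write-back can go through the TMA (Tensor Memory Accelerator) unit on GPUs that have it, with 2D boundaries handled via `boundary_check`. |

2. The "guard" for writes: mask

- Purpose: in CUDA, writing to `ptr[N+1]` past the end of a buffer may crash the program, or fail silently and corrupt data.
- Triton logic: if `mask[idx]` is `False`, the corresponding memory address is never written.

3. Write-back policy: cache_modifier (the core of performance tuning)

`tl.store` exposes PTX-level control:

- `.wb` (write-back): the default mode. Data is written to the cache (L1/L2) first and written back to device memory when the cache line is replaced.
- `.cg` (cache global): skips L1 and writes straight to L2. If you are sure the output will not be read again right away within the current kernel, this protects L1 from being "polluted".
- `.cs` (cache streaming): streaming cache. It marks the data as likely to be discarded soon so that L2 reclaims the space first.
- `.wt` (write-through): data is written to both cache and memory. This is typically used where strong visibility is needed and is expensive.

As a developer used to C++, keep the following in mind when using `tl.store`:

- Atomicity: `tl.store` itself is not atomic. If two programs (CTAs) write to the same address, the result is undefined (a race condition). In that case what you need is `tl.atomic_add`.
- Coalescing: as with `load`, if the addresses in `pointer` are contiguous (e.g. generated from `arange`), the hardware can coalesce the writes and bandwidth utilization is highest.
- Type matching: `value` is implicitly cast to the type that `pointer` points to. If your pointer is `fp16` but you write an `fp32` result, precision is truncated here.

Launch

In CUDA we usually pass an integer directly as `gridSize`. In Triton, `grid` is often "a function whose result is only computed at the moment of launch".

1. What is meta?

In Triton, `meta` is the meta-parameters dictionary.

When you call `add_kernel[grid](..., BLOCK_SIZE=1024)`, the keyword argument `BLOCK_SIZE=1024` in the call parentheses is not only a constant passed to the kernel; Triton also collects it into a Python dictionary that is handed to `grid` as `meta`. For the purposes of this example, `meta` is effectively `{'BLOCK_SIZE': 1024}`.

What a grid function normally reads from `meta` are the meta-parameters: arguments passed explicitly as `KEY=VALUE` keywords, or arguments given a special role in the signature such as `tl.constexpr`. (In recent Triton releases the dictionary appears to be built from all bound kernel arguments by name, so ordinary arguments may show up in it as well; treat that as version-dependent.)

In Triton, `: tl.constexpr` in the function signature plays the role of `template <int BLOCK_SIZE>` in C++. When you execute `add_kernel[grid](..., BLOCK_SIZE=1024)`:

- Triton checks the signature and finds that `BLOCK_SIZE` is annotated as `tl.constexpr`.
- Every parameter annotated as `constexpr`, however you pass it, is treated as a meta-parameter.
- Keyword arguments: the launcher makes everything passed in `KEY=VALUE` form available in the `meta` dictionary for the `grid` function.

In the example:

`grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )`

you will notice that `n_elements` is not written as `meta['n_elements']`; the variable name is used directly.

- This is Python's closure feature: `n_elements` is defined in the scope of the `add` function, and the `lambda` can "see" and "capture" variables from the enclosing scope.
- So `n_elements` does not need to travel through `meta`; the `lambda` already has it in its pocket.
- `BLOCK_SIZE` is different: it is only fixed at the moment `add_kernel` is called (and may even be chosen dynamically by the autotuner), so it has to be delivered through `meta`, the "official messenger".

2. Why a lambda instead of a plain number?

To get "late binding" and autotuning.

- CUDA way (static): you first compute `int gridSize = (N + 1023) / 1024;` outside and then pass it in `<<<gridSize, ...>>>`.
- Triton way (deferred): Triton lets `grid` depend on the `tl.constexpr` values of the kernel. `triton.cdiv(n_elements, meta['BLOCK_SIZE'])` means: "look up the current `BLOCK_SIZE` in the meta dictionary, then divide the total element count by it (rounding up) to get the number of program instances I need."

The benefit: if you later use `triton.autotune` to pick the fastest `BLOCK_SIZE` among 512, 1024 and 2048, this `grid` expression does not need to change. It computes the correct grid size for each `BLOCK_SIZE` automatically.

3. Why is the return value a tuple (..., )?

This corresponds to CUDA's `dim3(x, y, z)`.

- A Triton grid can be 1D, 2D or 3D.
- Even a 1D grid must be returned as a tuple (in Python a single-element tuple needs the trailing comma, as in `(val, )`).

A 2D grid for a matrix kernel would be written as:

`grid = lambda meta: (triton.cdiv(M, meta['BLOCK_SIZE_M']), triton.cdiv(N, meta['BLOCK_SIZE_N']))`

4. The launch flow, step by step

When you execute `add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)`, the following happens behind the scenes:

1. Collect meta-parameters: Triton sees `BLOCK_SIZE=1024`.
2. Call the lambda: Triton calls `grid({'BLOCK_SIZE': 1024})`.
3. Compute the grid: the lambda returns `(97, )`, since $\lceil 98432 / 1024 \rceil = 97$.
4. Find or compile the kernel: Triton looks for a binary compiled specifically for `BLOCK_SIZE=1024`.
5. Launch: Triton calls the CUDA Driver API and starts 97 blocks.

5. An equivalent rewrite for C++ readers

If the `lambda` feels too indirect, this code is logically equivalent to:

```python
# 1. Define an ordinary function that computes the grid
def calculate_grid(meta):
    # meta is a dictionary passed in by Triton
    num_blocks = triton.cdiv(n_elements, meta['BLOCK_SIZE'])
    return (num_blocks, )

# 2. Pass it to the launcher
add_kernel[calculate_grid](x, y, output, n_elements, BLOCK_SIZE=1024)
```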
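To make the deferred grid computation from the launch section concrete, here is a minimal pure-Python sketch that needs neither Triton nor a GPU. `cdiv` mirrors what `triton.cdiv` computes, while `resolve_grid` is a hypothetical stand-in for the launcher step that evaluates `grid`; it is not Triton's actual implementation.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def resolve_grid(grid, **meta):
    """Hypothetical helper: evaluate a grid the way a launcher might.

    A callable grid is called with the meta-parameter dictionary;
    a static tuple is used as-is.
    """
    return grid(meta) if callable(grid) else tuple(grid)


n_elements = 98432  # captured by the lambda through the closure, not via meta
grid = lambda meta: (cdiv(n_elements, meta["BLOCK_SIZE"]),)

# The same grid expression works unchanged for every candidate BLOCK_SIZE,
# which is what makes it usable together with autotuning.
grids = {bs: resolve_grid(grid, BLOCK_SIZE=bs) for bs in (512, 1024, 2048)}
print(grids)  # {512: (193,), 1024: (97,), 2048: (49,)}
```

Note that 98432 is not a multiple of 1024, so the 97th program covers only 128 valid elements; that tail is what the `mask` in `add_kernel` guards.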
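The SPMD model behind `add_kernel` (one program instance per block, `program_id` plus `arange` for the offsets, and a mask for the tail) can also be emulated with plain Python lists. The sketch below only illustrates the semantics: the function names are made up for this example, and the program instances run one after another here, whereas the GPU runs them in parallel.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def add_program(pid, x, y, out, n_elements, block_size):
    """Emulate one program instance of add_kernel on Python lists."""
    block_start = pid * block_size
    # Counterpart of block_start + tl.arange(0, BLOCK_SIZE)
    offsets = [block_start + i for i in range(block_size)]
    # Counterpart of mask = offsets < n_elements
    mask = [off < n_elements for off in offsets]
    for off, keep in zip(offsets, mask):
        if keep:  # masked-off lanes neither read nor write memory
            out[off] = x[off] + y[off]


def emulated_add(x, y, block_size=8):
    """Launch cdiv(n, block_size) program instances, one per block."""
    n_elements = len(x)
    out = [0.0] * n_elements
    for pid in range(cdiv(n_elements, block_size)):
        add_program(pid, x, y, out, n_elements, block_size)
    return out


x = [float(i) for i in range(20)]
y = [10.0] * 20
result = emulated_add(x, y)  # 3 programs; the last one handles only 4 elements
print(result[:3], result[-1])
```

With 20 elements and a block size of 8, the third program computes offsets 16 to 23 but its mask keeps only 16 to 19, so nothing outside the buffer is touched.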

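The role of `other` in `tl.load` is easiest to see with a reduction. The following pure-Python sketch uses a hypothetical `masked_load` helper that imitates the documented behaviour (return `other` wherever the mask is false) to show why the fill value has to be the neutral element of the reduction that follows.

```python
import math


def masked_load(memory, offsets, mask, other):
    """Imitate tl.load(ptr + offsets, mask=mask, other=other) on a list.

    Masked-off positions are never read from memory; they yield `other`.
    """
    return [memory[off] if keep else other for off, keep in zip(offsets, mask)]


data = [-3.0, -1.0, -7.5]   # only 3 valid elements
BLOCK_SIZE = 4              # the block is one element wider than the data
offsets = list(range(BLOCK_SIZE))
mask = [off < len(data) for off in offsets]

block_for_sum = masked_load(data, offsets, mask, other=0.0)
block_for_max = masked_load(data, offsets, mask, other=-math.inf)

print(sum(block_for_sum))   # -11.5: padding with 0 does not change a sum
print(max(block_for_max))   # -1.0: padding with -inf does not change a max

# Using the sum's fill value for a max gives a wrong answer on negative data.
wrong_max = max(masked_load(data, offsets, mask, other=0.0))
print(wrong_max)            # 0.0, which is not an element of data
```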